Chapter 7. Getting smart with MLlib

This chapter covers

Machine-learning basics
Performing linear algebra in Spark
Scaling and normalizing features
Training and applying a linear regression model
Evaluating the model’s performance
Using regularization
Optimizing linear regression

Machine learning is a scientific discipline that studies the use and development of algorithms that let computers accomplish complicated tasks without being explicitly programmed to do so. That is, the algorithms eventually learn how they can solve a given task. These algorithms include methods and techniques from statistics, probability, and information theory.

Today, machine learning is ubiquitous. Examples include online stores that offer you similar items that other users have viewed or bought, email clients that automatically move emails to spam, advances in autonomous driving recently developed by several car manufacturers, and speech and video recognition. It’s also becoming a big part of online business: finding hidden relationships in user habits and actions (and learning from them) can bring critical added value to existing products and services.

But with the advent of companies handling huge amounts of data (known as big data), more scalable machine-learning packages are needed. Spark provides distributed and scalable implementations of various machine-learning algorithms and makes it possible to handle those continuously growing datasets.^[1]

7.1. Introduction to machine learning

7.2. Linear algebra in Spark

7.3. Linear regression

7.4. Analyzing and preparing the data

7.5. Fitting and using a linear regression model

7.6. Tweaking the algorithm

7.7. Optimizing linear regression

7.8. Summary