Chapter 8. ML: classification and clustering

This chapter covers

The Spark ML library
Logistic regression
Decision trees and random forests
K-means clustering

In the previous chapter, you got acquainted with Spark MLlib (Spark’s machine learning library), with machine learning in general, and with linear regression, the most important method of regression analysis. In this chapter, we’ll cover two equally important fields in machine learning: classification and clustering.

Classification is a form of supervised machine learning in which the target variable is categorical, meaning it takes only a limited set of values. The task of classification is therefore to assign each input example to one of several classes. Recognizing handwritten letters is a classification problem, for example, because each input image needs to be labeled as one of the letters in an alphabet. Recognizing an illness a patient may have, based on their symptoms, is a similar problem.
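To make the idea of learning from labeled examples concrete, here is a minimal sketch in plain Python. It is not Spark code and not one of the algorithms covered in this chapter: it uses a simple nearest-centroid rule, and the function names (fit_centroids, classify) and the tiny symptom data set are made up for illustration only.

```python
from collections import defaultdict
from math import dist


def fit_centroids(examples):
    """Average the feature vectors of each class.

    `examples` is a list of (features, label) pairs, where features is a tuple of numbers.
    """
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def classify(centroids, features):
    """Return the label whose centroid lies closest to the given features."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Made-up labeled training data: (body temperature in C, cough severity 0-10) -> label
training = [
    ((39.0, 8.0), "flu"),
    ((38.5, 7.0), "flu"),
    ((36.8, 2.0), "cold"),
    ((37.0, 3.0), "cold"),
]

centroids = fit_centroids(training)
prediction = classify(centroids, (38.8, 7.5))
print(prediction)
```

The key point is that every training example carries a label, so the model can learn what each class looks like before it is asked to label a new example.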

Clustering also groups input data into classes (called clusters), but as an unsupervised learning method, it has no properly labeled data to learn from and has to figure out on its own what constitutes a cluster. You could, for example, use clustering for grouping clients by their habits or characteristics (client segmentation) or recognizing different topics in news articles (text categorization).
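The contrast with classification can be sketched in a few lines of plain Python. The following is a bare-bones k-means loop, shown only to illustrate grouping without labels; it is not Spark's implementation, and the kmeans function, the client data, and the choice of starting centers are assumptions made for this example.

```python
from math import dist


def kmeans(points, centers, iterations=10):
    """Plain k-means: alternate between assigning points and recomputing centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for point in points:
            nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
            clusters[nearest].append(point)
        # Update step: move each center to the mean of its members (unchanged if empty).
        centers = [
            tuple(sum(column) / len(members) for column in zip(*members)) if members else center
            for members, center in zip(clusters, centers)
        ]
    return centers, clusters


# Made-up unlabeled client data: (visits per month, average spend)
clients = [
    (1.0, 20.0), (2.0, 25.0), (1.5, 22.0),
    (10.0, 200.0), (11.0, 210.0), (9.0, 190.0),
]

# Start from two of the points; real implementations choose starting centers more carefully.
centers, clusters = kmeans(clients, centers=[clients[0], clients[3]])
for center, members in zip(centers, clusters):
    print(center, members)
```

Note that no labels appear anywhere in the input: the two client segments emerge only from the distances between the points.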
