Part 2. Meet the Spark family

It’s time to get to know the other components that make up Spark: Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX. You already made a brief acquaintance with Spark SQL in chapter 3. In chapter 5, you’ll be formally introduced. You’ll learn how to create and use DataFrames, how to use SQL to query DataFrame data, and how to load data from and save it to external data sources. You’ll also learn about the optimizations performed by Spark SQL’s Catalyst optimization engine and about the performance improvements introduced with the Tungsten project.

Spark Streaming, one of the more popular family members, is introduced in chapter 6. There you’ll learn about discretized streams, which periodically produce RDDs as the streaming application is running. You’ll also learn how to save computation state over time and how to use window operations. We’ll examine ways of connecting to Kafka and how to obtain good performance from your streaming jobs.

Chapters 7 and 8 are about machine learning, specifically about the Spark MLlib and Spark ML parts of the Spark API. You’ll learn about machine learning in general and about linear regression, logistic regression, decision trees, random forests, and k-means clustering. Along the way, you’ll scale and normalize features, use regularization, and train and evaluate machine-learning models. We’ll also explain the API standardizations brought by Spark ML.