
Part 3. Get confident: Using machine learning with PySpark

Parts 1 and 2 were all about data transformation, but we’re going to go above and beyond that by tackling scalable machine learning in part 3. While not a complete treatment of machine learning in itself, this part will give you the foundation to write your own ML programs in a robust and repeatable fashion.

Chapter 12 sets the stage for machine learning by building features, curated bits of information to use for the training process. Feature engineering itself is akin to purposeful data transformation. Get ready to use the skills learned in parts 1 and 2!

Chapter 13 introduces ML pipelines, Spark’s way to encapsulate ML workflows in a robust and repeatable way. Now more than ever, good code structure makes or breaks ML programs, so this tool will keep you sane as you build your models.

Finally, chapter 14 extends the ML pipeline abstraction by showing you how to create your own components. With this, your ML workflows can be as versatile as you need without compromising robustness and predictability.

At the end of part 3, you’ll be ready to scale your ML programs. Bring in the big data—time for some big insights!