
Welcome
Thank you for purchasing the MEAP for Data Analysis with Python and PySpark. Writing this book is a lot of fun (and work!), and I hope you’ll enjoy reading it as much as I am enjoying writing it.
My journey with PySpark is pretty typical: the company I used to work for migrated their data infrastructure to a data lake and realized along the way that their usual warehouse-type jobs didn’t work so well anymore. I spent most of my first months there figuring out how to make PySpark work for my colleagues and myself, starting from zero. This book is heavily influenced by the questions I got from my colleagues and students (and sometimes myself). I’ve found that combining practical experience through real examples with a little bit of theory brings not only proficiency in using PySpark, but also an understanding of how to build better data programs. This book walks the line between the two by explaining important theoretical concepts without being too laborious.
This book covers a wide range of subjects, since PySpark is itself a very versatile platform. I have divided the book into three parts.