1 Introduction


This chapter covers

  • What PySpark is
  • Why PySpark is a useful tool for analytics
  • The versatility of the Spark platform and its limitations
  • PySpark’s way of processing data

According to pretty much every news outlet, data is everything, everywhere. It’s the new oil, the new electricity, the new gold, plutonium, even bacon! We call it powerful, intangible, precious, dangerous. At the same time, data itself is not enough: it is what you do with it that matters. After all, for a computer, any piece of data is a collection of zeroes and ones, and it is our responsibility, as users, to make sense of how it translates to something useful.

Just like oil, electricity, gold, plutonium, and bacon (especially bacon!), our appetite for data is growing. So much, in fact, that computers can't keep up. Data is growing in size and in complexity, yet consumer hardware has been stalling a little. RAM on most laptops hovers around 8 to 16 GB, and SSDs get prohibitively expensive past a few terabytes. Is the solution for the burgeoning data analyst to triple-mortgage their life to afford top-of-the-line hardware to tackle big data problems?

1.1 What is PySpark?

1.1.1 Taking it from the start: What is Spark?

1.1.2 PySpark = Spark + Python

1.1.3 Why PySpark?

1.2 Your very own factory: How PySpark works

1.2.1 Some physical planning with the cluster manager

1.2.2 A factory made efficient through a lazy leader

1.3 What will you learn in this book?

1.4 What do I need to get started?