So far, we have used PySpark’s data frame to work with textual (chapters 2 and 3) and tabular (chapters 4 and 5) data. Both formats are, for the most part, two-dimensional, meaning that we have rows and columns filled with data. PySpark can represent many types of data within a cell: strings, numbers, even arrays/lists. What if a column could contain a column?
Columns within columns. The ultimate flexibility.
This chapter is about ingesting and working with JSON data using the PySpark data frame. I introduce the JSON format and draw parallels between it and Python data structures. I then quickly review the scalar data types PySpark uses for encoding data within a column. I go over the three container structures available to the data frame: the array, the map, and the struct. I cover how we can use them to represent multidimensional data, and how the struct encodes hierarchical information. Finally, I wrap that information into a schema, a very useful construct for documenting what’s in your data frame.
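To give a first taste of those three containers before we dive in, here is a minimal sketch, assuming an active SparkSession; the column names and values are purely illustrative, and each structure gets a full treatment later in the chapter.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A one-row data frame holding an array, a map, and a struct column.
# The schema is given as a DDL-style string; the struct value is
# passed as a plain Python tuple.
df = spark.createDataFrame(
    [("one", [1, 2, 3], {"a": 1}, (1, "uno"))],
    "name string, numbers array<int>, mapping map<string,int>, "
    "pair struct<num:int, word:string>",
)

df.printSchema()  # shows the array, map, and struct column types
```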
Every data processing job in PySpark starts with data ingestion, and JSON documents are no exception. This section explains what JSON is, how to use PySpark’s specialized JSON reader, and how a JSON file is represented within a data frame.
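As a preview of the reader in action, here is a minimal sketch, assuming an active SparkSession and a hypothetical file named shows.json that contains one JSON document per line (the default layout the JSON reader expects).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "shows.json" is a hypothetical file name. By default, the JSON
# reader expects one JSON document per line; for a single document
# spread over multiple lines, pass multiLine=True.
shows = spark.read.json("shows.json")

shows.printSchema()  # nested fields appear as struct and array columns
```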