6 Multidimensional data frames: using PySpark with JSON data


So far, we have used PySpark’s data frame to work with textual data (chapters 2 and 3) and tabular data (chapters 4 and 5). Both formats are, for the most part, two-dimensional, meaning that we have rows and columns filled with data. PySpark can represent many types of data within its cells: strings, numbers, even arrays/lists. What if a column could contain a column?

Columns within columns. The ultimate flexibility.

This chapter is about ingesting and working with JSON data, using the PySpark data frame. I introduce the JSON format and how we can draw parallels to Python data structures. I then quickly review the scalar data types we use in PySpark and how they are used for encoding data within a column. I go over the three container structures available for the data frame: the array, the map, and the struct. I cover how we can use them to represent multidimensional data, and how the struct can represent hierarchical information. Finally, I wrap that information into a schema, a very useful construct for documenting what’s in your data frame.

6.1  Reading JSON data: getting ready for the schemapocalypse

Every data processing job in PySpark starts with data ingestion, and JSON documents are no exception. This section explains what JSON is, how to use the specialized JSON reader with PySpark, and how a JSON file is represented within a data frame.
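To make the parallel between JSON and Python data structures concrete before bringing PySpark in, the sketch below parses a small JSON document with the `json` module from Python's standard library. The document itself is invented for illustration and is not one of the chapter's data sets.

```python
import json

# A small, made-up JSON document (illustration only).
raw = """
{
  "id": 143,
  "name": "Silicon Valley",
  "genres": ["Comedy"],
  "network": {"name": "HBO", "country": "US"}
}
"""

doc = json.loads(raw)

# A JSON object becomes a dict, a JSON array becomes a list,
# and JSON strings and numbers become str and int/float.
print(type(doc))                  # the top-level object is a dict
print(type(doc["genres"]))        # the array is a list
print(doc["network"]["country"])  # nested objects are nested dicts
```

Nested objects turn into nested dictionaries, which is the same "columns within columns" shape that this chapter handles in the data frame.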

6.1.1  Starting small: JSON data as Python dictionary

6.1.2  Going bigger: reading JSON data in PySpark

6.2  Breaking the second dimension with complex data types

6.2.1  When you have more than one value: the array

6.2.2  The map type: keys and values within a column

6.3  The struct: nesting columns within columns

6.3.1  Navigating structs as if they were nested columns

6.4  Building and using the data frame schema

6.4.1  Using Spark types as the base blocks of a schema

6.4.2  Reading a JSON document with a strict schema in place

6.4.3  Going full circle: specifying your schemas in JSON

6.5  Putting it all together: reducing duplicate data with complex data types

6.6  Summary

6.7  Exercises