6 Multi-dimensional data frames: using PySpark with JSON data

This chapter covers

What JSON data is and how we can draw parallels between JSON documents and Python data structures.
Ingesting JSON data within a data frame.
Representing hierarchical data in a data frame through complex column types.
Reducing duplication and reliance on auxiliary tables with a document/hierarchical data model.
Creating and unpacking data from complex data types.

So far, we have used PySpark’s data frame to work with textual (chapters 2 and 3) and tabular (chapters 4 and 5) data. The two formats are quite different, but both fit seamlessly into the data frame structure. I believe we’re ready to push the abstraction a little further by representing hierarchical information within a data frame.

Imagine it for a moment. Columns within columns. The ultimate flexibility.

This chapter is about ingesting and working with hierarchical JSON data using the PySpark data frame. I introduce the JSON format and how we can draw parallels to Python data structures. I go over the three container structures available for the data frame (the array, the map, and the struct) and how they are used to represent richer data layouts. I cover how we can use them to represent multi-dimensional data, and how the struct can represent hierarchical information. Finally, I wrap that information into a schema, a very useful construct for documenting what’s in your data frame.
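To make the parallel between JSON documents and Python data structures concrete before we bring PySpark in, here is a minimal sketch that uses only the `json` module from the standard library. The document itself (`raw_document`, with its show name, genres, and network) is a made-up example for illustration, not the data set used later in the chapter.

```python
import json

# Hypothetical JSON document, invented for illustration only.
raw_document = """
{
  "id": 1,
  "name": "Example Show",
  "genres": ["Comedy", "Drama"],
  "network": {"name": "Example Network", "country": {"code": "US"}},
  "ended": null,
  "running": false
}
"""

# json.loads parses a JSON string into plain Python objects.
document = json.loads(raw_document)

# A JSON object becomes a dict, and a JSON array becomes a list.
print(type(document).__name__)
print(type(document["genres"]).__name__)

# Nested objects become nested dicts, navigated one key at a time.
print(document["network"]["country"]["code"])

# JSON null maps to None, and JSON false maps to False.
print(document["ended"] is None, document["running"] is False)
```

The nesting is worth noticing here: a list of values and a dictionary inside a dictionary are the same shapes that the array and the struct give us inside a data frame column.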

6.1 Reading JSON data: getting ready for the schemapocalypse

6.1.1 Starting small: JSON data as a limited Python dictionary

6.1.2 Going bigger: reading JSON data in PySpark

6.2 Breaking the second dimension with complex data types

6.2.1 When you have more than one value: the array

6.2.2 The map type: keys and values within a column

6.3 The struct: nesting columns within columns

6.3.1 Navigating structs as if they were nested columns

6.4 Building and using the data frame schema

6.4.1 Using Spark types as the base blocks of a schema

6.4.2 Reading a JSON document with a strict schema in place

6.4.3 Going full circle: specifying your schemas in JSON

6.5 Putting it all together: reducing duplicate data with complex data types

6.5.1 Getting to the "just right" data frame: explode and collect

6.5.2 Building your own hierarchies: struct as a function

6.6 Summary

6.7 Exercises

6.7.1 Exercise 6.21

6.7.2 Exercise 6.22

6.7.3 Exercise 6.23