6 Multidimensional data frames: Using PySpark with JSON data


This chapter covers

  • Drawing parallels between JSON documents and Python data structures
  • Ingesting JSON data into a data frame
  • Representing hierarchical data in a data frame through complex column types
  • Reducing duplication and reliance on auxiliary tables with a document/hierarchical data model
  • Creating and unpacking data from complex data types

Thus far, we have used PySpark’s data frame to work with textual (chapters 2 and 3) and tabular (chapters 4 and 5) data. Both data formats were pretty different, but they fit seamlessly into the data frame structure. I believe we’re ready to push the abstraction a little further by representing hierarchical information within a data frame. Imagine it for a moment: columns within columns, the ultimate flexibility.

6.1 Reading JSON data: Getting ready for the schemapocalypse

6.1.1 Starting small: JSON data as a limited Python dictionary
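
Before we touch Spark, it helps to see how a JSON document maps onto plain Python. Here is a minimal sketch, using an inline document invented for illustration:

import json

# JSON objects become dicts, arrays become lists, and scalar values
# become str/int/float/bool/None.
document = """{
    "name": "Sample Show",
    "genres": ["Comedy"],
    "runtime": 30
}"""

show = json.loads(document)
print(show["genres"][0])  # Comedy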

6.1.2 Going bigger: Reading JSON data in PySpark
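
As a preview of this section, here is a minimal sketch of reading a JSON document into a data frame. The file path is a placeholder, and multiLine=True assumes each document spans several lines rather than one document per line.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

shows = spark.read.json(
    "./data/shows.json",  # hypothetical path
    multiLine=True,       # one JSON document spread over several lines
)

shows.printSchema()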

6.2 Breaking the second dimension with complex data types

6.2.1 When you have more than one value: The array
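
A quick sketch of what an array column looks like in practice, using a toy data frame built for illustration:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(["Comedy", "Drama"],)], ["genres"])

df.select(
    F.size("genres").alias("num_genres"),     # number of elements
    F.col("genres")[0].alias("first_genre"),  # index into the array
    F.array_contains("genres", "Comedy").alias("is_comedy"),
).show()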

6.2.2 The map type: Keys and values within a column
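
The map behaves like a typed Python dictionary inside a column. A minimal sketch, again on invented data:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([({"color": "teal", "size": "M"},)], ["attributes"])

df.select(
    F.col("attributes")["color"].alias("color"),  # look up a single key
    F.map_keys("attributes").alias("keys"),
    F.map_values("attributes").alias("values"),
).show()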

6.3 The struct: Nesting columns within columns

6.3.1 Navigating structs as if they were nested columns
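
A sketch of two equivalent ways to reach inside a struct column, using a schema declared as a DDL string for brevity:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(("Monday", "22:00"),)],
    "schedule struct<day: string, time: string>",
)

df.select(
    F.col("schedule.day"),               # dot notation into the struct
    F.col("schedule").getField("time"),  # equivalent, more explicit
).show()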

6.4 Building and using the data frame schema

6.4.1 Using Spark types as the base blocks of a schema
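
A sketch of a schema assembled by hand from Spark types; the field names are invented for illustration:

import pyspark.sql.types as T

# Each StructField takes a name, a type, and a nullability flag.
schema = T.StructType([
    T.StructField("name", T.StringType(), True),
    T.StructField("genres", T.ArrayType(T.StringType()), True),
    T.StructField("runtime", T.LongType(), True),
])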

6.4.2 Reading a JSON document with a strict schema in place
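
Passing that kind of schema to the reader, plus a strict read mode, looks roughly like this (the path is again a placeholder):

from pyspark.sql import SparkSession
import pyspark.sql.types as T

spark = SparkSession.builder.getOrCreate()

schema = T.StructType([
    T.StructField("name", T.StringType()),
    T.StructField("runtime", T.LongType()),
])

shows = spark.read.json(
    "./data/shows.json",  # hypothetical path
    schema=schema,
    mode="FAILFAST",      # error out on records that break the schema
    multiLine=True,
)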

6.4.3 Going full circle: Specifying your schemas in JSON
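
Because a schema serializes to JSON and back, it can live in a plain text file next to the pipeline code. A minimal round trip:

import json
import pyspark.sql.types as T

schema = T.StructType([T.StructField("name", T.StringType())])

as_json = schema.json()                                   # schema -> JSON string
round_trip = T.StructType.fromJson(json.loads(as_json))  # and back

assert round_trip == schema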

6.5 Putting it all together: Reducing duplicate data with complex data types

6.5.1 Getting to the “just right” data frame: Explode and collect
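
The explode/collect round trip, sketched on a toy data frame:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("Sample Show", ["Comedy", "Drama"])], ["name", "genres"]
)

# explode() yields one record per array element...
exploded = df.select("name", F.explode("genres").alias("genre"))

# ...and collect_list() gathers the values back into one array per group.
regrouped = exploded.groupBy("name").agg(
    F.collect_list("genre").alias("genres")
)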

6.5.2 Building your own hierarchies: Struct as a function
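
A sketch of struct() as a column function, nesting two plain columns into one:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("Monday", "22:00")], ["day", "time"])

# struct() nests existing columns into a single struct column.
nested = df.select(F.struct("day", "time").alias("schedule"))
nested.printSchema()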

Summary

Additional exercises

Exercise 6.4

Exercise 6.5

Exercise 6.6
