This chapter covers
- What JSON data is and how we can draw parallels between JSON documents and Python data structures.
- Ingesting JSON data into a data frame.
- Representing hierarchical data in a data frame through complex column types.
- Reducing duplication and reliance on auxiliary tables with a document/hierarchical data model.
- Creating and unpacking data from complex data types.
So far, we have used PySpark’s data frame to work with textual data (chapters 2 and 3) and tabular data (chapters 4 and 5). The two formats are quite different, yet both fit seamlessly into the data frame structure. I believe we’re ready to push the abstraction a little further by representing hierarchical information within a data frame.
Imagine it for a moment. Columns within columns. The ultimate flexibility.
This chapter is about ingesting and working with hierarchical JSON data using the PySpark data frame. I introduce the JSON format and how we can draw parallels to Python data structures. I go over the three container structures available for the data frame (the array, the map, and the struct) and how they are used to represent richer data layouts. I cover how we can use them to represent multidimensional data, and how the struct can represent hierarchical information. Finally, I wrap that information into a schema, a very useful construct for documenting what’s in your data frame.
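To make the parallel between JSON and Python data structures concrete before we dive in, here is a minimal sketch that uses only Python's standard `json` module. The document itself (a made-up TV show record, with its field names and values) is an assumption invented for illustration; it is not the data set used later in the chapter.

```python
import json

# Hypothetical JSON document, invented purely for illustration.
raw = """
{
  "id": 1,
  "name": "Example Show",
  "genres": ["Drama", "Comedy"],
  "network": {"name": "Example Network", "country": "CA"},
  "ended": null
}
"""

# json.loads parses a JSON string into plain Python objects.
doc = json.loads(raw)

# A JSON object becomes a dict, a JSON array becomes a list,
# and JSON null becomes None.
print(type(doc).__name__)
print(type(doc["genres"]).__name__)
print(type(doc["network"]).__name__)
print(doc["ended"])

# Nested values are reached by chaining keys, one level at a time.
print(doc["network"]["country"])
```

The nested `network` object and the `genres` array are the kind of "columns within columns" that the array, map, and struct types let us keep inside a single data frame.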