This chapter covers
- Drawing parallels between JSON documents and Python data structures
- Ingesting JSON data within a data frame
- Representing hierarchical data in a data frame through complex column types
- Reducing duplication and reliance on auxiliary tables with a document/hierarchical data model
- Creating and unpacking data from complex data types
Thus far, we have used PySpark’s data frame to work with textual (chapters 2 and 3) and tabular (chapters 4 and 5) data. The two formats are quite different, yet both fit seamlessly into the data frame structure. I believe we’re ready to push the abstraction a little further by representing hierarchical information within a data frame. Imagine it for a moment: columns within columns, the ultimate flexibility.
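To make "columns within columns" a little more concrete before any PySpark is involved, here is a minimal sketch that uses only Python's standard `json` module. The record is invented for illustration (every field name and value is an assumption, not data from this chapter); it simply shows how a JSON document maps onto nested Python dictionaries and lists, which is the kind of hierarchy we want a data frame to hold.

```python
import json

# Hypothetical record, made up for this sketch only.
raw = """
{
  "id": 1,
  "name": "Example Show",
  "genres": ["Drama", "Mystery"],
  "network": {"name": "Example Network", "country": "CA"}
}
"""

show = json.loads(raw)

# A JSON object becomes a dict, a JSON array becomes a list,
# and scalar values become plain Python scalars.
top_level_types = {key: type(value).__name__ for key, value in show.items()}

# Nested values are reached by chaining keys and indices.
nested_country = show["network"]["country"]
second_genre = show["genres"][1]

print(top_level_types)
print(nested_country, second_genre)
```

Read this way, `genres` and `network` are fields that themselves contain more than one value, which is what a hierarchical column would need to represent.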