This chapter covers:
- How PySpark encodes pieces of data inside columns, and how a column’s type conveys meaning about the operations you can perform on it.
- What kinds of types PySpark provides, and how they relate to Python’s own types.
- How PySpark can represent multi-dimensional data using compound types.
- How PySpark structures columns inside a data frame, and how you can provide a schema to manage said structure.
- How to transform the type of a column, and the implications of doing so.
- How PySpark treats null values and how you can work with them.
Data is beautiful.
We give data physical qualities like "beautiful", "tidy", or "ugly", but those words don’t mean the same thing for data as they do for a physical object. The same ambiguity applies to the concept of "data quality": what makes a high-quality data set?
This chapter focuses on bringing meaning to the data you ingest and process. While we can’t always explain everything in a data set just by looking at it, peeking at how the data is represented can lay the foundation of a successful data product. We will look at how PySpark organizes data within a data frame to accommodate a wide variety of use cases. We’ll talk about data representation through types, how types can guide our operations, and how to avoid common mistakes when working with them.
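As a quick preview, the short sketch below (with made-up column names and values) creates a small data frame and asks PySpark for its schema; the types PySpark infers already hint at which operations make sense for each column.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.getOrCreate()

# A tiny, hypothetical data frame: each Python value maps to a PySpark type.
df = spark.createDataFrame(
    [("pomegranate", 4, 1.99), ("kiwi", 12, 0.25)],
    schema=["fruit", "quantity", "unit_price"],
)

# printSchema() shows each column's name, its inferred type, and its nullability.
df.printSchema()
# root
#  |-- fruit: string (nullable = true)
#  |-- quantity: long (nullable = true)
#  |-- unit_price: double (nullable = true)
```

Here, PySpark inferred the types from the Python values we provided; later in the chapter we’ll see how to declare a schema explicitly instead of relying on inference.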