6 Making sense of your data: types, structure and semantics

This chapter covers:

How PySpark encodes pieces of data inside columns, and how their type conveys meaning about what operations you can perform on a given column.
Which types PySpark provides, and how they relate to Python’s own types.
How PySpark can represent multi-dimensional data using compound types.
How PySpark structures columns inside a data frame, and how you can provide a schema to manage said structure.
How to transform the type of a column, and what the implications of doing so are.
How PySpark treats null values and how you can work with them.

Data is beautiful.

We give data physical qualities like "beautiful", "tidy" or "ugly", but these words don’t mean the same thing for data as they do for a physical object. The same applies to the concept of "data quality": what makes a high-quality data set?

This chapter focuses on bringing meaning to the data you ingest and process. While we can’t always explain everything in a data set just by looking at it, a peek at how the data is represented can lay the foundation of a successful data product. We will look at how PySpark organizes data within a data frame to accommodate a wide variety of use cases. We’ll talk about data representation through types, how types can guide our operations, and how to avoid common mistakes when working with them.

6.1 Open sesame: what does your data tell you?

6.2 The first step in understanding our data: PySpark’s scalar types

6.2.1 String and bytes

6.2.2 The numerical tower(s): integer values

6.2.3 The numerical tower(s): double, floats and decimals

6.2.4 Date and timestamp

6.2.5 Null and boolean

6.3 PySpark’s complex types

6.3.1 Complex types: the array

6.3.2 Complex types: the map

6.4 Structure and type: The dual-nature of the struct

6.4.1 A data frame is an ordered collection of columns

6.4.2 The second dimension: just enough about the row

6.4.3 Casting your way to sanity

6.4.4 Defaulting values with fillna

6.5 Summary

6.6 Exercises