12 Transforming your data


This chapter covers

  • Learning the data transformation process
  • Performing record-level data transformation
  • Learning data discovery and data mapping
  • Implementing a data transformation process on a real-world dataset
  • Verifying the result of data transformations
  • Joining datasets to get richer data and insights

This chapter is probably the cornerstone of the book. All the knowledge you gathered through the first 11 chapters has brought you to these key questions: “Once I have all this data, how can I transform it, and what can I do with it?”

Apache Spark is all about data transformation. But what precisely is data transformation, and how can you perform it in a repeatable, procedural way? Think of data transformation as an industrial process that ensures your data is transformed adequately and reliably.

In this chapter, you will perform record-level transformations: manipulating data at the atomic level, cell by cell and column by column. For the labs, you will use the US Census Bureau’s report of the population of every county in every US state and territory, extracting information from it to build a new, derived dataset.

12.1 What is data transformation?

12.2 Process and example of record-level transformation

12.2.1 Data discovery to understand the complexity

12.2.2 Data mapping to draw the process

12.2.3 Writing the transformation code

12.2.4 Reviewing your data transformation to ensure a quality process

What about sorting?

Wrapping up your first Spark transformation

12.3 Joining datasets

12.3.1 A closer look at the datasets to join

12.3.2 Building the list of higher education institutions per county

12.3.3 Performing the joins

12.4 Performing more transformations

Summary