12 Transforming your data
This chapter covers
- Learning the data transformation process
- Performing record-level data transformation
- Learning data discovery and data mapping
- Implementing a data transformation process on a real-world dataset
- Verifying the result of data transformations
- Joining datasets to get richer data and insights
This chapter is probably the cornerstone of the book. All the knowledge you gathered through the first 11 chapters has brought you to these key questions: Once I have all this data, how can I transform it, and what can I do with it?
Apache Spark is all about data transformation, but what exactly is data transformation, and how can you perform it in a repeatable, procedural way? Think of data transformation as an industrial process, one that ensures your data is transformed adequately and reliably.
You will then perform record-level transformations: manipulating the data at an atomic level, cell by cell and column by column. For the labs, you will use the US Census Bureau's report of the population of every county in every state and territory of the United States, and you will extract information from it to build a new dataset.
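To make the idea of record-level transformation concrete before diving into the labs, here is a minimal PySpark sketch. The file path and the column names (STNAME, CTYNAME, POPESTIMATE2017), as well as the derived columns, are illustrative placeholders and not necessarily the exact layout used in the chapter's dataset.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("RecordLevelTransformation").getOrCreate()

# Load a census-style CSV file (path and column names are assumptions).
df = spark.read.csv("data/census_county_population.csv",
                    header=True, inferSchema=True)

# Record-level transformations: each new value is computed from the
# cells of the same row, one column at a time.
transformed = (df
    .withColumnRenamed("CTYNAME", "county")                    # rename a column
    .withColumn("state", F.initcap(F.col("STNAME")))           # normalize text case
    .withColumn("popInThousands",
                F.round(F.col("POPESTIMATE2017") / 1000, 1))   # derive a numeric column
    .drop("STNAME", "POPESTIMATE2017"))                        # discard the raw columns

transformed.show(5)

Notice that every new value is derived solely from the cells of its own row; no aggregation or joining across rows is involved. That is what makes these transformations record-level.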