This chapter covers
This chapter is probably the cornerstone of the book. All the knowledge you gathered through the first 11 chapters has brought you to these key questions: “Once I have all this data, how can I transform it, and what can I do with it?”
Apache Spark is all about data transformation, but what precisely is data transformation? How can you perform such transformations in a repeatable and procedural way? Think of it as an industrial process that will ensure that your data is adequately and reliably transformed.
In this chapter, you will perform record-level transformations: manipulating the data at an atomic level, cell by cell, column by column. For the labs, you will use the US Census Bureau’s report of the population of every county in every state and territory of the United States. You will extract information from that report to build a different dataset.
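To make the idea of a record-level transformation concrete before the labs, here is a minimal sketch in plain Python, deliberately without Spark. It only illustrates the principle that each output row is computed from exactly one input row. The `CountyRecord` type, its field names, the `transform` function, and the sample values are all hypothetical placeholders, not the actual Census Bureau schema or figures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CountyRecord:
    """One input row. Field names are illustrative, not the real schema."""
    state: str
    county: str
    pop_start: int  # population at the start of the period (placeholder)
    pop_end: int    # population at the end of the period (placeholder)


def transform(record: CountyRecord) -> dict:
    """Build one output row from one input row, without reading any other row."""
    diff = record.pop_end - record.pop_start
    # Guard against division by zero for an empty starting population.
    growth_pct: Optional[float] = (
        round(100.0 * diff / record.pop_start, 2) if record.pop_start else None
    )
    return {
        "label": f"{record.county}, {record.state}",  # two cells merged into one
        "diff": diff,                                  # new column from two cells
        "growth_pct": growth_pct,                      # new column from two cells
    }


# Made-up sample rows, used only to exercise the function.
records = [
    CountyRecord("State A", "County X", 1000, 1100),
    CountyRecord("State B", "County Y", 2000, 1900),
]

# Apply the same transformation to every record independently.
transformed = [transform(r) for r in records]

for row in transformed:
    print(row)
```

Because `transform` never looks beyond the record it receives, the same logic can be applied to every row independently, which is the property that lets a distributed engine run such transformations in parallel.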