13. Transforming entire documents

This chapter covers

  • Transforming entire documents for easier analytics or for condensing them
  • Navigating the catalog of static functions
  • Using static functions for data transformation

This chapter focuses on the transformation of entire documents: Spark will ingest a complete document, transform it, and make it available in another format.

In the previous chapter, you read about data transformations. The next logical step is to transform entire documents and their structure. As an example, JSON is great for transporting data but a real pain to traverse when you have to do analytics. In a similar way, joined datasets carry so much redundant data that it is painful to get a consolidated view of them. Apache Spark can help in both cases.
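To make this concrete, here is a minimal sketch of the kind of flattening you will build in section 13.1.1. Everything in it is an assumption of mine, not the book's code: the class name, the columns, and the hypothetical data/store.json file whose records nest an array of books.

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FlattenJsonApp {

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Flattening a JSON document")
        .master("local[*]")
        .getOrCreate();

    // Hypothetical input: each record nests an array of books, such as
    // {"store":"Downtown","books":[{"title":"A","price":10.0}]}
    Dataset<Row> df = spark.read()
        .format("json")
        .load("data/store.json");

    // explode() creates one row per element of the books array;
    // selecting the nested fields then promotes them to top-level columns.
    Dataset<Row> flatDf = df
        .withColumn("book", explode(col("books")))
        .select(col("store"), col("book.title"), col("book.price"));

    flatDf.printSchema();
    flatDf.show(5);
  }
}

Once the document is flat, the usual analytics tools (groupBy(), aggregations, plain SQL) apply to it directly; that is the whole point of the transformation.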

Before I wrap up the chapter, I’ll teach you a bit more about all those static functions Spark offers for data transformation. There are so many of them that giving you an example for each would require another book! Therefore, I want you to have the tools to navigate them. Appendix G will be your companion.
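To give you a first taste, here is a small, self-contained sketch that combines three of those static functions, lower(), lit(), and concat(), imported from org.apache.spark.sql.functions. The class name and sample data are mine, not from the book's repository.

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.lower;

import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StaticFunctionApp {

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Static functions at work")
        .master("local[*]")
        .getOrCreate();

    // A tiny one-column dataframe of city names
    Dataset<Row> df = spark
        .createDataset(Arrays.asList("NEW YORK", "Chicago"), Encoders.STRING())
        .toDF("city");

    // lower(), lit(), and concat() are static functions from
    // org.apache.spark.sql.functions: each returns a Column
    // expression that Spark evaluates for every row.
    Dataset<Row> result = df.withColumn(
        "greeting",
        concat(lit("Welcome to "), lower(col("city")), lit("!")));

    result.show(false);
  }
}

The pattern is always the same, whichever function you pick from the catalog: import it statically, and compose the Column expressions it returns inside withColumn() or select().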

Finally, I will point you to more transformations that are available in the book's repository but not described in the book.

As in previous chapters, I believe that using real-life datasets from official sources will help you understand the concepts more thoroughly. In this chapter, I also use simplified datasets where they make sense.

13.1 Transforming entire documents and their structure

13.1.1 Flattening your JSON document

13.1.2 Building nested documents for transfer and storage

13.2 The magic behind static functions

13.3 Performing more transformations

Summary