13 Transforming entire documents

This chapter covers

  • Transforming entire documents to support analytics or produce condensed documents.
  • Navigating the catalog of static functions.
  • Using static functions for data transformation.

This chapter focuses on the transformation of entire documents: Spark ingests a complete document, transforms it, and makes it available in another format.

In the previous chapter, you read about data transformations. The next logical step is to transform entire documents and their structure. JSON, for example, is great for transporting data but a real pain to traverse when you need to run analytics on it. Similarly, joined datasets carry so much redundancy that it is hard to get a synthetic view of the data. Apache Spark can help in both cases.

Before wrapping up the chapter, I'll teach you a bit more about all the static functions Spark offers for data transformation. There are so many of them that providing an example for each would take another book! Instead, I want to give you the tools to navigate among them; appendix K will then be your companion.

Finally, I will point you to more transformations that are present in the repository but not described in the book.

As in previous chapters, I believe that using real-life datasets from official sources helps you understand the concepts more thoroughly. In this chapter, I also use simplified datasets where it makes sense.

13.1 Transforming entire documents and their structure

13.1.1 Flattening your JSON document

13.1.2 Building nested documents for transfer and storage

13.2 The magic behind static functions

13.3 Performing more transformations

13.4 Summary