This chapter covers
This chapter focuses on the transformation of entire documents: Spark will ingest a complete document, transform it, and make it available in another format.
In the previous chapter, you read about data transformations. The next logical step is to transform entire documents and their structure. JSON, for example, is great for transporting data but a real pain to traverse when you want to run analytics on it. Similarly, joined datasets carry so much redundant data that building a synthetic (summary) view of them is painful. Apache Spark can help in both cases.
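To make that pain point concrete, here is a minimal plain-Python sketch (deliberately not Spark, so it runs anywhere; the store, order, and item fields are invented for illustration). It shows why traversing nested JSON for a simple aggregation is tedious, and how a flattened, table-like view of the same data makes the analysis trivial. Producing that flattened view from whole documents is exactly the kind of work Spark takes off your hands.

```python
import json

# A nested JSON document: easy to transport, awkward to analyze.
# (The structure and field names are made up for this example.)
doc = json.loads("""
{
  "store": "Books & More",
  "orders": [
    {"id": 1, "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
    {"id": 2, "items": [{"sku": "A", "qty": 5}]}
  ]
}
""")

# To sum quantities per SKU, you must walk two levels of nesting.
totals = {}
for order in doc["orders"]:
    for item in order["items"]:
        totals[item["sku"]] = totals.get(item["sku"], 0) + item["qty"]
print(totals)  # {'A': 7, 'B': 1}

# A flattened, tabular view of the same document: one row per item.
# Against rows like these, the same aggregation is a one-liner
# (or a simple groupBy in Spark).
rows = [
    (doc["store"], order["id"], item["sku"], item["qty"])
    for order in doc["orders"]
    for item in order["items"]
]
print(rows)
```

The flattening step is where the document's structure gets transformed: nesting is traded for repetition (the store name appears on every row), which is redundant for storage but far friendlier for analytics.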
Before wrapping up the chapter, I'll teach you a bit more about all those static functions Spark offers for data transformation. There are so many of them that giving you an example for each would require another book! Instead, I want you to have the tools to navigate them; appendix G will be your companion.
Finally, I will point you to more transformations that are present in the repository but not described in the book.