This chapter covers
- Organizing and processing data in your cloud data platform
- Understanding the different stages of data processing
- Discussing the rationale for separating storage from compute
- Organizing data in cloud storage and designing a data flow
- Implementing common data processing patterns
- Choosing the right file formats for archive, staging, and production
- Creating a single parameter-driven pipeline with common data transformations
We will introduce a number of concepts, including the difference between common data processing steps (file format conversion, deduplication, and schema management) and custom business logic (the rules each company applies to transform its data for a specific use case).
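To make the distinction concrete, here is a minimal Python sketch. It assumes records are plain dictionaries; the names `deduplicate`, `run_pipeline`, and `add_order_total`, as well as the field names, are hypothetical and chosen only for illustration. The common step (deduplication) is driven by a parameter and works for any source, while the business rule is supplied by whoever owns the use case.

```python
from typing import Callable

Record = dict  # assumption: each record is a plain dictionary


def deduplicate(records: list[Record], key: str) -> list[Record]:
    """Common step: keep the first record seen for each value of `key`."""
    seen = set()
    unique = []
    for record in records:
        value = record[key]
        if value not in seen:
            seen.add(value)
            unique.append(record)
    return unique


def run_pipeline(
    records: list[Record],
    key: str,
    business_rule: Callable[[Record], Record],
) -> list[Record]:
    """Apply the shared step first, then the caller's custom logic."""
    common = deduplicate(records, key)
    return [business_rule(record) for record in common]


def add_order_total(record: Record) -> Record:
    """Hypothetical business rule: one company's definition of an order total."""
    return {**record, "total": record["quantity"] * record["unit_price"]}


rows = [
    {"order_id": 1, "quantity": 2, "unit_price": 5.0},
    {"order_id": 1, "quantity": 2, "unit_price": 5.0},  # duplicate delivery
    {"order_id": 2, "quantity": 1, "unit_price": 3.5},
]
result = run_pipeline(rows, key="order_id", business_rule=add_order_total)
```

Because the deduplication key and the business rule are both passed in, the same `run_pipeline` function could serve many sources without changes to the shared code.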
We will walk through how to organize your data in storage, following the data journey through the landing, archive, staging, and production areas. We’ll explain how batch identifiers let you trace that journey through the storage areas and the warehouse, which makes debugging and lineage tracking easier.
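As a rough illustration of the idea, the sketch below builds a storage path for each area from the same batch identifier. The folder layout (`area/source/date/batch_id=...`), the area names, and the function names `build_path` and `journey` are assumptions made for this example, not a prescribed convention.

```python
from datetime import date
from pathlib import PurePosixPath

# Assumed area names, in the order data travels through them.
AREAS = ("landing", "archive", "staging", "production")


def build_path(area: str, source: str, ingest_date: date, batch_id: str) -> str:
    """Return a storage path that embeds the batch identifier."""
    if area not in AREAS:
        raise ValueError(f"unknown storage area: {area!r}")
    return str(
        PurePosixPath(area, source, ingest_date.isoformat(), f"batch_id={batch_id}")
    )


def journey(source: str, ingest_date: date, batch_id: str) -> dict[str, str]:
    """Map every area to the path where this batch would be stored."""
    return {area: build_path(area, source, ingest_date, batch_id) for area in AREAS}


paths = journey("orders", date(2024, 1, 15), "b-0001")
for area, path in paths.items():
    print(f"{area:<10} {path}")
```

Since every path carries the same `batch_id`, finding all copies of a problematic batch is a matter of searching for one string, which is what makes tracing and debugging simpler.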
We will talk about the use of different file formats for the different storage areas and the importance of standardizing on binary formats in staging and production for better compression, better performance, and common schemas.