5 Organizing and processing data

This chapter covers

Organizing and processing data in your cloud data platform
Understanding the different stages of data processing
Discussing the rationale for separating storage from compute
Organizing data in cloud storage and designing a data flow
Implementing common data processing patterns
Choosing the right file formats for archive, staging, and production
Creating a single parameter-driven pipeline with common data transformations

We will introduce a number of concepts, including the difference between common data processing steps (such as file format conversion, deduplication, and schema management) and custom business logic (the rules each company applies to transform its data for a specific use case).

We will walk through how to organize your data in storage, following the data journey through landing, archiving, staging, and production areas. We’ll explain the importance of using batch identifiers to make it simpler to trace the data journey through the storage areas and the warehouse and make debugging and lineage tracking easier.
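As a rough illustration of the batch identifier idea, the sketch below builds a path for each storage area that embeds the same batch identifier, so one value can be used to find a file at every step of its journey. The area names follow the chapter; the path layout, the identifier format, and the function names (`new_batch_id`, `batch_path`, `trace_batch`) are assumptions made for this example, not a prescribed convention.

```python
import uuid
from datetime import datetime, timezone

# Storage areas named in the chapter, in the order data moves through them.
AREAS = ("landing", "archive", "staging", "production")


def new_batch_id(now=None):
    """Return a sortable, unique batch identifier (hypothetical format)."""
    now = now or datetime.now(timezone.utc)
    return f"{now:%Y%m%dT%H%M%S}-{uuid.uuid4().hex[:8]}"


def batch_path(area, source, batch_id, filename):
    """Build the object path for one file in one storage area."""
    if area not in AREAS:
        raise ValueError(f"unknown storage area: {area}")
    return f"{area}/{source}/batch_id={batch_id}/{filename}"


def trace_batch(source, batch_id, filename):
    """List where a batch's file would live in every storage area."""
    return {area: batch_path(area, source, batch_id, filename) for area in AREAS}


if __name__ == "__main__":
    batch_id = new_batch_id()
    for area, path in trace_batch("sales", batch_id, "orders.csv").items():
        print(f"{area:<10} {path}")
```

Because the identifier appears in every path, a failed load in the warehouse can be traced back to the exact files in staging, archive, and landing by searching for that one value.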

We will talk about the use of different file formats for the different storage areas and the importance of standardizing on binary formats in staging and production for compression, performance, and common schemas.

Finally, we’ll explain how to scale common data processing by designing flexible, configurable pipelines that are driven by orchestration.
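One way to picture a configurable pipeline is a single function whose behavior is driven by per-source configuration rather than per-source code. The sketch below is a minimal in-memory version of that idea. The config keys (`dedup_keys`, `required_fields`), the step functions, and the sample sources are hypothetical; a real pipeline would typically run on a distributed processing engine and be triggered by an orchestrator.

```python
def deduplicate(rows, keys):
    """Keep the first row seen for each combination of key values."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.get(k) for k in keys)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def check_required(rows, fields):
    """Split rows into (valid, rejected) by whether required fields are filled."""
    valid, rejected = [], []
    for row in rows:
        ok = all(row.get(f) not in (None, "") for f in fields)
        (valid if ok else rejected).append(row)
    return valid, rejected


def run_pipeline(rows, config):
    """Apply the same common steps to any source, driven only by its config."""
    rows = deduplicate(rows, config["dedup_keys"])
    valid, rejected = check_required(rows, config["required_fields"])
    return {"valid": valid, "rejected": rejected}


# Hypothetical per-source configuration: adding a source means adding an entry
# here, not writing a new pipeline.
CONFIGS = {
    "orders": {"dedup_keys": ["order_id"], "required_fields": ["order_id", "amount"]},
    "customers": {"dedup_keys": ["email"], "required_fields": ["email"]},
}

orders = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 1, "amount": 10.0},   # exact duplicate, dropped by deduplicate
    {"order_id": 2, "amount": None},   # missing amount, routed to rejected
]
result = run_pipeline(orders, CONFIGS["orders"])
```

The design choice here is that the common steps stay generic and the differences between sources live entirely in configuration, which is what lets one pipeline serve many sources.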

5.1 Processing as a separate layer in the data platform

5.2 Data processing stages

5.3 Organizing your cloud storage

5.3.1 Cloud storage containers and folders

5.4 Common data processing steps

5.4.1 File format conversion

5.4.2 Data deduplication

5.4.3 Data quality checks

5.5 Configurable pipelines