This chapter is all about data processing. In part 1, we looked at the infrastructure pieces of our data platform. With those in place, we’ll shift our focus to supporting common workloads: data processing, analytics, and machine learning. We start with data processing, which means reshaping the raw data we ingest into our platform to better suit our analytical needs. Figure 5.1 highlights this chapter’s focus on our orientation map.
Figure 5.1 Data processing, specifically reshaping the ingested raw data to facilitate analytics, is a common workload that we need to support.
First, we’ll talk about some common data modeling concepts, such as normalizing data to reduce duplication and ensure integrity, and denormalizing data to improve query performance. We’ll learn about fact tables and dimension tables and the commonly used star and snowflake schemas. Next, we’ll build an identity keyring and see how it helps us connect the different identities managed by different groups across an enterprise. The keyring is a data model built on top of the raw data ingested into our platform; it gives that data a better structure, which in turn facilitates analytics.
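To make the normalization and denormalization trade-off concrete before we dig in, here is a minimal Python sketch. It is only an illustration under stated assumptions: the table names (`dim_customer`, `fact_orders`), the columns, and the sample rows are all hypothetical and do not come from our platform. The normalized form stores each customer once in a dimension table and references it by key from the fact table; the denormalized form copies the customer attributes onto every order row so that a query doesn't need a join.

```python
# Hypothetical sample data, invented for illustration only.
# Dimension table: one row per customer, keyed by customer_id.
dim_customer = {
    1: {"name": "Ada", "country": "UK"},
    2: {"name": "Lin", "country": "SG"},
}

# Fact table: one row per order, referencing the dimension by key.
# Customer attributes are not repeated here (normalized form).
fact_orders = [
    {"order_id": 100, "customer_id": 1, "amount": 25.0},
    {"order_id": 101, "customer_id": 2, "amount": 40.0},
    {"order_id": 102, "customer_id": 1, "amount": 15.0},
]


def denormalize(facts, customers):
    """Join each fact row with its customer attributes into one wide row."""
    wide = []
    for row in facts:
        customer = customers[row["customer_id"]]
        # Copy the dimension attributes onto the fact row.
        wide.append({**row, **customer})
    return wide


def total_by_country(wide_rows):
    """Sum order amounts per country without any further lookups."""
    totals = {}
    for row in wide_rows:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals


wide_orders = denormalize(fact_orders, dim_customer)
totals = total_by_country(wide_orders)
print(totals)
```

In the normalized form, changing a customer’s country means updating a single row in `dim_customer`. In the denormalized form, the same change has to touch every order for that customer, which is the price we pay for simpler and faster reads.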