5 Processing

This chapter covers

Processing data and common schemas
Tying datasets together through an identity keyring
Building a timeline view of events
Operating continuous data processing

In part 1, we looked at the infrastructure pieces of our data platform. With those in place, we’ll shift our focus to supporting three common workloads: data processing, analytics, and machine learning. This chapter covers the first of these, data processing: reshaping the raw data we ingest into our platform to better suit our analytical needs. Figure 5.1 highlights this chapter’s focus on our orientation map.

Figure 5.1 Data processing, specifically, reshaping the ingested raw data to facilitate analytics, is a common workload that we need to support.

First, we’ll talk about some common data modeling concepts, such as normalizing data to reduce duplication and ensure integrity, and denormalizing data to improve query performance. We’ll learn about fact tables and dimension tables and the commonly used star and snowflake schemas. Next, we’ll build an identity keyring and see how it helps us connect the different identities managed by different groups across an enterprise. The keyring is a data model built on top of the raw data ingested into our platform; it gives that data a better structure, which in turn facilitates analytics.
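As a small preview of the normalization trade-off mentioned above, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for a real data platform, and the table names, column names, and sample rows are all hypothetical, chosen only for illustration. It shows the same data in a normalized layout (customer names stored once, joined at query time) and in a denormalized layout (customer names copied onto every order row, so no join is needed).

```python
import sqlite3

# In-memory database; SQLite is only an illustrative stand-in here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized layout: each customer's name is stored exactly once,
    -- and orders refer to it by key.
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);

    -- Denormalized layout: the customer name is copied onto every order
    -- row, trading duplication for join-free reads.
    CREATE TABLE orders_wide AS
        SELECT o.order_id, o.customer_id, c.name AS customer_name, o.amount
        FROM orders AS o
        JOIN customers AS c ON c.customer_id = o.customer_id;
""")


def totals_normalized(conn):
    """Total order amount per customer, joining the two normalized tables."""
    return conn.execute("""
        SELECT c.name, SUM(o.amount)
        FROM orders AS o
        JOIN customers AS c ON c.customer_id = o.customer_id
        GROUP BY c.name
        ORDER BY c.name
    """).fetchall()


def totals_denormalized(conn):
    """Same totals from the wide table, with no join required."""
    return conn.execute("""
        SELECT customer_name, SUM(amount)
        FROM orders_wide
        GROUP BY customer_name
        ORDER BY customer_name
    """).fetchall()


print(totals_normalized(conn))
print(totals_denormalized(conn))
```

Both queries return the same totals. The difference is where the cost sits: the normalized layout keeps a single copy of each name, so a rename touches one row, while the denormalized layout repeats the name on every order and must be kept in sync, in exchange for simpler reads.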

5.1 Data modeling techniques

5.1.1 Normalization and denormalization

5.1.2 Data warehousing

5.1.3 Semistructured data

5.1.4 Data modeling recap

5.2 Identity keyrings

5.2.1 Building an identity keyring

5.2.2 Understanding keyrings

5.3 Timelines

5.3.1 Building a timeline view

5.3.2 Using timelines

5.4 Continuous data processing

5.4.1 Tracking processing functions in Git