5 Processing

 

In this chapter:

  • Processing data and common schemas.
  • Tying datasets together through an identity keyring.
  • Building a timeline view of events.
  • Operating continuous data processing.

This chapter is all about data processing: reshaping the raw data we ingest into our platform to better suit our analytical needs. In part 1, we looked at the infrastructure pieces of our data platform. With those in place, we’ll shift our focus to supporting the common workloads: data processing, analytics, and machine learning. Figure 5.1 highlights this chapter on our orientation map.

Figure 5.1 Data processing is a common workload we need to support: reshaping the ingested raw data to facilitate analytics.

First, we’ll talk about some common data modeling concepts: normalizing data to reduce duplication and ensure integrity, and denormalizing data to improve query performance. We’ll learn about fact tables and dimension tables, and about the commonly used star and snowflake schemas.
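As a rough illustration of the trade-off between the two forms, the sketch below uses plain Python structures rather than a real warehouse. The table names, columns, and values (`customers`, `orders`, and so on) are made up for this example and do not come from the chapter's datasets.

```python
# Normalized form: each customer is stored once (a dimension table) and
# referenced by key from the orders (a fact table). All names and values
# here are hypothetical.
customers = {
    1: {"name": "Ada", "country": "UK"},
    2: {"name": "Lin", "country": "SG"},
}
orders = [
    {"order_id": 100, "customer_id": 1, "amount": 25.0},
    {"order_id": 101, "customer_id": 1, "amount": 40.0},
    {"order_id": 102, "customer_id": 2, "amount": 15.0},
]


def denormalize(fact_rows, dimension):
    """Join each fact row with its dimension row into one wide row."""
    return [{**row, **dimension[row["customer_id"]]} for row in fact_rows]


def total_by_country(rows):
    """Aggregate over the wide rows without any further lookups."""
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals


# Denormalized form: customer attributes are repeated on every order row,
# which duplicates data but lets queries skip the join.
wide_orders = denormalize(orders, customers)
print(total_by_country(wide_orders))
```

In the normalized form, changing a customer's country means updating one row. In the denormalized form, the same change touches every order for that customer, which is the integrity cost paid for faster reads.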

Next, we’ll build an identity keyring and see how it can help us connect the different identities managed by different groups across an enterprise. A keyring is a data model built on top of the raw data ingested into our platform; it gives that data a better structure and facilitates analytics.
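To make the idea of connecting identities concrete, here is a minimal sketch that groups linked identifiers using a union-find structure. This is only one possible way to group them, not necessarily the approach developed later in the chapter. The key types (`account_id`, `email`, `subscription_id`), the sample values, and the `build_keyring` helper are all assumptions made for illustration.

```python
def build_keyring(links):
    """Group identities that are linked directly or transitively.

    `links` is an iterable of pairs; each side is a (key_type, key_value)
    tuple taken from some source system. Returns a list of sets, one per
    connected group of identities.
    """
    parent = {}

    def find(key):
        # Register unseen keys as their own root, then walk up to the root.
        parent.setdefault(key, key)
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    for left, right in links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups = {}
    for key in list(parent):
        groups.setdefault(find(key), set()).add(key)
    return list(groups.values())


# Hypothetical links, each discovered in a different team's dataset.
links = [
    (("account_id", "A1"), ("email", "ada@example.com")),
    (("email", "ada@example.com"), ("subscription_id", "S9")),
    (("account_id", "A2"), ("email", "lin@example.com")),
]
keyring = build_keyring(links)
```

In this sample, account `A1` and subscription `S9` never appear together in any single link, yet they end up in the same group because both connect to the same email address. That transitive connection is what the keyring contributes on top of the raw datasets.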

5.1    Data modeling techniques

5.1.1   Normalization and denormalization

5.1.2   Data warehousing

5.1.3   Semi-structured data

5.1.4   Data modeling recap

5.2    Identity keyrings

5.2.1   Building an identity keyring

5.2.2   Understanding keyrings

5.3    Timelines

5.3.1   Building a timeline view

5.3.2   Using timelines

5.4    Continuous data processing

5.4.1   Tracking processing functions in Git

5.4.2   Keyring building in Data Factory

5.4.3   Scaling out

5.5    Summary
