chapter six

6 Real-time data processing and analytics

This chapter covers:

A definition of real-time processing and real-time analytics, with sample use cases
How best to organize data in “fast” storage
Understanding typical real-time data transformation scenarios
Organizing data for real-time use
Understanding common data transformations and translating them into real-time processing
Comparing real-time processing services available from Amazon Web Services (AWS), Microsoft Azure and Google Cloud (GC)

In this chapter, we’ll help you get a clear understanding of real-time or streaming data - one of the most popular features of a modern data platform.

We’ll cover the difference between real-time ingestion and real-time processing and walk through some examples of when to use one or both, showing different data platform designs.

We’ll also go deeper into how streaming data is organized, with producers, consumers, messages, partitions, and offsets. Then we’ll walk through some typical real-time data transformation use cases, with particular attention to data deduplication, file format conversion, real-time data quality checks, and combining batch and real-time data.

6.1 Real-time ingestion vs real-time processing

6.2 Use cases for real-time data processing

6.2.1 Retail use case - real-time ingestion

6.2.2 Online gaming use case - real-time ingestion and real-time processing

6.2.3 Real-time ingestion vs real-time processing summary

6.3 When should you use real-time ingestion and/or real-time processing?

6.4 Organizing data for real-time use

6.4.1 The anatomy of fast storage

6.4.2 How does fast storage scale?

6.4.3 Organizing data in real-time storage

6.5 Common data transformations in real time

6.5.1 Causes of duplicates in real-time systems

6.5.2 Deduplicating data in real-time systems

6.5.3 Converting message formats in real-time pipelines

6.5.4 Real-time data quality checks