chapter six

6 Real-time data processing and analytics

 

This chapter covers:

  • A definition of real-time processing and real-time analytics and some associated sample use cases
  • How best to organize data in “fast” storage
  • Understanding typical real-time data transformation scenarios
  • Organizing data for real-time use
  • Understanding common data transformations and translating them into real-time processing
  • Comparing real-time processing services available from Amazon Web Services (AWS), Microsoft Azure and Google Cloud (GC)

In this chapter, we’ll help you get a clear understanding of real-time or streaming data - one of the most popular features of a modern data platform.

We’ll cover the difference between real-time ingestion and real-time processing and walk through some examples of when to use one or both, showing different data platform designs.

We’ll also go deeper into how streaming data is organized - with producers, consumers, messages, partitions and offsets. Then we’ll walk through some typical real-time data transformation use cases, with particular attention to data deduplication, file format conversion, real-time data quality checks, and combining batch and real-time data.

6.1      Real-time ingestion vs real-time processing

6.2      Use cases for real-time data processing

6.2.1   Retail use case - real-time ingestion

6.2.2   Online gaming use case - real-time ingestion and real-time processing

6.2.3   Real-time ingestion vs real-time processing summary

6.3      When should you use real-time ingestion and/or real-time processing?

6.4      Organizing data for real-time use

6.4.1   The anatomy of fast storage

6.4.2   How does fast storage scale?

6.4.3   Organizing data in the real-time storage

6.5      Common data transformations in real time

6.5.1   Causes of duplicates in real-time systems

6.5.2   Deduplicating data in the real-time systems

6.5.3   Converting message formats in real-time pipelines

6.5.4   Real-time data quality checks