chapter six

6 Real-time data processing and analytics

 

This chapter covers:

  • A definition of real-time processing and real-time analytics and some associated sample use cases
  • How best to organize data in “fast” storage
  • Typical real-time data transformation scenarios
  • A comparison of real-time processing services available from Amazon Web Services (AWS), Microsoft Azure and Google Cloud (GC)

By the end of this chapter you’ll be able to:

  • Recognize valid use cases for real-time processing and distinguish between different real-time scenarios
  • Organize data for real-time processing
  • Translate common data transformations into real-time processing
  • Differentiate between the various real-time service offerings available from the three major cloud vendors

In this chapter, we’ll help you get a clear understanding of real-time or streaming data - one of the most popular features of a modern data platform.

We’ll cover the difference between real-time ingestion and real-time processing and walk through some examples of when to use one or both, showing different data platform designs.

We’ll also go deeper into how streaming data is organized - with producers, consumers, messages, partitions and offsets. Then we’ll walk through some typical real-time data transformation use cases, with particular attention to data deduplication, file format conversion, real-time data quality checks, and combining batch and real-time data.

6.1 Real-time ingestion vs real-time processing

6.2 Use cases for real-time data processing

6.2.1 Retail use case - real-time ingestion

6.2.2 Online gaming use case - real-time ingestion and real-time processing

6.2.3 Real-time ingestion vs real-time processing summary

6.3 When should you use real-time ingestion and/or real-time processing?

6.4 Organizing data for real-time use

6.4.1 The anatomy of fast storage

6.4.2 How does fast storage scale?

6.4.3 Organizing data in the real-time storage

6.5 Common data transformations in real time

6.5.1 Causes of duplicates in real-time systems

6.5.2 Deduplicating data in the real-time systems

6.5.3 Converting message formats in real-time pipelines

6.5.4 Real-time data quality checks

6.5.5 Combining batch and real-time data

6.6 Cloud services for real-time data processing

6.6.1 AWS real-time processing services

6.6.2 Google Cloud real-time processing services

6.6.3 Azure real-time processing services

6.7 Summary