6 Real-time data processing and analytics

 

This chapter covers

  • Defining real-time processing and real-time analytics
  • Organizing data in fast storage
  • Understanding typical real-time data transformation scenarios
  • Organizing data for real-time use
  • Translating common data transformations into real-time processing
  • Comparing real-time processing services

In this chapter, we’ll help you get a clear understanding of real-time, or streaming, data processing, which is one of the most requested capabilities of a modern data platform.

We’ll cover the difference between real-time ingestion and real-time processing and walk through some examples of when to use one or both, showing different data platform designs.

We’ll also go deeper into how streaming data is organized, with producers, consumers, messages, partitions, and offsets. Then we’ll walk through some typical real-time data transformation use cases, paying particular attention to data deduplication, file format conversion, real-time data quality checks, and combining batch and real-time data.
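
To make these terms concrete before we get to the details, here is a minimal sketch of how producers, consumers, messages, partitions, and offsets relate to each other. It is a toy in-memory model, not the API of any cloud service: the `FastStorage` and `Consumer` classes, their methods, and the `player-42` key are all hypothetical names chosen for illustration.

```python
import zlib
from collections import defaultdict


class FastStorage:
    """Toy in-memory stand-in for a partitioned message stream (hypothetical)."""

    def __init__(self, num_partitions):
        # Each partition is an append-only list of messages.
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # A stable hash sends every message with the same key to the same partition.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        # The offset is the message's position within its partition.
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def read(self, partition, offset, max_messages=10):
        # Reading does not remove messages; it only returns a slice of the log.
        return self.partitions[partition][offset:offset + max_messages]


class Consumer:
    """Tracks, per partition, the next offset it has not yet read."""

    def __init__(self, storage):
        self.storage = storage
        self.offsets = defaultdict(int)

    def poll(self, partition):
        batch = self.storage.read(partition, self.offsets[partition])
        # Advance the offset so the next poll starts after this batch.
        self.offsets[partition] += len(batch)
        return batch


storage = FastStorage(num_partitions=3)
p1, o1 = storage.produce("player-42", "login")
p2, o2 = storage.produce("player-42", "level_up")

consumer = Consumer(storage)
first = consumer.poll(p1)   # both messages, in the order they were produced
second = consumer.poll(p1)  # empty: the consumer's offset has moved past them
```

The point of the sketch is that the stream keeps the messages and the consumer keeps only a position. Real services add durability, retention limits, and coordination between multiple consumers, which this model leaves out on purpose.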

Last, each cloud vendor provides a pair of related services for real-time processing: one that implements real-time storage and maps to the fast storage layer in our architecture, and another that implements the real-time processing itself. We’ll look at AWS Kinesis Data Streams and Kinesis Data Analytics, Google Cloud’s Pub/Sub and Cloud Dataflow, and Azure Event Hubs and Azure Stream Analytics.

6.1 Real-time ingestion vs. real-time processing

6.2 Use cases for real-time data processing

6.2.1 Retail use case: Real-time ingestion

6.2.2 Online gaming use case: Real-time ingestion and real-time processing

6.2.3 Summary of real-time ingestion vs. real-time processing

6.3 When should you use real-time ingestion and/or real-time processing?

6.4 Organizing data for real-time use

6.4.1 The anatomy of fast storage

6.4.2 How does fast storage scale?

6.4.3 Organizing data in the real-time storage

6.5 Common data transformations in real time