In this chapter, we’ll help you get a clear understanding of real-time or streaming data—one of the most popular features of a modern data platform.
We’ll cover the difference between real-time ingestion and real-time processing and walk through some examples of when to use one or both, showing different data platform designs.
We’ll also go deeper into how streaming data is organized: producers, consumers, messages, partitions, and offsets. Then we’ll walk through some typical real-time data transformation use cases, paying particular attention to data deduplication, file format conversion, real-time data quality checks, and combining batch and real-time data.
Last, each cloud vendor provides a pair of related services for real-time processing: one that implements real-time storage and maps to the fast storage layer in our architecture, and another that implements the processing itself. We’ll look at AWS Kinesis Data Streams and Kinesis Data Analytics, Google Cloud’s Pub/Sub and Cloud Dataflow, and Azure Event Hubs and Azure Stream Analytics.