12 Stream processing


This chapter covers

  • An overview of stream processing frameworks
  • Partitioning and parallelization mechanisms in Kafka Streams
  • Implementing SQL-like queries in stream processing
  • Demonstrating use cases for Kafka Streams

We now know several ways to populate Kafka with data. We can use producers to send data directly to Kafka, which makes the most sense when we're close to the data source. For example, the machines in our factory are equipped with Kafka producers that send measurement data and events directly to Kafka, and we also use producers to collect log data from servers and to track website visits. If, on the other hand, we want to pull data into Kafka from databases, files, or cloud services, it's worth taking a look at Kafka Connect. To make this concrete, a producer on one of those machines might look like the following minimal sketch; the topic name machine.measurements, the machine ID, and the JSON payload are illustrative assumptions, not fixed names.
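  import java.util.Properties;

  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerRecord;
  import org.apache.kafka.common.serialization.StringSerializer;

  public class MeasurementProducer {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("bootstrap.servers", "localhost:9092");
          props.put("key.serializer", StringSerializer.class.getName());
          props.put("value.serializer", StringSerializer.class.getName());

          // try-with-resources flushes and closes the producer on exit
          try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
              // Keying by machine ID keeps all measurements from one machine
              // in the same partition, and therefore in order
              producer.send(new ProducerRecord<>("machine.measurements",
                      "machine-42", "{\"temperature\":73.4}"));
          }
      }
  }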

Similarly, we're familiar with various methods to read data from Kafka and make it available to third-party systems. Kafka consumers work well when we want to display data directly or trigger actions in a third-party system. When the goal is to continuously write data from Kafka into other systems, however, we advise our customers to consider Kafka Connect, as it's often a more suitable approach than implementing custom consumers. A consumer for the first case might look like the following minimal sketch; it reads the same placeholder topic as above, and the group ID dashboard is again just an assumption.
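  import java.time.Duration;
  import java.util.List;
  import java.util.Properties;

  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import org.apache.kafka.common.serialization.StringDeserializer;

  public class MeasurementConsumer {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("bootstrap.servers", "localhost:9092");
          props.put("group.id", "dashboard");
          props.put("key.deserializer", StringDeserializer.class.getName());
          props.put("value.deserializer", StringDeserializer.class.getName());

          try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
              consumer.subscribe(List.of("machine.measurements"));
              while (true) {
                  // Poll for new records; each record could be displayed
                  // or used to trigger an action in a third-party system
                  ConsumerRecords<String, String> records =
                          consumer.poll(Duration.ofMillis(500));
                  for (ConsumerRecord<String, String> record : records) {
                      System.out.printf("%s -> %s%n", record.key(), record.value());
                  }
              }
          }
      }
  }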

With these tools, we have numerous ways to build performant systems: we can exchange data between different systems in near real time or create modern integration pipelines. Originally, Kafka was used to feed massive amounts of data into big data systems such as Hadoop, where the data was batch-processed later. Increasingly, though, we don't want to wait for a batch job; we want to process data as soon as it arrives, and that is exactly what stream processing is about.

12.1 Stream processing overview

12.1.1 Stream-processing libraries

12.1.2 Processing data

12.2 Stream processors

12.2.1 Processor types

12.2.2 Processor topologies

12.3 Stream processing using SQL

12.4 Stream states

12.4.1 Streams and tables

12.4.2 Aggregations

12.4.3 Streaming joins

12.4.4 Use case: Notifications

12.5 Streaming and time

12.5.1 Time is relative

12.5.2 Time windows

12.5.3 Use case: Fraud detection

12.6 Scaling Kafka Streams

Summary