6 Stateful transformations with Kafka Streams


This chapter covers

  • How Kafka Streams manages state
  • Aggregating records
  • Deduplicating data
  • Joining records of different topics
  • A comparison of Kafka Streams and Apache Flink

In the last chapter, we applied stateless stream processing to transform, filter, and route events from the analytics API one at a time. If you need to process multiple Kafka messages from one or more topics together, for instance, when joining the analytics data with records from your user database, you must maintain state across the processing of multiple events. This chapter covers that case and introduces you to stateful stream processing.

tip

This chapter does not contain the complete code of the Kafka Streams applications but shows only their topologies. You can add the examples to the application that you built in Chapter 5 or start from scratch using the project template from https://github.com/flippingbits/streaming-data-pipelines-book-streams-template. As discussed in the introduction to Chapter 5, please make sure that the topic online_shop.analytics.visitors holds data before you start implementing the examples from this chapter.

6.1 How does Kafka Streams manage state?

6.2 Aggregating data

6.3 Suppressing partial results and handling late events

6.4 Deduplicating data

6.5 Joining data

6.6 A comparison of Kafka Streams and Apache Flink

6.6.1 Watermarks

6.6.2 Checkpoints

6.7 Summary