6 Stateful transformations with Kafka Streams


This chapter covers

  • How Kafka Streams manages state
  • Aggregating records
  • Deduplicating data
  • Joining records of different topics
  • A comparison of Kafka Streams and Apache Flink

In the last chapter, we applied stateless stream processing to transform, filter, and route events from the analytics API one at a time. If you need to process multiple Kafka messages from one or more topics together, for instance, when joining the analytics data with records from your user database, you must maintain state across the processing of multiple events. This chapter covers that case and introduces you to stateful stream processing.

tip

This chapter does not contain the complete code of the Kafka Streams applications but shows only their topologies. You can add the examples to the application that you built in Chapter 5 or start from scratch using the project template from https://github.com/flippingbits/streaming-data-pipelines-book-streams-template. As discussed in the introduction to Chapter 5, please make sure that the topic online_shop.analytics.visitors holds data before you start implementing the examples from this chapter.

6.1 How does Kafka Streams manage state?

6.2 Aggregating data

6.3 Suppressing partial results and handling late events

6.4 Deduplicating data

6.5 Joining data

6.6 A comparison of Kafka Streams and Apache Flink

6.6.1 Watermarks

6.6.2 Checkpoints

6.7 Summary