chapter fifteen

15 Kafka monitoring and alerting

This chapter covers

Ensuring Kafka’s performance and reliability
Key metrics to track for Kafka
Strategies for effective alerting
Monitoring challenges in various Kafka deployment environments

Although Kafka is designed to be fault-tolerant and is therefore robust against many kinds of errors, it is not invulnerable. After all, Kafka runs on physical hardware, which can fail. While Kafka can compensate for the failure of individual brokers or, depending on the size of the cluster, even several brokers, we still need to ensure that such failures are detected and resolved as quickly as possible. Otherwise, we risk the complete failure of our cluster, because each broker that goes down reduces the remaining fault tolerance.

So, what exactly do we mean when we talk about an error or failure? We mean any impairment of full functionality, which in the worst case could mean the complete unavailability of the system. To respond appropriately, we need to understand where exactly the problem lies and what is causing it. Particularly in today’s complex IT systems, answering the latter question can be difficult and may require lengthy detective work, because the root cause is often an unexpected side effect of other IT systems that otherwise function properly.

15.1 Infrastructure metrics

15.2 Broker metrics

15.2.1 Kafka server metrics

15.2.2 Kafka log metrics

15.2.3 Kafka network metrics

15.2.4 Kafka controller metrics

15.3 Client metrics

15.3.1 General client metrics

15.3.2 Producer metrics

15.3.3 Consumer metrics

15.3.4 Kafka Connect and Kafka Streams metrics

15.4 Alerting

15.4.1 From metrics to alerts

15.4.2 From alerts to problem solving

15.5 Kafka deployment environments and their monitoring challenges

15.5.1 Kafka on a company’s own hardware

15.5.2 Kafka on virtual machines

15.5.3 Kafka in the public cloud