chapter eleven

11 Operating Kafka

This chapter covers

The evolution of Kafka clusters
Monitoring for maintaining cluster health
Performance tuning strategies
Disaster recovery and backup considerations

When Kafka is used as a managed service in the cloud, many operational concerns are abstracted away. Tasks like upgrades, monitoring, and broker management are handled by the provider, allowing teams to focus on building applications. Running Kafka on-premises is a different story. It demands deep operational expertise, from performing safe software upgrades to carefully tuning configurations and continuously monitoring cluster health.

In this chapter, we’ll explore what it takes to maintain a robust, self-managed Kafka cluster. You’ll learn how to:

Safely perform hardware and software updates
Add or remove brokers from the cluster
Modify configurations at the broker, topic, and partition level

Understanding these maintenance tasks is essential for keeping your Kafka deployment resilient, efficient, and ready to scale.
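To make the last of these tasks concrete, here is a minimal sketch, assuming Kafka's `kafka-configs.sh` command-line tool is used to change settings on a running cluster. It only assembles the command lines for one topic-level and one broker-level change; it does not execute them. The helper name `alter_config_command`, the bootstrap address, the topic name `orders`, the broker id `1`, and the config values are illustrative assumptions, not recommendations.

```python
# Sketch: build (but do not run) kafka-configs.sh invocations.
# Assumptions: the bootstrap address, topic name, broker id, and config
# values are placeholders chosen for illustration only.

def alter_config_command(entity_type, entity_name, configs,
                         bootstrap_server="localhost:9092"):
    """Return the argument list for a kafka-configs.sh --alter call."""
    # Only topic- and broker-level entities are handled in this sketch.
    if entity_type not in ("topics", "brokers"):
        raise ValueError(f"unsupported entity type: {entity_type}")
    # --add-config takes comma-separated key=value pairs.
    pairs = ",".join(f"{key}={value}" for key, value in configs.items())
    return [
        "kafka-configs.sh",
        "--bootstrap-server", bootstrap_server,
        "--entity-type", entity_type,
        "--entity-name", str(entity_name),
        "--alter",
        "--add-config", pairs,
    ]

# Topic-level change: keep data on the hypothetical "orders" topic for one day.
topic_cmd = alter_config_command("topics", "orders", {"retention.ms": 86400000})

# Broker-level change: adjust log cleaner threads on hypothetical broker 1.
broker_cmd = alter_config_command("brokers", 1, {"log.cleaner.threads": 2})

print(" ".join(topic_cmd))
print(" ".join(broker_cmd))
```

Building the command as a list before running it keeps each change easy to review and log, which matters when the same change has to be applied consistently across environments.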

This chapter is not intended to be a complete operational guide. Operational practices and tooling evolve rapidly and are often tightly coupled with specific environments and monitoring stacks. Instead, we’ll focus on foundational principles and best practices that apply across most Kafka deployments.

11.1 Cluster evolution and upgrades

There are several important reasons to upgrade the software of a Kafka cluster:

11.1.1 Adding brokers and distributing the load

11.1.2 Removing a broker from the cluster

11.1.3 Upgrading clients

11.1.4 Data mobility

11.2 Monitoring a Kafka cluster

11.2.1 Types of metrics in monitoring

11.2.2 Kafka monitoring objects

11.2.3 Ownership of monitoring responsibilities

11.2.4 Monitoring stacks and tools

11.3 Performance tuning clinic

11.3.1 Balancing throughput and latency

11.3.2 Balancing data safety and up-time

11.4 Disaster recovery and failover

11.4.1 RTO/RPO engineering

11.5 Online resources

11.6 Summary