11 Operating Kafka
This chapter covers
- The evolution of Kafka clusters
- Monitoring for maintaining cluster health
- Performance tuning strategies
- Disaster recovery and backup considerations
When Kafka is used as a managed service in the cloud, many operational concerns are abstracted away. Tasks like upgrades, monitoring, and broker management are handled by the provider, allowing teams to focus on building applications. Running Kafka on-premises is a different story. It demands deep operational expertise, from performing safe software upgrades to carefully tuning configurations and continuously monitoring cluster health.
In this chapter, we’ll explore what it takes to maintain a robust, self-managed Kafka cluster. You’ll learn how to:
- Safely perform hardware and software updates
- Add or remove brokers from the cluster
- Modify configurations at the broker, topic, and partition level
Understanding these maintenance tasks is essential for keeping your Kafka deployment resilient, efficient, and ready to scale.
This chapter is not intended to be a complete operational guide. Operational practices and tooling evolve rapidly and are often tightly coupled with specific environments and monitoring stacks. Instead, we'll focus on foundational principles and best practices that apply across most Kafka deployments.
11.1 Cluster evolution and upgrades
There are several important reasons to upgrade the software of a Kafka cluster: