Running Zookeeper in replicated mode (for Apache Kafka)

  • Written By Kumar Raju Datla
  • 24/08/2020

In this blog, we will explore the recommendations for running Zookeeper in replicated mode and why this is the best method for using Zookeeper within your Apache Kafka implementation.

Zookeeper is a distributed, open-source coordination service developed by Apache. It acts as a centralised service that maintains naming and configuration data, providing flexible and robust synchronisation within distributed systems.

Kafka uses Zookeeper to manage service discovery for the Kafka brokers that form the cluster. Zookeeper notifies Kafka of topology changes, so each node in the cluster knows when a new broker has joined, a broker has died, a topic has been removed or a topic has been added, and so on.

In summary, Zookeeper provides an in-sync view of the Kafka Cluster configuration.
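
For context, each Kafka broker finds the ensemble through the zookeeper.connect setting in its server.properties. A minimal sketch, assuming three Zookeeper hosts named zk1, zk2 and zk3 (the hostnames and the optional /kafka chroot path are illustrative, not prescriptive):

```properties
# server.properties (per broker) - list every node in the Zookeeper ensemble
# so the broker can fail over if one Zookeeper node is unreachable.
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181/kafka
# How long the broker waits for a Zookeeper connection before giving up.
zookeeper.connection.timeout.ms=18000
```

Listing all ensemble members (rather than a single node) is what lets the broker keep its coordination session alive when individual Zookeeper nodes go down.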

Why is Zookeeper important for the Kafka Cluster?

Kafka relies on Zookeeper for the following key metadata:

1. Topic Registration Info
2. Partition State Info
3. Broker Registration Info
4. Controller Epoch & Registration
5. Consumer Registration, Owner, Offset Info
6. Reassign Partitions
7. Preferred Replica Election

Note: For more granular configuration details, please follow this link.
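
To make these items concrete: in a Zookeeper-based Kafka deployment this metadata lives under well-known znodes such as /brokers/ids, /brokers/topics, /controller and /controller_epoch. A read-only sketch using the Python kazoo client (our choice for illustration; the ensemble address is an assumption):

```python
# Sketch: peeking at Kafka's metadata znodes with the kazoo Zookeeper client.
# The ensemble address is an assumption; adjust it to your deployment.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

print(zk.get_children("/brokers/ids"))     # broker registration info (live broker ids)
print(zk.get_children("/brokers/topics"))  # topic registration info

data, _stat = zk.get("/controller")        # which broker is the active controller
print(data.decode())
data, _stat = zk.get("/controller_epoch")  # controller epoch
print(data.decode())

zk.stop()
```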

Why should you run Zookeeper in replicated mode?

Running Zookeeper in standalone mode is convenient for evaluation, development and testing, but in production you should run Zookeeper in replicated mode. A single-node Zookeeper is not fault-tolerant, so this type of setup is not recommended as a production coordination service.

When running in replicated mode, you are working with a replicated group of servers in the same application – this is called a quorum.

A quorum is necessary for Zookeeper's own service and, in turn, for the high availability of its clients – in this case, the Kafka cluster's coordination service. To qualify as a reliable coordination service with high availability and a minimum level of fault tolerance, Zookeeper has to form a healthy ensemble of nodes.
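
For reference, replicated mode is enabled by listing every ensemble member in each node's zoo.cfg. A minimal sketch for a 3-node ensemble, assuming hostnames zk1 to zk3 (paths and timings are illustrative defaults):

```properties
# zoo.cfg - identical on every node in the ensemble
tickTime=2000        # base time unit in milliseconds
initLimit=10         # ticks a follower may take to connect and sync to the leader
syncLimit=5          # ticks a follower may lag behind the leader
dataDir=/var/lib/zookeeper
clientPort=2181
# Ensemble members: server.<myid>=<host>:<peer-port>:<leader-election-port>
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
```

Each node additionally needs a myid file in its dataDir containing its own server number (1, 2 or 3), so it knows which server.N entry it is.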

To give you an idea of how many nodes you should run, we have outlined below how the number of nodes affects the tolerance level and high-availability status of a Kafka Cluster:

| No. of Zookeeper Nodes | Is this a Quorum? | Zookeeper's Tolerance | Kafka High Availability | Reasoning |
|---|---|---|---|---|
| 1 | No | 0 | No | Kafka is down if the single Zookeeper node goes down. |
| 2 | No | 0 | No | A majority of 2 nodes is both of them, so the ensemble tolerates no failures. Kafka is down if either of the two Zookeeper nodes is down. |
| 3 | Yes | 1 | Yes | A Zookeeper quorum of 3 is the minimum for HA, with a tolerance of 1 node failure. The Zookeeper leader coordinates and maintains the Kafka Cluster state correctly until it goes offline. When any node in a 3-node quorum is taken down for an upgrade, further tolerance is zero: the Kafka cluster stays up on the remaining 2 Zookeeper nodes, but one more failure leaves it with no coordination service, meaning no changes in the Kafka cluster are coordinated and no further DR events or failures can be tolerated. |
| 4 | Yes | 1 | Yes | The same as a 3-node quorum; the 4th node adds no advantage. |
| 5 | Yes | 2 | Yes | Tolerance is 2: 1 node can be down for a rolling upgrade and 1 other node for a DR event / outage / failure. At any time, Zookeeper HA is available with an average maintenance and DR tolerance of 2 nodes down. |
| 5+ | Not recommended | 3+ | Can be highly available, but not performant | A larger ensemble increases coordination overhead within Zookeeper itself, adding latency for the coordinating clients' cluster (Kafka). |
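
The tolerance column follows directly from the majority rule: an ensemble of n nodes needs floor(n/2) + 1 nodes up to keep a quorum, so it tolerates the loss of the rest. A quick sketch of the arithmetic behind the table:

```python
# Quorum math behind the table: a Zookeeper ensemble of n nodes needs a
# strict majority (floor(n/2) + 1) to stay available.
def zk_tolerance(n: int) -> int:
    quorum = n // 2 + 1   # smallest majority of n
    return n - quorum     # nodes that may fail while a majority survives

for n in (1, 2, 3, 4, 5, 6, 7):
    print(f"{n} nodes -> quorum {n // 2 + 1}, tolerates {zk_tolerance(n)} failure(s)")
# 3 and 4 nodes both tolerate 1 failure; 5 and 6 both tolerate 2 - which is
# why even-sized ensembles add cost without adding tolerance.
```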

What does the optimal 3 nodes Zookeeper quorum look like in Kafka?

  • A Zookeeper quorum of 3 is the minimum for HA, with a tolerance of 1 node failure.
  • The Zookeeper leader coordinates and maintains the Kafka Cluster state correctly until it goes offline.
  • When any Zookeeper node in a 3-node quorum is taken down for an upgrade, further tolerance is zero: only two Zookeeper nodes remain running in the Kafka cluster, and one more failure leaves the cluster with no coordination service.
  • Without a Zookeeper coordination service, no changes in the Kafka cluster are coordinated, and no further DR events or failures can be tolerated by the Zookeeper and Kafka cluster services.

When should 5 nodes Zookeeper quorum be used?

  • A Zookeeper quorum of 5 is the minimum recommended for production deployments: its average tolerance of 2 nodes down keeps the Zookeeper service highly available for Kafka cluster coordination.
  • Operational scenarios:
    • Expected outage – 1 node can be taken down for planned rolling upgrades or patches (see the role-check sketch below).
    • Unexpected outage – 1 further node can be lost to an unexpected DR event / outage / failure.
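
During a rolling upgrade it helps to know each node's role before taking it down (restart followers first, the leader last). A sketch using Zookeeper's built-in srvr four-letter command over a plain socket; on Zookeeper 3.5+ this assumes srvr is whitelisted via 4lw.commands.whitelist, and the hostnames are illustrative:

```python
# Sketch: ask each ensemble member for its role via the 'srvr' four-letter command.
import socket

def zk_mode(host: str, port: int = 2181) -> str:
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b"srvr")                # Zookeeper replies, then closes
        reply = sock.makefile().read()
    for line in reply.splitlines():
        if line.startswith("Mode:"):         # 'leader', 'follower' or 'standalone'
            return line.split(":", 1)[1].strip()
    return "unknown"

for host in ("zk1", "zk2", "zk3", "zk4", "zk5"):   # assumed 5-node ensemble
    print(host, zk_mode(host))
```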


For more information about the Confluent platform Zookeeper deployments, please check the following Confluent reference documentation:

 Multi-node setup of Zookeeper for Kafka 

Running Zookeeper in production 

Final Thoughts

Having explored running Zookeeper within the Kafka Cluster in replicated mode and given our recommendations, we hope your questions around the optimal number of Zookeeper nodes were answered. If you have any other questions, as a preferred Confluent partner and Kafka expert, we can help – simply contact us here.

If you would like to find out how to become a data-driven organisation with event streaming, Kafka and Confluent, then give us a call or email us at marketing@whishworks.com.

Other useful links:

Journey to the event-driven business with Kafka

WHISHWORKS Expert Kafka Services

WHISHWORKS Confluent Services
