How to Set Up Kafka on AWS Cloud

  • Written By Awanish Kumar
  • 28/10/2020

In this blog, we outline the various steps involved in the installation and configuration of Apache Kafka on AWS cloud.

Configuring AWS EC2 Instances

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers. Amazon EC2’s simple web service interface allows you to obtain and configure capacity with minimal friction. It provides you with complete control of your computing resources and lets you run on Amazon’s proven computing environment.

We will be using a total of eight EC2 instances: three of type m5a.large and the remaining five of type t2.micro.

Kafka setup in AWS

Installing Apache Kafka

Apache Kafka is an open-source stream-processing software platform developed by the Apache Software Foundation. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

  1. Kafka configuration file on node KafkaServer1: /opt/kafka/config/server.properties
Kafka setup fig 1
  2. External and internal listener configurations on the KafkaServer1 node:
Kafka setup fig 2

Based on the above configuration, Kafka listens on port 9093 for external communication and on port 9092 for internal communication. We did not make these changes on KafkaServer2 and KafkaServer3: for our testing in the WHISHWORKS environment we used only the KafkaServer1 node, and the remaining KafkaServer2 and KafkaServer3 nodes are stopped most of the time.
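The dual-listener layout described above can be sketched in server.properties. This is a hypothetical reconstruction, not a copy of the screenshot: the listener names, the PLAINTEXT protocol, and the external hostname placeholder are assumptions; only the ports (9092 internal, 9093 external) and the private IP 172.31.28.4 come from this post.

```properties
# Hypothetical sketch of the listener section of server.properties.
# Listener names and the public DNS placeholder are assumptions.
listeners=INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9093
advertised.listeners=INTERNAL://172.31.28.4:9092,EXTERNAL://<ec2-public-dns>:9093
listener.security.protocol.map=INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
inter.broker.listener.name=INTERNAL
```

Splitting internal and external traffic this way lets clients outside the VPC reach the broker through the advertised public address while broker-to-broker replication stays on the private network.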

  3. Kafka scripts directory on node KafkaServer1: /opt/kafka/bin
Kafka setup fig 3

Setting up persistence DB on EC2: MongoDB

  1. Mongod configuration directory:
Kafka setup fig 4
  2. The only change we made in the above file, mongod.conf, is mapping bindIp to 0.0.0.0:
Kafka setup fig 5
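The bindIp change amounts to the net section of mongod.conf, which is a YAML file. The sketch below shows only that section, with everything else left at defaults:

```yaml
# mongod.conf is YAML. Only the bindIp line reflects the change described
# above; the port is the default shown in the post.
net:
  port: 27017
  bindIp: 0.0.0.0   # listen on all interfaces, not just localhost
```

Binding to 0.0.0.0 exposes mongod on every interface, so in practice access should be restricted through the instance's security group.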
  3. Mongod is listening on port 27017 on node KafkaServer1 (private IP: 172.31.28.4). To access mongod on KafkaServer1, just execute the command mongo (refer to the screenshot below):
Kafka setup fig 6

In the above screenshot we can see the collections auditdata, workitems and userAuthorisation, which are being used for the workflow engine solution.

Monitoring using Grafana

Prometheus: This is an open-source systems monitoring and alerting toolkit.

Features:

  • Multidimensional data model with time series data identified by metric name and key/value pairs.
  • PromQL, a flexible query language to leverage this dimensionality.
  • No reliance on distributed storage; single server nodes are autonomous.
  • Time series collection happens via a pull model over HTTP.
  • Pushing time series is supported via an intermediary gateway.
  • Targets are discovered via service discovery or static configuration.
  • Multiple modes of graphing and dashboarding support.

Components:

  1. Prometheus is installed in the directory /prometheus/prometheus-2.15.2.linux-amd64 on node KafkaServer1:
Kafka setup fig 7
  2. Prometheus is listening on port 9099 (refer to the screenshot below):
Kafka setup fig 8
  3. The Prometheus JMX agent, which we downloaded to pull metrics from the Kafka node, is in the location below:
Kafka setup fig 9
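Putting these pieces together, the scrape section of prometheus.yml might look like the sketch below. This is an assumption-laden sketch: the JMX exporter's listen port (7071 here) is not shown in this post; only Prometheus's own port 9099 comes from the text.

```yaml
# prometheus.yml — hypothetical scrape section. The JMX exporter port (7071)
# is an assumed value; the post only confirms Prometheus itself runs on 9099.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'kafka'
    static_configs:
      - targets: ['localhost:7071']   # broker metrics exposed by the JMX agent
```

The JMX agent itself is typically attached to the broker JVM with a -javaagent flag in KAFKA_OPTS, so Kafka restarts with the exporter already embedded.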

Grafana: An open-source visualization and analytics software. It allows you to query, visualize, alert on, and explore your metrics no matter where they are stored. In plain English, it provides you with tools to turn your time-series database (TSDB) data into beautiful graphs and visualizations.

  1. Below is the Grafana location:
Kafka setup fig 10
  2. Grafana is listening on port 3000 on node KafkaServer1 (refer to the screenshot below):
Kafka setup fig 11
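With Prometheus on port 9099 and Grafana on port 3000, the data source can be added through the Grafana UI or provisioned from a file. The sketch below is hypothetical; the file path and data source name are assumptions, and only the Prometheus port comes from this post.

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml — hypothetical
# provisioning file pointing Grafana at the Prometheus instance on port 9099.
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9099
    isDefault: true
```

Provisioning from a file means the data source survives a rebuild of the Grafana node without any manual clicks in the UI.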

Deployment Diagram

Kafka setup deployment diagram

Challenges

The major challenge we faced was when configuring the consumer:

  • When the application calls an external API, the time taken to receive the response exceeds the Kafka consumer's default settings.
  • The external API takes more than 5 minutes to send a response back, so the group rebalances even though the consumer is still running.
  • Kafka consumer has a configuration max.poll.records which controls the maximum number of records returned in a single call to poll() and its default value is 500.
  • max.poll.interval.ms – If poll() is not called before expiration of this timeout, then the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member.

To overcome this issue, we changed max.poll.interval.ms to 10 minutes and reduced max.poll.records from 500 to 25.

Accordingly, the consumer's request.timeout.ms also needs to be adjusted, and several rounds of testing were done to arrive at the final config values.
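The tuning above can be captured as a small configuration overlay. A minimal Python sketch, with a hypothetical helper name: the keys are the standard Kafka consumer config names, and how they are passed depends on your client library. request.timeout.ms is deliberately left out, since its final value came from testing and is not stated here.

```python
def tuned_consumer_config(base=None):
    """Overlay the rebalance-related tuning on a base consumer config dict."""
    config = dict(base or {})
    config.update({
        # Allow up to 10 minutes between poll() calls before the group
        # coordinator considers this consumer failed (default is 5 minutes).
        "max.poll.interval.ms": 10 * 60 * 1000,  # 600000
        # Fewer records per poll() so each batch finishes well inside the
        # poll interval even when the external API is slow (default is 500).
        "max.poll.records": 25,
        # request.timeout.ms would also be set here once testing fixes a value.
    })
    return config

cfg = tuned_consumer_config({"bootstrap.servers": "localhost:9092"})
print(cfg["max.poll.interval.ms"], cfg["max.poll.records"])
```

Keeping the overrides in one helper makes it easy to retest different values without hunting through the consumer setup code.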

If you would like to find out how to become a data-driven organisation with event streaming, Kafka and Confluent, then give us a call or email us at marketing@whishworks.com.

Other useful links:

What is Kafka? The Top 5 things you should know

WHISHWORKS Expert Kafka Services

WHISHWORKS Confluent Services
