Big Data messaging frameworks compared

10.12.2019

In the last decade, organisations have become reliant on multiple systems and applications to fulfil their business needs. To work effectively, these systems and applications must be able to communicate with each other in a secure and efficient way. Messaging frameworks have become a critical part of the big data stack for these data-driven organisations, but choosing the platform that best suits their needs is difficult.

There are currently three types of messaging frameworks:

Messaging Queue Frameworks – The traditional message queue paradigm, best used only when a fixed end-to-end messaging system is in place to support it.

Distributed Messaging Pub-Sub Frameworks – Publish–subscribe is a sibling of the message queue paradigm. This pattern provides greater network scalability and a more dynamic network topology, with a resulting decreased flexibility to modify the publisher and the structure of the published data.

Distributed Stream Processing Frameworks – Stream processing frameworks are runtime libraries which help developers write code to process streaming data, without dealing with lower-level streaming mechanics.

In this blog we give an in-depth overview of these three types of messaging frameworks and a comparison of the specific platforms available in today’s market.

Messaging Queue Frameworks

ActiveMQ / RabbitMQ / ZeroMQ / RocketMQ

  • These are earlier, traditional message brokers with more emphasis on queuing than on streaming.
  • They are built on point-to-point messaging models.
  • They are recommended only when there is a fixed end-to-end communication system.
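
To make the point-to-point model concrete, here is a minimal sketch using only Python's standard library rather than any of the brokers named above. The function name run_point_to_point and the "order-N" messages are hypothetical. It shows the defining property of a queue: several consumers compete for messages, and each message is delivered to exactly one of them.

```python
import queue
import threading


def run_point_to_point(messages, num_consumers=2):
    """Deliver each message to exactly one of several competing consumers."""
    q = queue.Queue()
    received = {i: [] for i in range(num_consumers)}

    def consumer(consumer_id):
        while True:
            msg = q.get()
            if msg is None:  # sentinel: no more work for this consumer
                break
            received[consumer_id].append(msg)

    workers = [
        threading.Thread(target=consumer, args=(i,))
        for i in range(num_consumers)
    ]
    for w in workers:
        w.start()
    for msg in messages:
        q.put(msg)
    for _ in workers:
        q.put(None)  # one sentinel per consumer so every thread stops
    for w in workers:
        w.join()
    return received


received = run_point_to_point([f"order-{n}" for n in range(6)])
# Flatten what every consumer saw; no message appears twice.
delivered = sorted(m for msgs in received.values() for m in msgs)
print(delivered)
```

Which consumer receives which message depends on thread scheduling, but the combined set is always the six original messages with no duplicates.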


Distributed Messaging Pub-Sub Frameworks 

Apache Kafka

  • Apache Kafka is a mature, stable, distributed and scalable publish-subscribe data streaming platform built on a simple model of producers and consumers, distributed brokers, message topics, append-only logs and distributed partitions.
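
The model described above (topics, append-only logs, partitions, producers and consumers) can be sketched as a toy in-memory structure. This is not the Kafka client API: the TopicLog class, the drain helper, the group names and the CRC32-based partitioning are all assumptions made for illustration. It shows how keyed messages land in partitions, and how two consumer groups each read the full stream independently by tracking their own offsets.

```python
import zlib
from collections import defaultdict


class TopicLog:
    """Toy in-memory topic: a fixed set of append-only partitions."""

    def __init__(self, num_partitions=3):
        self.partitions = [[] for _ in range(num_partitions)]
        # Next offset to read, tracked per (consumer group, partition).
        self.offsets = defaultdict(int)

    def produce(self, key, value):
        # Stable hash (an assumption) so one key always maps to one partition.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append((key, value))  # append only, never rewritten
        return p, len(self.partitions[p]) - 1    # (partition, offset)

    def poll(self, group, partition, max_records=10):
        start = self.offsets[(group, partition)]
        records = self.partitions[partition][start:start + max_records]
        self.offsets[(group, partition)] = start + len(records)
        return records


def drain(topic, group):
    """Read everything the given group has not yet consumed."""
    out = []
    for p in range(len(topic.partitions)):
        while True:
            batch = topic.poll(group, p)
            if not batch:
                break
            out.extend(batch)
    return out


topic = TopicLog(num_partitions=3)
for i in range(5):
    topic.produce(f"user-{i % 2}", f"event-{i}")

billing = drain(topic, "billing")  # hypothetical consumer group
audit = drain(topic, "audit")      # a second group sees the same records
print(len(billing), len(audit))
```

Because the log is never modified by reads, adding a new consumer group later costs the producers nothing, which is the main practical difference from the queue model.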

Apache Pulsar

  • Like Kafka, Apache Pulsar is an open-source, distributed and scalable pub-sub messaging system. It was originally created at Yahoo and is now part of the Apache Software Foundation.


Distributed Stream Processing Frameworks

Apache Samza

  • Apache Samza is a distributed and scalable real-time stream processing framework. Samza allows you to build stateful applications that process data in real time from multiple sources, including Apache Kafka.

Apache Flink

  • Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.
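
As a rough illustration of what "stateful computation over bounded and unbounded streams" means, the sketch below uses a plain Python generator rather than Flink's own API. The running_totals function and the sensor names are hypothetical. The operator keeps per-key state and emits an updated total for every event, and the same code works whether the input is a finite list or an endless stream.

```python
import itertools
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def running_totals(
    events: Iterable[Tuple[str, int]]
) -> Iterator[Tuple[str, int]]:
    """Emit the updated per-key total after every incoming event."""
    state: Dict[str, int] = defaultdict(int)  # keyed state held by the operator
    for key, amount in events:
        state[key] += amount
        yield key, state[key]


# Bounded input: a finite batch of events.
bounded = [("sensor-a", 2), ("sensor-b", 5), ("sensor-a", 3)]
results = list(running_totals(bounded))
print(results)

# Unbounded input: an endless generator; we only look at the first three outputs.
unbounded = (("sensor-a", 1) for _ in itertools.count())
first_three = list(itertools.islice(running_totals(unbounded), 3))
print(first_three)
```

A real engine adds what this sketch leaves out: distributing the keyed state across machines and making it survive failures.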

Apache Spark

  • Apache Spark is a unified analytics engine for large-scale data processing. It achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.

Apache Storm

  • Apache Storm is an open-source distributed real-time computation system. Apache Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing.


Comparison table:

Each framework is compared under four headings: Features & Deployment, Scalable, Highly Available, and Backup & Recovery.

Message Queues (ActiveMQ, MQTT, etc.)

  • Features & Deployment: Point-to-point messaging queues.
  • Scalable: Not built to be distributed or scalable.
  • Highly Available: Active models ensure high availability.

Apache Kafka

  • Features & Deployment: A distributed message broker platform supporting all the features required for data streaming solutions. Kafka runs as multi-node broker clusters managed by a ZooKeeper ensemble. It has become a data platform rather than just a streaming platform: data ingestion, storage, processing and analysis can all be done on it.
  • Scalable: Higher throughput and lower latency than ActiveMQ and RabbitMQ.
  • Highly Available: Supports active-active, active-passive and stretched-cluster deployments.
  • Backup & Recovery: The most popular, mature and widely supported framework.

Apache Pulsar

  • Features & Deployment: The immediate alternative to Kafka, as Pulsar supports both streaming and queuing.
  • Scalable: Higher throughput and lower latency than ActiveMQ and RabbitMQ.
  • Highly Available: Used as a data platform covering data processing and storage.
  • Backup & Recovery: The best alternative to Kafka.

Apache Samza

  • Features & Deployment: Event-based stream processing platform that runs on YARN containers.
  • Scalable: Highly scalable for processing, at up to 1 million messages/sec/machine.
  • Highly Available: Depends on multiple YARN resource managers and a ZooKeeper ensemble.
  • Backup & Recovery: Fault-tolerant replication model for processing.

Apache Flink

  • Features & Deployment: Distributed processing framework for bounded and unbounded streams that can run on YARN, Mesos or Kubernetes.
  • Scalable: Supports applications processing multiple trillions of events per day, applications maintaining multiple terabytes of state, and applications running on thousands of cores.
  • Highly Available: Follows Hadoop Job Manager high availability.
  • Backup & Recovery: Checkpoints state to HDFS and recovers from the last checkpoint after a failure.

Apache Spark

  • Features & Deployment: The Spark Streaming API enables streaming of data for processing, and can run distributed on YARN or Mesos, or stand-alone.
  • Scalable: 1 million messages/sec/machine.
  • Highly Available: Depends on multiple YARN resource managers and a ZooKeeper ensemble.
  • Backup & Recovery: Fault-tolerant replication model for processing.

Apache Storm

  • Features & Deployment: A real-time continuous data processing framework that runs on distributed, mostly Java, cluster nodes managed by ZooKeeper.
  • Scalable: One million 100-byte messages per second per node on hardware with 2x Intel E5645 @ 2.4 GHz processors and 24 GB of memory.
  • Highly Available: Cluster nodes can restart and rebalance the data stream to keep processing highly available.
  • Backup & Recovery: Different active-passive switch-back process.
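
The checkpoint-based recovery listed for Apache Flink above can be illustrated with a small simulation. This is a sketch under stated assumptions, not Flink's implementation: the CheckpointedCounter class is hypothetical, the snapshot is kept in memory rather than on HDFS, and the failure is simulated with an exception. The idea it shows is that state and input position are saved together, so after a crash the job restores both and replays only the events since the last checkpoint.

```python
import copy


class CheckpointedCounter:
    """Counts events per key and snapshots (state, offset) every N events."""

    def __init__(self, checkpoint_every=3):
        self.checkpoint_every = checkpoint_every
        self.state = {}
        self.offset = 0            # index of the next event to read
        self.checkpoint = ({}, 0)  # last snapshot (in memory here, an assumption)

    def process(self, events, fail_at=None):
        while self.offset < len(events):
            if fail_at is not None and self.offset == fail_at:
                raise RuntimeError("simulated worker failure")
            key = events[self.offset]
            self.state[key] = self.state.get(key, 0) + 1
            self.offset += 1
            if self.offset % self.checkpoint_every == 0:
                # Snapshot state and input position together.
                self.checkpoint = (copy.deepcopy(self.state), self.offset)

    def recover(self):
        # Discard in-flight work and roll back to the last snapshot.
        state, offset = self.checkpoint
        self.state = copy.deepcopy(state)
        self.offset = offset


events = ["a", "b", "a", "c", "a", "b", "c"]
job = CheckpointedCounter(checkpoint_every=3)
try:
    job.process(events, fail_at=5)  # crash before reading the sixth event
except RuntimeError:
    job.recover()
resumed_from = job.offset           # position restored from the checkpoint
job.process(events)                 # replay from the checkpoint to the end
print(resumed_from, job.state)
```

Events 3 and 4 are processed twice, once before the crash and once after recovery, but because the state was rolled back with the offset, the final counts are the same as in a run with no failure.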

 

Recommendations

  • Messaging Queue Frameworks - ActiveMQ / RabbitMQ / ZeroMQ / RocketMQ
    • These should only be chosen when there is a fixed point-to-point communication system with a standard messaging format.
    • These are not designed to be distributed and scalable.
  • Distributed Messaging Pub-Sub Frameworks - Kafka / Pulsar
    • Pub-sub frameworks are the most suitable for current data streaming challenges.
    • Kafka is the more popular of the two, with a large community and partner support from multiple technology providers.
    • It has a simple, flexible, scalable, highly available and fault-tolerant architecture.
  • Distributed Stream Processing Frameworks - Spark, Samza, Flink, Storm
    • Stream processing is an add-on feature for all distributed big data processing frameworks.
    • Apache Spark is the most popular and proven of these, with support from multiple partners on data platforms.
    • It offers simple, distributed in-memory processing.


Conclusions

Distributed message broker platforms such as Kafka are actively evolving into the nervous system that connects data platforms and data engines of every type.

 

If you would like to find out how to bring best practice to your Kafka deployment and optimise the performance and scalability of your Kafka clusters, then give us a call on +44 (0)203 475 7980 or email us at marketing@whishworks.com.
