Kafka is a distributed, partitioned, replicated log service: a massively scalable pub/sub messaging system architected as a distributed transaction log. It provides a unified platform for handling all the real-time data feeds a large company might have. In our last article series on Kafka, we gave a brief introduction to it and covered its functions and architecture. Let’s now explore performance and scalability tuning for Kafka and the factors related to it.
Why Kafka is better than existing messaging systems like JMS or IBM MQSeries
In Kafka, there is no concept of a queue, and no ‘send/receive’ for putting messages on or getting messages from a queue. The only messaging model available is publish-subscribe. On the surface this paradigm is very similar in MQ/JMS and Kafka; the difference is under the covers, as we discuss next.
In MQ/JMS systems, once a message is read it is removed from storage and is no longer available. In a typical MQ/JMS consumer implementation, the messaging system deletes the message on receiving an ACK/Commit. If for some reason the message gets processed but the consumer fails before the ACK/Commit, the message is read more than once. Kafka, by contrast, retains messages even after all the subscribers have read them; the retention period is a configurable parameter. Kafka addresses the problem by way of message retention and state management based on the consumer offset.
Kafka implements topics as partitioned logs. This is one of the biggest differences between MQ/JMS and Kafka. Partitioning a topic is what leads to its high throughput (and parallelism).
In MQ/JMS there is no guarantee that messages will be received in the sequence in which they were sent. Kafka preserves order within a partition, so if a topic is configured with a single partition, messages are received in the same order in which they were sent.
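To make the ordering point concrete, here is a minimal sketch in plain Python. It is a toy model, not the Kafka client: the `choose_partition` and `publish` helpers are hypothetical names, and CRC32 is an assumption made for illustration (Kafka's own partitioner uses a different hash). It shows that a single partition keeps the full send order, while several partitions keep order only within each partition.

```python
import zlib
from collections import defaultdict


def choose_partition(key, num_partitions):
    # Stable hash of the key; CRC32 is an assumption made for this sketch.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def publish(messages, num_partitions):
    # Append each (key, value) pair to its partition, in arrival order.
    partitions = defaultdict(list)
    for key, value in messages:
        partitions[choose_partition(key, num_partitions)].append((key, value))
    return partitions


def values_for(partitions, key, num_partitions):
    # Values for one key, in the order a reader of its partition sees them.
    partition = partitions[choose_partition(key, num_partitions)]
    return [v for k, v in partition if k == key]


messages = [("user-1", "a"), ("user-2", "b"), ("user-1", "c"), ("user-2", "d")]

single = publish(messages, 1)   # one partition: global order is preserved
multi = publish(messages, 3)    # three partitions: order holds per key only

print(single[0])
print(values_for(multi, "user-1", 3), values_for(multi, "user-2", 3))
```

With one partition, the log is identical to the send order. With three, messages for the same key still arrive in order, but nothing is promised about the relative order of different partitions.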
In Kafka, the consumer specifies the offset from which to read messages in the log as part of each fetch. This feature is very different from MQ/JMS messaging systems, where messages are read off the queue/topic strictly First In First Out (FIFO). With offset-based control, the consumer can also re-read the same message, which is not possible in MQ/JMS. This rewinding mechanism can be very handy in some situations.
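The offset-based read and rewind can be sketched with a toy in-memory log. The `PartitionLog` and `OffsetConsumer` classes below are hypothetical and only imitate the idea; they are not the Kafka consumer API.

```python
class PartitionLog:
    """Toy append-only log: records stay in place after being read."""

    def __init__(self):
        self._records = []

    def append(self, value):
        self._records.append(value)
        return len(self._records) - 1  # offset of the new record

    def fetch(self, offset, max_records=10):
        return self._records[offset:offset + max_records]


class OffsetConsumer:
    """Toy consumer that tracks its own read position (offset)."""

    def __init__(self, log):
        self.log = log
        self.position = 0

    def poll(self, max_records=10):
        batch = self.log.fetch(self.position, max_records)
        self.position += len(batch)
        return batch

    def seek(self, offset):
        # Rewind (or skip ahead) by moving the read position.
        self.position = offset


log = PartitionLog()
for value in ["m0", "m1", "m2", "m3", "m4"]:
    log.append(value)

consumer = OffsetConsumer(log)
first = consumer.poll(max_records=3)   # reads m0, m1, m2
consumer.seek(1)                       # rewind to offset 1
replay = consumer.poll(max_records=2)  # re-reads m1, m2
print(first, replay)
```

Because reading never deletes anything from the log, rewinding is just a matter of changing a number on the consumer side.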
Another difference is load balancing. In MQ/JMS, load balancing required the messaging system to be designed around some clustering mechanism, and the onus of distributing the load across the cluster members was on the producer sending the messages. In Kafka, the producing client can find out which server leads each partition. This allows the client to send messages to the appropriate server (and partition), thus distributing the message load across the cluster members.
The concept of message replication was not included in traditional MQ/JMS implementations. Some systems added it over time, but those replication features were often left unused in favour of simplicity. In Kafka, as described earlier, the messages are replicated (leader-followers) for each topic’s partitions across a configurable number of servers. This inherently leads to an architecture that provides automatic failover to a replica, and thus high availability.
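As a rough illustration of leader-follower failover, consider the sketch below. It is heavily simplified: the `elect_leader` function and its "first live replica wins" rule are assumptions for this example, whereas real Kafka picks a new leader from the replicas that are in sync.

```python
def elect_leader(replicas, alive_brokers):
    # Return the first replica in the list that is still alive, or None.
    for broker in replicas:
        if broker in alive_brokers:
            return broker
    return None


# One partition replicated on brokers 1, 2 and 3 (replication factor 3).
replicas = [1, 2, 3]

leader_before = elect_leader(replicas, {1, 2, 3})  # broker 1 leads
leader_after = elect_leader(replicas, {2, 3})      # broker 1 failed
no_leader = elect_leader(replicas, set())          # every replica is down
print(leader_before, leader_after, no_leader)
```

As long as at least one replica survives, the partition keeps a leader, which is where the high availability comes from.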
Kafka Performance Optimization
When tuning Kafka performance, there are a few configuration parameters to consider. Among the most important are the ones that control the disk flush rate.
Dividing these configurations by component, let’s look at the Producer first. The most important configurations to take care of on the Producer side are:
- Batch size
- Sync or Async
On the Consumer side, the important configuration is:
- Fetch size
It is often unclear what batch size will be optimal. A large batch size may be great for high throughput, but it comes with latency issues. In other words, latency and throughput trade off against each other: tuning for one usually costs some of the other.
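A back-of-the-envelope calculation shows the trade-off. The model below is an assumption made for illustration: messages of a fixed size arrive at a steady rate, and a batch is sent only once it is full. The function names and numbers are hypothetical.

```python
def msgs_per_batch(batch_bytes, msg_bytes):
    # How many whole messages fit in one batch (at least one).
    return max(1, batch_bytes // msg_bytes)


def batch_fill_time_ms(batch_bytes, msg_bytes, msgs_per_sec):
    # Time the first message waits while the rest of the batch arrives.
    return 1000.0 * msgs_per_batch(batch_bytes, msg_bytes) / msgs_per_sec


def requests_per_sec(batch_bytes, msg_bytes, msgs_per_sec):
    # Fewer, larger requests mean less per-request overhead on the broker.
    return msgs_per_sec / msgs_per_batch(batch_bytes, msg_bytes)


# Assumed workload: 1,000-byte messages arriving at 1,000 messages/second.
for batch_bytes in (16000, 64000):
    wait = batch_fill_time_ms(batch_bytes, 1000, 1000)
    reqs = requests_per_sec(batch_bytes, 1000, 1000)
    print(batch_bytes, "bytes:", wait, "ms wait,", reqs, "requests/s")
```

Under these assumptions, quadrupling the batch size cuts the number of requests to a quarter, but the first message in each batch waits four times as long before it is sent.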
Throughput & Configurations for Tuning Kafka
Figure: Kafka Performance Tuning Graph
Kafka tuning involves two important metrics: latency and throughput. Latency measures how long it takes to process one event, whereas throughput measures how many events arrive within a specific amount of time. Most systems are optimized for one or the other. Apache Kafka balances both. A well-tuned Kafka system has just enough brokers to handle topic throughput, given the latency required to process information as it is received.
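The two metrics can be computed from simple timestamps. The sketch below assumes each event is recorded as an (arrival time, completion time) pair in seconds; the function name and sample data are hypothetical.

```python
def latency_and_throughput(events):
    # events: list of (arrived_at, done_at) timestamps in seconds.
    latencies = [done - arrived for arrived, done in events]
    avg_latency = sum(latencies) / len(latencies)

    # Throughput: events handled per second over the observed window.
    window = max(done for _, done in events) - min(arr for arr, _ in events)
    throughput = len(events) / window
    return avg_latency, throughput


events = [(0.0, 0.5), (1.0, 1.5), (2.0, 2.5), (3.0, 4.0)]
avg_latency, throughput = latency_and_throughput(events)
print(avg_latency, "s average latency,", throughput, "events/s")
```

Note that the two numbers are independent: the last event doubles its latency without changing how many events arrive per second.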
a. Tuning Kafka Producers
As we know, Kafka uses a publish-subscribe model. When our producer calls send(), the result returned is a future. That future offers methods to check the status of the record while it is in flight. The producer collects records into batches, and once a batch is ready, it sends the batch to the broker. The broker waits for an event, receives it, and then responds that the transaction is complete.
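The send-then-future flow can be imitated with Python's standard `concurrent.futures.Future`. The `ToyProducer` below is a hypothetical stand-in, not the Kafka producer: its `flush()` plays the role of the broker acknowledging a batch, and the "offset" it returns is simply a counter.

```python
from concurrent.futures import Future


class ToyProducer:
    """Toy producer: send() returns a future that completes on flush()."""

    def __init__(self):
        self._pending = []
        self._next_offset = 0

    def send(self, record):
        future = Future()
        self._pending.append((record, future))
        return future  # caller can check done() or wait on result()

    def flush(self):
        # Pretend the broker acknowledged the whole batch.
        for _record, future in self._pending:
            future.set_result(self._next_offset)
            self._next_offset += 1
        self._pending = []


producer = ToyProducer()
future = producer.send("hello")
done_before = future.done()   # still buffered, not acknowledged yet
producer.flush()
done_after = future.done()    # acknowledged
offset = future.result()
print(done_before, done_after, offset)
```

The caller is free to keep sending while earlier futures are still pending, which is what makes asynchronous sending fast.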
Two parameters are particularly important when tuning Kafka producers for latency and throughput:
i. Batch Size: batch.size measures batch size in total bytes rather than in number of messages. That is, it controls how many bytes of data to collect before sending messages to the Kafka broker.
ii. Linger Time: linger.ms sets the maximum time to buffer data in asynchronous mode before sending it.
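How the two settings interact can be sketched with a toy batcher driven by a simulated clock. Only the configuration keys `batch.size` and `linger.ms` are real Kafka names. The `ToyBatcher` class, its values and its flush rule (send as soon as the buffered bytes reach the limit) are simplifying assumptions, not the producer's actual implementation.

```python
# Illustrative values only; the keys are the real producer setting names.
producer_config = {"batch.size": 30, "linger.ms": 10}


class ToyBatcher:
    def __init__(self, batch_size, linger_ms):
        self.batch_size = batch_size
        self.linger_ms = linger_ms
        self._buffer = []
        self._bytes = 0
        self._first_ts = None
        self.sent = []  # list of (reason, records) for each batch sent

    def send(self, record, now_ms):
        self.flush_if_due(now_ms)
        if self._first_ts is None:
            self._first_ts = now_ms  # linger clock starts at first record
        self._buffer.append(record)
        self._bytes += len(record)
        if self._bytes >= self.batch_size:
            self._flush("size")

    def flush_if_due(self, now_ms):
        # Send a partial batch once it has waited linger_ms.
        if self._buffer and now_ms - self._first_ts >= self.linger_ms:
            self._flush("linger")

    def _flush(self, reason):
        self.sent.append((reason, self._buffer))
        self._buffer, self._bytes, self._first_ts = [], 0, None


batcher = ToyBatcher(producer_config["batch.size"], producer_config["linger.ms"])
for t in (0, 1, 2):
    batcher.send(b"x" * 10, now_ms=t)  # third record fills the batch
batcher.send(b"x" * 10, now_ms=3)      # starts a new, partial batch
batcher.flush_if_due(now_ms=20)        # linger time has expired
summary = [(reason, len(records)) for reason, records in batcher.sent]
print(summary)
```

The first batch goes out because it is full; the second goes out because it has waited long enough. Raising either setting allows bigger batches at the cost of extra delay.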
b. Tuning Kafka Brokers
As we know, topics are divided into partitions, and each partition has a leader. Even with multiple replicas, writes to a partition go to its leader. So if the leaders are not balanced properly across brokers, one broker might be overworked compared to the others.
Depending on our system and how critical our data is, we want to be sure that we have sufficient replication sets to preserve our data. It is recommended to start with one partition per physical storage disk and one consumer per partition.
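A quick way to spot unbalanced leaders is to count how many partitions each broker leads. The sketch below works on a hypothetical partition-to-leader mapping; the function names and the imbalance measure are assumptions for illustration.

```python
from collections import Counter


def leader_counts(partition_leaders, brokers):
    # Count led partitions per broker, including brokers that lead none.
    counts = Counter({broker: 0 for broker in brokers})
    counts.update(partition_leaders.values())
    return dict(counts)


def imbalance_ratio(counts):
    # Busiest broker's leader count divided by the ideal even share.
    ideal = sum(counts.values()) / len(counts)
    return max(counts.values()) / ideal


# Hypothetical layout: partition id -> broker id of its leader.
partition_leaders = {0: 1, 1: 1, 2: 1, 3: 2, 4: 1, 5: 3}
counts = leader_counts(partition_leaders, brokers=[1, 2, 3])
ratio = imbalance_ratio(counts)
print(counts, ratio)
```

Here broker 1 leads four of the six partitions, twice its fair share, so it handles most of the write traffic while the other two sit mostly idle.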
c. Tuning Kafka Consumers
Kafka consumers can create throughput issues. Ideally, the number of consumers for a topic equals the number of partitions, so we need enough partitions to handle all the consumers needed to keep up with the producers.
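To see why the partition count caps consumer parallelism, recall that within a consumer group each partition is read by only one consumer. The round-robin `assign_round_robin` function below is a hypothetical sketch of such an assignment, not Kafka's actual assignor.

```python
def assign_round_robin(partitions, consumers):
    # Hand out partitions one at a time, cycling through the consumers.
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

two = assign_round_robin(partitions, ["c1", "c2"])
five = assign_round_robin(partitions, ["c1", "c2", "c3", "c4", "c5"])
idle = [consumer for consumer, parts in five.items() if not parts]
print(two)
print(five, "idle:", idle)
```

With four partitions, a fifth consumer gets nothing to read. Adding consumers beyond the partition count does not add throughput, so the partitions have to be there first.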
Consumers in the same consumer group split the partitions among them. Hence, adding more consumers to a group can enhance performance, while adding more consumer groups does not affect performance. While reading from a partition, we can also mark the last point up to which we have read. This gives us a checkpoint from which to move forward without having to reread prior data if we have to go back and locate missing data. If we set the checkpoint watermark for every event, we will never lose a message, but doing so significantly impacts performance.
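The cost of checkpointing can be estimated with simple arithmetic. The sketch below assumes a consumer that commits its position after every `commit_every` messages and then crashes; the function name and numbers are hypothetical.

```python
def commits_and_replay(processed_before_crash, commit_every):
    # Number of checkpoints written before the crash.
    commits = processed_before_crash // commit_every
    # Messages processed after the last checkpoint must be read again.
    last_committed = commits * commit_every
    replayed = processed_before_crash - last_committed
    return commits, replayed


# Assume the consumer crashes after processing 1,037 messages.
per_event = commits_and_replay(1037, commit_every=1)
per_hundred = commits_and_replay(1037, commit_every=100)
print("commit every message:", per_event)
print("commit every 100:", per_hundred)
```

Checkpointing every message means nothing has to be re-read after the crash, but it costs over a thousand commits. Checkpointing every hundred messages costs only ten commits, at the price of re-reading 37 messages on restart.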
We have covered the main concepts of Kafka performance tuning in this article. The next article will cover the role of an offset and how to monitor the health of a Kafka cluster.