Now, Kafka provides an ideal mechanism for storing consumer offsets. Consumers can commit their offsets in Kafka by writing them to a durable (replicated) and highly available topic. This article covers some internals of Offset management in Apache Kafka.
Offset in Kafka
The offset is a unique ID assigned to the partitions, which contains messages. The most important use is that it identifies the messages through ID, which are available in the partitions. In other words, it is a position within a partition for the next message to be sent to a consumer. A simple integer number which is actually used by Kafka to maintain the current position of a consumer. Kafka maintains two types of offsets, the current and committed offset.
Let’s first understand the current offset. Kafka sends some messages to us when we call a poll method. It is a pointer to the last record that Kafka has already sent to a consumer in the most recent poll. So, the consumer doesn’t get the same record twice and this is just because of the current offset.
Committed offset, the position that a consumer has confirmed about processing. In simple language, after receiving a list of messages, we want to process it. This process might be just storing them into Apache Hadoop Distributed File System (HDFS). Once we get the assurance that it has successfully processed the record, we may want to commit the offset. So, the committed offset is a pointer to the last record that has processed successfully.
Overview of Offset Management
A Kafka topic receives messages across a distributed set of partitions where they are stored. Each partition maintains the messages it has received in a sequential order where they are identified by an offset, also known as a position. Developers can take advantage of using offsets in their application to control the position of where their Spark Streaming job reads from, but it does require off management.
Managing offsets is very beneficial to achieve data continuity over the lifecycle of the streaming process. For example, upon shutting down the stream application or during an unexpected failure, offset ranges will be lost unless persisted in a non-volatile data store.
Storing offsets in external data stores.
Any external durable data store such as HBase, Kafka, HDFS, and ZooKeeper is used to keep track of which messages have already been processed.
It is worth mentioning that you can also store offsets in a storage system like HDFS. Storing it in HDFS is a less popular approach compared to the above options as HDFS has a higher latency compared to other systems like ZooKeeper and HBase. Additionally, writing offset ranges for each batch in HDFS can lead to the problem of a small file if not managed properly.