Kafka Core Concepts
Apache Kafka is used primarily to build real-time data pipelines and streaming applications.
Topics
Similar to how databases have tables to organize data, Kafka uses the concept of topics to organize related messages. But unlike database tables, Kafka topics are not queryable. Instead, we use Kafka producers to send data to a topic and Kafka consumers to read data from it.
A topic is identified by its name. It can contain messages of any kind and in any format, and the sequence of all these messages is called a data stream.
Data in Kafka topics is deleted after one week by default (the default message retention period), but this retention period is configurable.
Kafka topics are immutable: once a message is written to a partition, it cannot be modified or individually deleted; data is only removed when it expires under the retention policy. This immutability keeps the data stream consistent and reliable.
Partitions
Topics are broken down into partitions. Each partition is an ordered, immutable sequence of messages that is continually appended to. Partitions allow Kafka to scale horizontally by distributing the load across multiple servers.
The number of partitions for a topic is defined at topic creation time, and the partitions are numbered from 0 to N-1, where N is the total number of partitions.
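As a rough sketch of creating a topic with a fixed partition count, here is what this might look like with the Java AdminClient; the topic name "orders", the broker address, and the retention override are placeholders:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // "localhost:9092" is a placeholder broker address
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions (numbered 0..2), replication factor 1
            NewTopic topic = new NewTopic("orders", 3, (short) 1);
            // Explicitly set the one-week retention mentioned above (retention.ms is in milliseconds)
            topic.configs(Map.of("retention.ms", "604800000"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```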
If a topic has more than one partition, Kafka guarantees the order of messages within a partition, but not across partitions.
Offsets
The offset is an integer value that Kafka adds to each message as it is written to a partition. It acts as a unique identifier for each message within that partition. Offsets are sequential and start from 0 for the first message in a partition.
The data in Kafka topics is deleted over time, but offsets are never reused: they are continually incremented in a never-ending sequence.
Producers
Applications that send data into topics are called Kafka producers.
A Kafka producer sends messages to a topic and messages are distributed across the topic’s partitions according to a mechanism such as key hashing or round-robin.
For a message to be successfully written into a Kafka topic, the producer must specify an acknowledgment level (acks):
- acks=0: The producer does not wait for any acknowledgment from the broker.
- acks=1: The producer waits for an acknowledgment from the partition's leader broker.
- acks=all: The producer waits for acknowledgments from all in-sync replicas of the partition.
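A minimal producer sketch in Java, assuming a local broker at localhost:9092 and a topic named "orders" (both placeholders); it sets acks=all and prints where each message landed:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AckProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Wait for all in-sync replicas to acknowledge each write
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-created"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();
                        } else {
                            System.out.printf("written to partition %d at offset %d%n",
                                    metadata.partition(), metadata.offset());
                        }
                    });
        } // closing the producer flushes any buffered messages
    }
}
```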
Message Key
Each event message contains an optional key and a value:
- If the key is not specified, the message is sent in a round-robin fashion to the partitions of the topic.
- If the key is specified, the message is sent to the partition determined by hashing the key; all messages with the same key will go to the same partition, ensuring that they are processed in order.
- A key can be anything that identifies a message - a string, a number, or even a complex object serialized to a byte array.
Kafka message keys are commonly used when message ordering is required for all messages sharing the same field (for example, all events for the same user ID).
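Continuing the producer sketch above (topic and key names are illustrative), sending several records with the same key routes them all to the same partition:

```java
// Reuses the `producer` from the sketch above. The default partitioner hashes
// the key (murmur2), so all three records land on the same partition, in order.
producer.send(new ProducerRecord<>("orders", "user-42", "order-created"));
producer.send(new ProducerRecord<>("orders", "user-42", "order-paid"));
producer.send(new ProducerRecord<>("orders", "user-42", "order-shipped"));
```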
Message Structure
- Key: identifier used to determine the partition, optional, serialized into binary format
- Value: the actual message content, optional, serialized into binary format
- Compression Type: the compression algorithm used for the message value (options: none, gzip, snappy, lz4, zstd)
- Headers: a list of optional message headers in the form of key-value pairs
- Partition + Offset: the message will receive a partition and offset when it is written to the topic; the combination of topic-partition-offset uniquely identifies the message.
- Timestamp: the time when the message was produced; it can be set by the producer or assigned automatically by Kafka.
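Most of these fields are visible on the Java client's ProducerRecord; partition and offset are assigned by Kafka on write, and compression is configured on the producer rather than per message. A fragment extending the earlier producer sketch, with placeholder names (it assumes imports for java.util.List, java.nio.charset.StandardCharsets, and org.apache.kafka.common.header.internals.RecordHeader):

```java
// Compression applies to the batches this producer sends (none/gzip/snappy/lz4/zstd)
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");

// topic, partition (null lets the partitioner decide), timestamp (null = now),
// key, value, headers; the broker assigns the final partition and offset on write
ProducerRecord<String, String> record = new ProducerRecord<>(
        "orders", null, null, "user-42", "order-created",
        List.of(new RecordHeader("trace-id", "abc-123".getBytes(StandardCharsets.UTF_8))));
producer.send(record);
```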
Message Serialization
Kafka messages are serialized into binary format (byte arrays) before being sent to the topic. Converting the key and value into bytes is necessary for efficient storage and transmission.
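To show what a serializer looks like, here is a minimal sketch of Kafka's Serializer interface; Kafka already ships StringSerializer and friends, so this is purely illustrative:

```java
import java.nio.charset.StandardCharsets;

import org.apache.kafka.common.serialization.Serializer;

// A minimal serializer that converts a String message into a byte array
// before it is sent to the topic.
public class SimpleStringSerializer implements Serializer<String> {
    @Override
    public byte[] serialize(String topic, String data) {
        return data == null ? null : data.getBytes(StandardCharsets.UTF_8);
    }
}
```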
Consumers
Applications that read data from topics are called Kafka consumers.
Consumers can read from one or more partitions at a time in Kafka, and data is read in order within each partition.
A consumer always reads data from a lower offset to a higher offset and cannot read data backwards.
By default, Kafka consumers only consume data that was produced after they first connected to Kafka. However, they can be configured to read from the beginning of the topic or from a specific offset.
Kafka consumers use the pull model to read data, meaning they request data from the broker when they are ready to process it, rather than being pushed data by the broker.
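A minimal consumer sketch in Java tying these points together: it uses the pull model via poll(), prints each record's partition and offset, and sets auto.offset.reset=earliest so a group with no committed offsets starts from the beginning of the topic. Broker address, group ID, and topic name are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");        // placeholder group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Start from the beginning of the topic when the group has no committed offset
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                // Pull model: the consumer asks the broker for records when it is ready
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```

With the default setting (latest), the consumer would instead see only messages produced after it first connects, as described above.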
Deserialization
Data being consumed must be deserialized from binary format back into its original format, which is done using a deserializer.
The serialization and deserialization format of a topic must not change during the lifetime of the topic. It’s considered best practice to create a new topic if the serialization format needs to change.
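The read-side counterpart to the serializer sketch above, again purely illustrative since Kafka ships StringDeserializer and friends:

```java
import java.nio.charset.StandardCharsets;

import org.apache.kafka.common.serialization.Deserializer;

// Converts the raw bytes read from the topic back into a String.
public class SimpleStringDeserializer implements Deserializer<String> {
    @Override
    public String deserialize(String topic, byte[] data) {
        return data == null ? null : new String(data, StandardCharsets.UTF_8);
    }
}
```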
Consumer Groups
A consumer group is a group of consumers that work together to consume messages from one or more topics. Each consumer in the group reads from a unique set of partitions, allowing for parallel processing of messages:
- All consumers in a group share the same group ID.
- A consumer can consume from multiple partitions, but each partition can only be consumed by one consumer in the group at a time.
- If a consumer fails or is removed from the group, Kafka will rebalance and assign its partitions to other consumers in the group.
- If there are more consumers than partitions, then some consumers will be idle and not receive any messages.
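To observe rebalances, the Java client lets you attach a ConsumerRebalanceListener when subscribing. A fragment assuming the consumer from the earlier sketch (already configured with a group.id):

```java
// Assumes imports for java.util.Collection, java.util.Collections,
// org.apache.kafka.clients.consumer.ConsumerRebalanceListener,
// and org.apache.kafka.common.TopicPartition.
consumer.subscribe(Collections.singletonList("orders"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Called before a rebalance takes partitions away from this consumer
        System.out.println("revoked: " + partitions);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Called after the group coordinator assigns this consumer its partitions
        System.out.println("assigned: " + partitions);
    }
});
```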
Consumer Offsets
Consumers regularly commit the offset of the latest processed message, known as the consumer offset, back to Kafka. This is not done for every message consumed (that would be inefficient); instead, commits happen periodically.
Kafka brokers use an internal topic called __consumer_offsets to store the offsets of messages that have been consumed by each consumer group.
This allows consumers to resume reading from the last committed offset in case of:
- Kafka client crashes
- A rebalance occurs
- A new consumer joins the group
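To make the commit point explicit, the earlier consumer sketch can be switched to manual commits; process() is a hypothetical placeholder for application logic:

```java
// Disable the periodic auto-commit so offsets are committed only after processing
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        process(record); // hypothetical application logic
    }
    // Synchronously commit the offsets returned by the last poll();
    // they are stored in the internal __consumer_offsets topic
    consumer.commitSync();
}
```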
Delivery Semantics for Consumers
At most once:
- Offsets are committed as soon as the message is received.
- If the processing goes wrong, the message is lost (it won't be read again).
At least once (usually preferred):
- Offsets are committed after the message is processed
- If the processing goes wrong, the message will be read again (it may be processed multiple times)
- This can lead to duplicate processing, so the application must handle idempotency.
Exactly once:
- This can be achieved for Kafka-topics-to-Kafka-topics workflows using the transactional API.
- For workflows from Kafka topics to external systems, effectively achieving exactly-once semantics requires an idempotent consumer.
In practice, at least once with idempotent processing is the most desirable and widely used delivery semantics.
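As a sketch of the transactional API mentioned above (transactional ID, topic, and broker address are placeholders), writes inside a transaction become visible atomically on commit:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // A transactional.id enables the transactional API (and implies idempotence)
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "order-pipeline-1"); // placeholder id

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("orders-enriched", "user-42", "order-enriched"));
                producer.commitTransaction(); // the write becomes visible atomically, or not at all
            } catch (Exception e) {
                producer.abortTransaction();
                throw e;
            }
        }
    }
}
```

A full Kafka-topics-to-Kafka-topics pipeline would also commit the consumer's offsets inside the same transaction via sendOffsetsToTransaction, so that reading and writing succeed or fail together.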
Brokers
A single Kafka server is called a Kafka broker.
An ensemble of Kafka brokers working together is called a Kafka cluster. A broker in a cluster is identified by a unique numeric ID.
If there are multiple brokers in a cluster, then partitions for a given topic will be distributed among the brokers evenly to achieve load balancing and scalability.
Every broker in the cluster has metadata about all other brokers:
- Any broker in the cluster is also called a bootstrap server
- A client can connect to any broker in the cluster to get metadata about the whole cluster
In practice, Kafka clients are commonly configured with multiple bootstrap servers to ensure high availability and fault tolerance.
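A fragment showing what this looks like in client configuration; the hostnames are placeholders:

```java
// Any one reachable broker is enough to bootstrap the connection and fetch
// cluster metadata; listing several keeps the client working if one is down.
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
        "broker1:9092,broker2:9092,broker3:9092");
```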