When we talk about Kafka, it's all about messaging. Messages in Kafka are logically organized into topics. So, what is a topic?
A topic is a central abstraction in the Kafka ecosystem. It is a logical entity that spans the Kafka cluster. Topics are hosted by a daemon service called a message broker, and every message sent to a topic is written to the file system of the broker hosting it. Messages in a topic are stored as a time-ordered data stream that is only ever appended to. Hence, messages in Kafka are immutable: they cannot be changed once they have been sent to a topic. Each message is assigned a sequential ID called an offset, which is unique within the partition (introduced below) that holds the message.
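To make the idea of immutable messages and sequential offsets concrete, here is a minimal in-memory sketch of a single append-only log. The `Record` and `AppendOnlyLog` names are hypothetical and exist only for illustration; this is not Kafka's API, and a real broker writes to files on disk rather than to a Python list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Record:
    """An immutable record; frozen=True blocks attribute reassignment."""
    offset: int
    value: str


@dataclass
class AppendOnlyLog:
    """Toy stand-in for a broker-side log (hypothetical, not Kafka's API)."""
    _records: List[Record] = field(default_factory=list)

    def append(self, value: str) -> int:
        # The next offset is simply the current length of the log.
        offset = len(self._records)
        self._records.append(Record(offset, value))
        return offset

    def read(self, from_offset: int = 0) -> List[Record]:
        # Reading returns a copy of the tail; nothing is removed or changed.
        return self._records[from_offset:]


log = AppendOnlyLog()
first = log.append("order-created")
second = log.append("order-paid")
```

Here `first` and `second` are the offsets handed out by the log, and `log.read(1)` returns everything from offset 1 onward without consuming it, so any number of readers can read the same records.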
If you recall, in our Kafka demo post we created a topic named mytopic with 3 partitions. This is the right time to introduce partitions.
A partition in a Kafka topic physically holds the data received from producers. Each partition maintains a commit log file where the data routed to it is stored. A Kafka topic can have one or more partitions. Recall that in the demo post we also created mytopic with a replication factor of 3. A replication factor of 3 means that Kafka maintains 3 copies of each partition's data, ideally on three different message brokers. Partitions are a key reason Kafka achieves parallelism, high throughput, scalability and fault tolerance. A partition is the smallest unit of data in a topic, meaning it cannot be split any further. When a producer pushes a message without a key to a topic, the default partitioner spreads messages across the partitions (round robin in older clients; since Kafka 2.4 the client sticks to one partition per batch before moving on). When the message has a key, the partition is chosen from a hash of the key, so messages with the same key land in the same partition. Users can also plug in their own partitioner to implement a custom partitioning strategy.
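The routing decision described above can be sketched in a few lines. This is only an illustration under stated assumptions: `SimplePartitioner` is a hypothetical class, not part of any Kafka client, and it uses CRC32 as a stand-in hash, whereas Kafka's Java client hashes keys with murmur2.

```python
import itertools
import zlib
from typing import Optional


class SimplePartitioner:
    """Hypothetical partitioner: hashes keyed messages, round-robins unkeyed ones."""

    def __init__(self, num_partitions: int) -> None:
        if num_partitions < 1:
            raise ValueError("a topic needs at least one partition")
        self.num_partitions = num_partitions
        self._counter = itertools.count()

    def partition(self, key: Optional[bytes]) -> int:
        if key is None:
            # No key: spread messages evenly across the partitions.
            return next(self._counter) % self.num_partitions
        # Keyed: the same key always maps to the same partition.
        # CRC32 is a stand-in here, not the hash Kafka actually uses.
        return zlib.crc32(key) % self.num_partitions


partitioner = SimplePartitioner(num_partitions=3)
unkeyed = [partitioner.partition(None) for _ in range(6)]
keyed = [partitioner.partition(b"user-42") for _ in range(3)]
```

With 3 partitions, the unkeyed messages cycle through partitions 0, 1 and 2, while every message keyed `user-42` goes to one and the same partition. That is what preserves ordering per key.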
Each partition is immutable and works in append-only mode (at least from the client's perspective). Consumers subscribed to a topic can replay the commit log and recreate their state. Immutability works a little differently for compacted topics (log compaction); we will talk about compaction in upcoming posts.
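Replaying a log to recreate state amounts to folding over the records from the first offset to the last. The sketch below assumes a made-up event shape of `(account, amount)` pairs and a hypothetical `rebuild_balances` helper; a real consumer would read these records from a partition instead of a Python list.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical event shape: (account, amount). Names and values are illustrative.
Event = Tuple[str, int]


def rebuild_balances(log: Iterable[Event]) -> Dict[str, int]:
    """Apply every event in log order to recreate the current state."""
    balances: Dict[str, int] = {}
    for account, amount in log:
        balances[account] = balances.get(account, 0) + amount
    return balances


commit_log = [("alice", 100), ("bob", 50), ("alice", -30)]
state = rebuild_balances(commit_log)
```

Because the log itself never changes, replaying it a second time produces exactly the same state, which is what lets a consumer recover after a restart.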