In a Kafka message, the key is not a mandatory field, but it is required for log compaction and for grouping related records together. The value is where the actual content/payload goes. The following picture shows the structure of a Kafka message:
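The shape of a message can be sketched as a small data class. This is only an illustration of the fields a record carries (the names mirror the Java client's ProducerRecord), not a real client type:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Record:
    """Illustrative sketch of a Kafka message; not the real client API."""
    topic: str
    value: bytes                      # the actual content/payload
    key: Optional[bytes] = None       # optional, but needed for compaction/grouping
    headers: dict = field(default_factory=dict)

# A keyless record is perfectly valid:
r = Record(topic="orders", value=b'{"id": 42}')
print(r.key is None)   # True
```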
In the Kafka ecosystem, the producer is the component responsible for producing and pushing messages to Kafka topics. A producer performs the following actions:
- First: The producer prepares the message to be sent. The message is sent as an instance of ProducerRecord.
- Second: The producer fetches cluster metadata, which tells it the available partitions. It then computes the partition for the given ProducerRecord. Out of the box, a DefaultPartitioner is used: records with the same key are routed to the same partition, while records with no key are distributed in a round-robin fashion. You can provide your own partitioner by implementing the Partitioner interface, if you wish to.
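The partitioning logic above can be sketched in a few lines. Note this is a dependency-free approximation: the real DefaultPartitioner hashes keys with murmur2, while the sketch uses crc32 just to stay self-contained:

```python
import zlib
from itertools import count
from typing import Optional

NUM_PARTITIONS = 6
_round_robin = count()  # shared counter for keyless records

def choose_partition(key: Optional[bytes]) -> int:
    """Sketch of DefaultPartitioner-style routing (crc32 stands in for murmur2)."""
    if key is not None:
        # Same key always hashes to the same partition.
        return zlib.crc32(key) % NUM_PARTITIONS
    # No key: spread records across partitions round-robin.
    return next(_round_robin) % NUM_PARTITIONS

print(choose_partition(b"user-1") == choose_partition(b"user-1"))  # True
```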
- Third (optimisation): The RecordAccumulator batches ProducerRecords per topic-partition – meaning all the messages for topic-A that are to be sent to partition-0 are batched together. Batching is controlled by batch size or linger time; based on this configuration, the RecordAccumulator hands batches to the Sender for dispatch. The Sender knows which batches belong to which broker and partition, as this was already deduced. A leader is always chosen to write data, and it is the leader's responsibility to replicate and acknowledge based on configuration. The Sender groups the batches by broker and starts sending.
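The size-or-linger batching behaviour can be modelled with a toy accumulator. Class and method names here are illustrative, not the real client API:

```python
import time

class Accumulator:
    """Toy model of RecordAccumulator: a batch is dispatched when it is
    full (batch_size) or old enough (linger_ms), whichever comes first."""
    def __init__(self, batch_size=3, linger_ms=100):
        self.batch_size = batch_size
        self.linger_s = linger_ms / 1000
        self.batches = {}   # (topic, partition) -> (first_append_time, [records])
        self.sent = []      # what the "Sender" would dispatch, grouped per partition

    def append(self, topic, partition, record):
        key = (topic, partition)
        _, batch = self.batches.setdefault(key, (time.monotonic(), []))
        batch.append(record)
        self._maybe_send(key)

    def _maybe_send(self, key):
        first_ts, batch = self.batches[key]
        full = len(batch) >= self.batch_size
        expired = time.monotonic() - first_ts >= self.linger_s
        if full or expired:
            self.sent.append((key, self.batches.pop(key)[1]))

acc = Accumulator(batch_size=2)
acc.append("topic-A", 0, "m1")   # buffered: batch not yet full
acc.append("topic-A", 0, "m2")   # batch full -> dispatched
print(acc.sent)  # [(('topic-A', 0), ['m1', 'm2'])]
```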
- Fourth: Before we jump to the next component, let's understand what a TopicPartition is. The following picture should help:
- The consumer is responsible for consuming data from Kafka topics and processing it. Just as the producer pushes data to leaders, the consumer also consumes only from the leader of a partition. It bootstraps by connecting to a Kafka broker to fetch metadata, just like the producer does. Consumer groups help increase parallelism while processing messages from topics: each consumer in a group is assigned partitions and consumes from the leaders of those partitions. If there are more partitions than consumers, the load is distributed evenly; if there are more consumers than partitions, some consumers will be idle. The group coordinator decides who reads from where. Each consumer tracks its reading position using an offset. Offsets are maintained in an internal topic (__consumer_offsets), so if a consumer crashes, it comes back and restarts from where it left off.
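The partitions-to-consumers distribution described above can be sketched with a simple round-robin-style assignment (the real coordinator supports several strategies, such as range and cooperative-sticky; this is only the idea):

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over consumers round-robin; with more consumers
    than partitions, the extras end up with nothing to read (idle)."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 6 partitions, 2 consumers: load spread evenly (3 each)
print(assign_partitions(list(range(6)), ["c1", "c2"]))
# 2 partitions, 3 consumers: c3 stays idle
print(assign_partitions(list(range(2)), ["c1", "c2", "c3"]))
```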
- A broker is a simple JVM instance that participates in a cluster. It stores partitions of topics as commit logs and acts as the leader of one or more partitions. When a topic is created, the broker that holds the first replica of a partition is preferred as the leader for that partition. As leader, a broker is the one that replicates data to the other brokers, based on configuration. If a leader node fails, an ISR (in-sync replica) node is promoted to be the next leader. Any Kafka broker can act as a bootstrap server. Brokers update ZooKeeper with metadata (which topic, what partition, what offset, etc.).
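The preferred-leader and ISR-promotion rules can be illustrated with a tiny lookup. The replica lists here are made-up sample data, not output from any real admin API:

```python
# Replica assignment per partition: by convention, the first replica in
# the list is the "preferred leader" (illustrative data only).
replicas = {
    0: ["broker-1", "broker-2", "broker-3"],
    1: ["broker-2", "broker-3", "broker-1"],
}

def leader_for(partition, alive):
    """Pick the preferred leader if it is up; otherwise promote the next
    in-sync replica that is still alive."""
    for broker in replicas[partition]:
        if broker in alive:
            return broker
    raise RuntimeError("no in-sync replica available")

print(leader_for(0, {"broker-1", "broker-2", "broker-3"}))  # broker-1 (preferred)
print(leader_for(0, {"broker-2", "broker-3"}))              # broker-2 (ISR promoted)
```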
- With all the quick reading we did above, let's see what a Kafka cluster looks like:
Finally, let's understand a little bit about ZooKeeper. Note that this component is being decommissioned from the Kafka ecosystem; in fact, newer versions of Kafka no longer need it:
- ZooKeeper communicates with the Kafka controller to elect leaders for topic partitions. The controller is a service that runs on all brokers but is active on only one broker in the cluster; this active controller broker is responsible for maintaining the leader-follower assignments in the cluster. Controller election takes place as follows: when the cluster comes up, all nodes make calls to create /controller, the first one to succeed becomes the controller, and the rest get an exception. When the controller dies, the znode is removed and re-election happens the same way. ZooKeeper also keeps the list of existing topics and each of their configurations, maintains the latest metadata of the Kafka cluster, and receives heartbeats from the controller.
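The "first to create /controller wins" election can be simulated with a toy store that mimics ZooKeeper's create-if-absent semantics (FakeZooKeeper is a stand-in written for this sketch, not a real client):

```python
class FakeZooKeeper:
    """Toy stand-in for ZooKeeper: creating an existing znode fails."""
    def __init__(self):
        self.znodes = {}

    def create(self, path, owner):
        if path in self.znodes:
            raise FileExistsError(f"{path} already owned by {self.znodes[path]}")
        self.znodes[path] = owner

    def delete(self, path):
        self.znodes.pop(path, None)

zk = FakeZooKeeper()
controller = None
# All brokers race to create /controller; only the first succeeds.
for broker in ["broker-2", "broker-1", "broker-3"]:
    try:
        zk.create("/controller", broker)
        controller = broker
    except FileExistsError:
        pass  # the losers watch the znode and retry when it disappears

print(controller)  # broker-2: first to succeed wins
zk.delete("/controller")  # controller dies -> znode removed -> re-election
```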