Apache Kafka Streaming: Everything you need to know before you kickstart!

Suyash Srivastava
9 min read · Dec 26, 2020

Apache Kafka is a community distributed event streaming platform capable of handling trillions of events a day. Originally hatched at LinkedIn, Kafka is now an increasingly mainstream part of a broad open-source development community. In fact, Kafka has reached a pivotal moment as it’s being used as a central platform for managing streaming data in organizations, including IoT operations, fraud and security in the financial services industry, and tracking store inventory in the retail industry, among others.

So, if you are a developer who wants to know just enough to get started with Kafka Streaming, this article is for you. I have attached articles/sources for further reading at the end of this blog.

What is Kafka really?

Kafka is a distributed streaming platform that is used to publish and subscribe to streams of records. Kafka is designed to allow your apps to process records as they occur. It is also used to stream data into data lakes, applications, and real-time stream analytics systems.

Why do more than a third of Fortune 500 companies prefer Kafka?

Kafka offers great performance and stability, provides reliable durability, has a flexible publish-subscribe/queue model that scales well to any number of consumer groups, has robust replication, gives producers tunable consistency guarantees, and preserves ordering at the shard level (the Kafka topic partition).

Now, let's see how Kafka is able to do all of the above. Let's start with a simple high-level architecture of Kafka:-

Source : Kafka Architecture

There are four main parts in a Kafka system:

  • Broker: Handles all requests from clients (produce, consume, and metadata) and keeps data replicated within the cluster. There can be one or more brokers in a cluster.
  • Zookeeper: Keeps the state of the cluster (brokers, topics, users).
  • Producer: Sends records to a broker.
  • Consumer: Consumes batches of records from the broker.

Kafka Broker

A Kafka cluster is made up of multiple Kafka brokers. Each Kafka broker has a unique ID (number). Kafka brokers contain topic log partitions. Connecting to one broker bootstraps a client to the entire Kafka cluster: as soon as you are connected to an individual broker, you have access to produce/consume data from any other broker in that cluster. Running a single Kafka broker is possible, but it doesn't give all the benefits that Kafka in a cluster can give, for example data replication.

Source : Kafka Cluster

Topic: Generally, a topic refers to a particular heading or a name given to some specific inter-related ideas. In Kafka, the word topic refers to a category or a common name used to store and publish a particular stream of data. Basically, topics in Kafka are similar to tables in a database, but without all the constraints. We can create as many topics as we want, each identified by a name of the user's choice. A producer publishes data to a topic, and a consumer reads that data from the topic by subscribing to it.

Partition: A topic is split into several parts which are known as the partitions of the topic. Each partition is an ordered sequence of records, and the data published to the topic is stored in its partitions. Therefore, while creating a topic, we need to specify the number of partitions (the number is arbitrary and can be increased later).
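For illustration, here is a minimal sketch of creating a topic programmatically with the Java AdminClient. It assumes the kafka-clients library is on the classpath and a broker is reachable at localhost:9092; the topic name "orders", the partition count, and the replication factor are made-up values (the same thing can be done with the kafka-topics command-line tool).

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumption: a broker is listening on localhost:9092
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical topic "orders" with 3 partitions and a replication factor of 2
            NewTopic topic = new NewTopic("orders", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}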

Kafka Producers :

Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion for load balancing or it can be done according to some semantic partition function (e.g., based on some key in the record).
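As a rough sketch of both partitioning strategies, the snippet below (same assumptions as the topic-creation example: kafka-clients on the classpath, a broker on localhost:9092, and the hypothetical "orders" topic) sends one record without a key, which lets the default partitioner spread records across partitions, and one record keyed by a customer id, which pins every record with that key to the same partition and therefore preserves per-key ordering.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // No key: the default partitioner spreads records across partitions
            // (round-robin or sticky, depending on the client version)
            producer.send(new ProducerRecord<>("orders", "order-created"));

            // With a key: all records sharing the key "customer-42" go to the same partition
            producer.send(new ProducerRecord<>("orders", "customer-42", "order-created"));

            producer.flush();
        }
    }
}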

Kafka Consumers :

Messaging is traditionally based on two models:

  1. Queuing
  2. Publish Subscribe

The strength of queuing is that it allows you to divide up the processing of data over multiple consumer instances. This lets you scale your processing. Unfortunately, queues aren’t multi-subscriber. After one process reads the data, it’s gone. Publish-subscribe allows you to broadcast data to multiple processes but has no way of scaling processing because every message goes to every subscriber.

The consumer group concept in Kafka generalizes these two concepts. As with the queue, the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances. If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.

Source: Kafka Consumer
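A minimal consumer sketch under the same assumptions (kafka-clients, a broker on localhost:9092, the hypothetical "orders" topic, and a made-up group id "order-processors"): every instance started with this group id shares the topic's partitions, while instances started with different group ids each receive every record.

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ConsumerGroupExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Instances sharing this group id divide the partitions of "orders" among themselves
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}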

Offset: The offset is a simple integer number that is used by Kafka to maintain the current position of a consumer. Consumers can read messages starting from a specific offset and are allowed to read from any offset point they choose, allowing consumers to join the cluster at any point in time they see fit. Kafka maintains two types of offsets:-

Current Offset: The current offset is a pointer to the last record that Kafka has already sent to a consumer in the most recent poll. So, the consumer doesn’t get the same record twice because of the current offset.

Committed Offset: Once a consumer is sure that it has successfully processed a record, it may want to commit the offset. So, the committed offset is a pointer to the last record that a consumer has successfully processed. In the event of partition rebalancing, when a new consumer is assigned a partition, it should ask: where do I start? What has already been processed by the previous owner? The answer to both questions is the committed offset.
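As a small illustration of the "read from any offset" point above, this hypothetical helper assigns one partition explicitly and seeks to a chosen position. It assumes a consumer configured like the earlier sketch but without a subscribe call, since manual assignment and group subscription are mutually exclusive.

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.util.Collections;

public class SeekExample {
    // Hypothetical helper: start reading partition 0 of "orders" at the given offset
    static void readFrom(KafkaConsumer<String, String> consumer, long offset) {
        TopicPartition partition = new TopicPartition("orders", 0);
        // Manual assignment instead of group subscription
        consumer.assign(Collections.singletonList(partition));
        // The next poll() will return records starting at this offset
        consumer.seek(partition, offset);
    }
}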

Leveraging the above information, let's look at offset management techniques:-

  1. Auto Commit: You can control this feature by setting two properties: enable.auto.commit and auto.commit.interval.ms. The first property is true by default, so auto-commit is enabled out of the box. You can turn it off by setting enable.auto.commit to false. The second property defines the interval between auto-commits; its default value is five seconds. So, in the default configuration, when you make a call to the poll method, the consumer checks whether it is time to commit. If five seconds have passed since the previous commit, it will commit the last offset. In short, Kafka will commit your current offset every five seconds.

Auto-commit is a convenient option, but it may cause records to be processed a second time. Let's say you received ten messages, so the consumer increases the current offset to 10. You take four seconds to process these ten messages and make a new poll call. Since five seconds haven't passed, the consumer will not commit the offset. The first ten records are already processed, but nothing is committed yet. Now suppose a rebalance is triggered for some reason, and the partition goes to a different consumer. Since there is no committed offset, the new owner of the partition starts reading from the beginning and processes the first ten records again. :(

2. Manual Commit: The solution to this particular problem is a manual commit. We can turn auto-commit off and commit manually after processing the records. This can be done synchronously or asynchronously. A synchronous commit is a straightforward and reliable method, but it is a blocking call: it blocks until the commit operation completes, and it retries if there are recoverable errors.
An asynchronous commit sends the request and continues. The drawback is that commitAsync will not retry, but there is a valid reason for such behaviour.
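A minimal sketch of the manual-commit pattern, under the same assumptions as the earlier consumer example: auto-commit is switched off, each batch returned by poll is processed first, and only then is the offset committed with a blocking commitSync (commitAsync being the non-blocking variant).

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ManualCommitExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Disable auto-commit so offsets are committed only after processing
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // hypothetical business logic
                }
                // Blocking commit of the offsets from the last poll; retries recoverable errors.
                // consumer.commitAsync() would send the request and continue without retrying.
                consumer.commitSync();
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.value());
    }
}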

Zookeeper : Kafka uses Zookeeper to manage service discovery for the Kafka brokers that form the cluster. The four main reasons why Kafka needs Zookeeper are:-

  1. Leader election: Zookeeper has the responsibility of maintaining the leader-follower relationship across all the partitions. If a node shuts down for some reason, it is Zookeeper's responsibility to tell a replica to act as the partition leader. (We will look at this in a bit.)
  2. Configuration of topics: Zookeeper stores the configuration for all the topics, including the list of existing topics, the number of partitions for each topic, the location of all the replicas, the list of configuration overrides for all topics, which node is the preferred leader, and so on.
  3. Access control lists: Access control lists or ACLs for all the topics are also maintained within Zookeeper.
  4. Membership of the cluster: Zookeeper also maintains a list of all the brokers that are functioning at any given moment and are a part of the cluster.

Now, let's look at the most interesting topic, “Replication in Kafka”.

What is Replication?

Replication simply means keeping copies of the data across the cluster so as to promote availability in any application. Replication in Kafka is at the partition level: each partition has zero or more additional copies (replicas) spread across the cluster.

So, when we say a topic has a replication factor of 2, it means we will have two copies of each of its partitions. Kafka considers that a record is committed when all replicas in the In-Sync Replica set (ISR) have confirmed that they have written the record to their log.

What is a Leader ?

Each partition has one server which acts as the “leader”. All writes and reads for a partition go through its leader. The leader is responsible for writing the message to its own log and, once that message has been committed, for propagating it to the additional replicas on different brokers. Each replica acknowledges that it has received the message and can then be called in sync.

What is In-Sync Replica ?

In-Sync Replicas are the replicated partitions that are in sync with their leader, i.e. those followers that have the same messages as the leader. It's not mandatory for the number of in-sync replicas to equal the number of replicas. Followers replicate data from the leader by sending fetch requests periodically, by default every 500 ms.

Why do we need Replication ?

For “Handling Failures” !

We rely on Zookeeper for detecting broker failures. We use a controller (embedded in one of the brokers) to receive all Zookeeper notifications about the failure and to elect new leaders. If a leader fails, the controller selects one of the replicas in the ISR as the new leader and informs the followers about the new leader.

By design, committed messages are always preserved during leadership change whereas some uncommitted data could be lost. The leader and the ISR for each partition are also stored in Zookeeper and are used during the failover of the controller. Both the leader and the ISR are expected to change infrequently since failures are rare.

For clients, a broker only exposes committed messages to the consumers. Since committed data is always preserved during broker failures, a consumer can automatically fetch messages from another replica, using the same offset.

Finally, putting the pieces together.

A distributed file system, like HDFS, allows static file storage for batch processing. These types of systems allow the storing and processing of historical data from the past.

A traditional enterprise messaging system allows processing future messages that will arrive after you subscribe. Applications built in this way process future data as it arrives.

Kafka combines the capabilities of a distributed file system and a traditional enterprise messaging system, to deliver a platform for streaming applications and data pipelines.

Kafka allows producers to wait on acknowledgment. A write isn’t considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.
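On the producer side, this "wait for full replication" behaviour maps to the acks setting of the standard Java client. A small configuration sketch follows; the broker address is an assumption, and the serializer settings from the earlier producer example would be added before use.

import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class DurableProducerConfig {
    static Properties durableProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // acks=all: the leader reports success only after the full set of
        // in-sync replicas has acknowledged the write
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Retry transient failures rather than dropping the record
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        return props;
    }
}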

Kafka allows consumers to control their read position.

For streaming data pipelines, the combination of durable storage and subscription to real-time events makes it possible to use Kafka for very low-latency pipelines. Kafka's ability to store data reliably means it can be used for critical data, where data delivery must be guaranteed, or for integration with offline systems that load data only periodically or may go down for extended periods of time for maintenance. The stream processing facilities make it possible to transform data as it arrives.

Thanks for reading!

In the next article, we will look at how to set up a fully functional Kafka ecosystem.

The below articles were very helpful :-
