Beginner Kafka tutorial: Get started with distributed systems

Distributed systems are collections of computers that work together and appear to end users as a single system. They let us scale by adding more machines, and they can handle billions of requests and roll out upgrades without downtime. Apache Kafka has become one of the most widely used distributed systems on the market today.

According to the official Kafka site, Apache Kafka is an “open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.” Kafka is used by most Fortune 100 companies, including big tech names like LinkedIn, Netflix, and Microsoft.

In this Apache Kafka tutorial, we’ll discuss the uses, key features, and architectural components of the distributed streaming platform. Let’s get started!

We’ll cover:

  • What is Kafka?

What is Kafka?

Apache Kafka is an open-source software platform written in Scala and Java. Kafka was developed at LinkedIn as a messaging system and open-sourced in 2011, and it has since grown into a popular distributed event streaming platform. The platform is capable of handling trillions of records per day.

Kafka is a distributed system composed of servers and clients that communicate over a TCP network protocol. The system allows us to read, write, store, and process events. We can think of an event as an independent piece of information that needs to be relayed from a producer to a consumer. Relevant examples include Amazon payment transactions, iPhone location updates, and FedEx shipping orders. Kafka is primarily used for building data pipelines and implementing streaming solutions.
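To make the idea of an event concrete, here is a minimal Python sketch of one. The Event class and its field names are hypothetical and not part of any Kafka client library; they only mirror the common description of an event as a key, a value, and a timestamp.

```python
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """Hypothetical model of one event: what happened, to whom, and when."""
    key: str      # identifies the entity the event is about
    value: str    # the payload describing what happened
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))


# A payment event stamped with the current time
payment = Event(key="customer-42", value="paid $25.00 for order 1001")

# A location update with an explicit timestamp
location = Event(key="phone-7", value="lat=40.7,lon=-74.0",
                 timestamp_ms=1_700_000_000_000)

print(payment.key, "->", payment.value)
```

In a real application, a producer would serialize something like this and send it to Kafka, and a consumer would read it back later.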

Kafka allows us to build apps that can constantly and accurately consume and process multiple streams at very high speeds. It works with streaming data from thousands of different data sources. With Kafka, we can:

  • Process records as they occur

The Kafka publish-subscribe messaging system is extremely popular in the Big Data scene and integrates well with Apache Spark and Apache Storm.

Kafka use cases

You can use Kafka in many different ways, but here are some examples of different use cases shared on the official Kafka site:

  • Processing financial transactions in real-time

Key features of Kafka

Let’s take a look at some of the key features that make Kafka so popular:

  • Scalability: Kafka can scale each part of the system, including event producers, processors, consumers, and connectors.

Components of Kafka architecture

Before we dive into some of the components of the Kafka architecture, let’s take a look at some of the key concepts that will help us understand it:

Kafka Consumer Groups

A consumer group is a set of related consumers that share a task, such as forwarding messages to a service. The consumers in a group can run as multiple processes at the same time. Kafka delivers messages from the partitions of a topic to the consumers in the group, and each partition is read by exactly one consumer within that group.

Kafka Partitions

Kafka topics are divided into partitions, and these partitions are replicated across different brokers. Because a topic is split into partitions, multiple consumers can read from the same topic simultaneously, each from its own partition.
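As a rough illustration of how records could be spread over the partitions of a topic, the sketch below routes each record by hashing a key. It assumes records carry a key, and it uses zlib.crc32 as a stand-in hash; Kafka's built-in partitioner uses a different hash function, and choose_partition is a hypothetical helper.

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition number with a stable hash (a stand-in, not Kafka's own)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 3
partitions = {p: [] for p in range(NUM_PARTITIONS)}

records = [("alice", "login"), ("bob", "login"),
           ("alice", "purchase"), ("alice", "logout")]

# Append each record to the partition its key hashes to
for key, value in records:
    partitions[choose_partition(key, NUM_PARTITIONS)].append((key, value))

for number, contents in partitions.items():
    print(number, contents)
```

Because the hash is stable, every record with the same key lands in the same partition, so the records for one key keep their relative order.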

Topic Replication Factor

The topic replication factor is the number of copies of each partition that Kafka keeps on different brokers. It ensures that data remains accessible when part of the cluster fails. If a broker goes down, the replicas stored on the other brokers are still there, so we can continue to access our data.
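The effect of the replication factor can be sketched with a small simulation. The round-robin placement below is a simplified assumption rather than Kafka's actual replica assignment logic, and the function and broker names are hypothetical.

```python
def place_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on consecutive brokers (simplified round-robin)."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    return {
        p: [brokers[(p + i) % len(brokers)] for i in range(replication_factor)]
        for p in range(num_partitions)
    }


def surviving_replicas(placement, failed_broker):
    """Return the replicas that remain for each partition after one broker fails."""
    return {
        p: [b for b in replicas if b != failed_broker]
        for p, replicas in placement.items()
    }


brokers = ["broker-1", "broker-2", "broker-3"]
placement = place_replicas(num_partitions=3, brokers=brokers, replication_factor=2)
after_failure = surviving_replicas(placement, failed_broker="broker-1")

print(placement)
print(after_failure)
```

With a replication factor of 2, every partition in this sketch still has at least one replica after broker-1 fails. With a replication factor of 1, the partitions stored only on the failed broker would be unavailable.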

Kafka Topics

Topics help us organize our messages. We can think of them as channels that our data goes through. Kafka producers can publish messages to topics, and Kafka consumers can read messages from topics that they are subscribed to.
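The channel analogy can be sketched with a tiny in-memory model. TinyBus and its methods are hypothetical names used only for illustration; a real Kafka client talks to brokers over the network and looks different.

```python
from collections import defaultdict


class TinyBus:
    """In-memory stand-in for topics: named channels that hold messages."""

    def __init__(self):
        self.topics = defaultdict(list)        # topic name -> messages
        self.subscriptions = defaultdict(set)  # consumer name -> topic names

    def publish(self, topic, message):
        """A producer adds a message to a topic."""
        self.topics[topic].append(message)

    def subscribe(self, consumer, topic):
        """A consumer registers interest in a topic."""
        self.subscriptions[consumer].add(topic)

    def read(self, consumer):
        """Return messages only from the topics this consumer subscribed to."""
        return {t: list(self.topics[t]) for t in sorted(self.subscriptions[consumer])}


bus = TinyBus()
bus.publish("payments", "order 1001 paid")
bus.publish("shipping", "order 1001 shipped")
bus.subscribe("billing-app", "payments")

print(bus.read("billing-app"))
```

The billing-app consumer sees the message in payments but nothing from shipping, because it never subscribed to that topic.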

Now that we’ve covered some foundational concepts, we’re ready to get into the architectural components!

Kafka APIs

Kafka has four essential APIs within its architecture. Let’s take a look at them!

Kafka Producer API

The Producer API allows apps to publish streams of records to Kafka topics.

Kafka Consumer API

The Consumer API allows apps to subscribe to Kafka topics. This API also allows the app to process streams of records.

Kafka Connector API

The Connector API (named the Connect API in the official documentation) connects apps or data systems to topics. It helps us build and run reusable producers and consumers, so the same connection can be shared across different solutions.

Kafka Streams API

The Streams API allows apps to process data using stream processing. This API enables apps to take in input streams from different topics and process them with a stream processor. Then, the app can produce output streams and send them out to different topics.
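The input, process, output flow described above can be sketched in plain Python. This is not the Kafka Streams API (which is a Java library); the topic names, the record format, and process_stream are assumptions made for illustration.

```python
from itertools import chain


def process_stream(records, keep, transform):
    """Lazily filter and transform records one at a time, as a stream processor would."""
    for record in records:
        if keep(record):
            yield transform(record)


# Two hypothetical input topics, each modeled as a list of "type:amount" records
input_topics = {
    "web-orders": ["order:12.50", "refund:5.00"],
    "store-orders": ["order:99.00", "order:3.25"],
}

# Read both input streams, keep only orders, and extract the amount
merged_input = chain(input_topics["web-orders"], input_topics["store-orders"])
output_topic = list(process_stream(
    merged_input,
    keep=lambda r: r.startswith("order:"),
    transform=lambda r: float(r.split(":")[1]),
))

print(output_topic)  # [12.5, 99.0, 3.25]
```

The generator handles each record as it arrives instead of waiting for the whole input, which is the core idea behind stream processing.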

Kafka Brokers

A single Kafka server is called a broker. Typically, multiple brokers operate as one Kafka cluster. The cluster is controlled by one of the brokers, called the controller. The controller is responsible for administrative actions like assigning partitions to other brokers and monitoring for failures and downtime.

Partitions can be assigned to multiple brokers. If this happens, the partition is replicated. This creates redundancy in case one of the brokers fails. A broker is responsible for receiving messages from producers and committing them to disk. Brokers also receive requests from consumers and respond with messages taken from partitions.
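A broker's two jobs, storing messages from producers and serving them to consumers, can be sketched as follows. This toy version keeps messages in memory rather than on disk, and it identifies each message by its position (offset) in the partition. TinyBroker and PartitionLog are hypothetical names.

```python
class PartitionLog:
    """Append-only list of messages for one partition."""

    def __init__(self):
        self._records = []

    def append(self, message):
        """Store a message and return its offset (its position in the log)."""
        self._records.append(message)
        return len(self._records) - 1

    def fetch(self, offset, max_records=10):
        """Return up to max_records messages starting at the given offset."""
        return self._records[offset:offset + max_records]


class TinyBroker:
    """Toy broker that hosts several topic partitions in memory."""

    def __init__(self):
        self.logs = {}  # (topic, partition) -> PartitionLog

    def produce(self, topic, partition, message):
        log = self.logs.setdefault((topic, partition), PartitionLog())
        return log.append(message)

    def fetch(self, topic, partition, offset, max_records=10):
        log = self.logs.get((topic, partition))
        return log.fetch(offset, max_records) if log else []


broker = TinyBroker()
first_offset = broker.produce("payments", 0, "order 1001 paid")
second_offset = broker.produce("payments", 0, "order 1002 paid")

# A consumer asks for everything from offset 1 onward
fetched = broker.fetch("payments", 0, offset=1)
print(first_offset, second_offset, fetched)
```

Reading does not remove anything from the log in this sketch, so another consumer can fetch the same messages again from offset 0.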

[Figure: a broker hosting several topic partitions]

Kafka Consumers

Consumers subscribe to Kafka topics and receive the messages that producers write to them. Normally, each consumer belongs to a consumer group. In a consumer group, multiple consumers work together to read messages from a topic.

Let’s take a look at some of the different configurations for consumers and partitions in a topic:

Number of consumers in a group is equal to the number of partitions in a topic

In this scenario, each consumer reads from one partition.

Number of partitions in a topic is greater than the number of consumers in a group

In this scenario, some or all of the consumers read from more than one partition.

Single consumer with multiple partitions

In this scenario, all partitions are consumed by a single consumer.

Number of partitions in a topic is less than the number of consumers in a group

In this scenario, some of the consumers will be idle.
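The four scenarios above can be reproduced with a small assignment function. The round-robin strategy here is a simplified assumption, since Kafka's real assignment strategies are configurable and more involved. The function and consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer in the group, round-robin."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        assignment[consumers[index % len(consumers)]].append(partition)
    return assignment


# Equal numbers: each consumer reads from one partition
equal = assign_partitions([0, 1, 2], ["c1", "c2", "c3"])

# More partitions than consumers: consumers read from more than one partition
more_partitions = assign_partitions([0, 1, 2, 3], ["c1", "c2"])

# Single consumer: it reads from every partition
single = assign_partitions([0, 1, 2], ["c1"])

# Fewer partitions than consumers: c3 receives nothing and sits idle
fewer_partitions = assign_partitions([0, 1], ["c1", "c2", "c3"])

for result in (equal, more_partitions, single, fewer_partitions):
    print(result)
```

In every case a partition appears in exactly one consumer's list, which matches the rule that each partition is read by a single consumer within a group.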

Kafka Producers

Producers write messages to Kafka that consumers can read.

Advanced concepts to explore next

Congrats on taking your first steps with Apache Kafka! Kafka is an efficient and powerful distributed system. Kafka’s scaling capabilities allow it to handle large workloads. It’s often the preferred choice over other message queues for real-time data pipelines. Overall, it’s a versatile platform that can support many use cases. You’re now ready to move on to some more advanced Kafka topics such as:

  • Producer serialization

Happy learning!
