LinkedIn’s Kafka Project: Moving to Apache?
We speak to Jun Rao, Principal Software Engineer at LinkedIn, about Kafka’s proposed move.
After starting life at LinkedIn and being open sourced earlier
this year, the Kafka messaging system has now been proposed as a
new Apache Incubator project. But what makes Kafka different from
other messaging systems? And how relevant is Kafka beyond LinkedIn?
JAXenter spoke to Jun Rao, Principal Software Engineer at LinkedIn,
to find out.
JAXenter: Kafka has been proposed as a new
Apache Incubator project. Can you give us an introduction to
Kafka?
Jun Rao: Kafka is a distributed, persistent,
high-throughput messaging system. It combines the benefits of
traditional messaging systems such as ActiveMQ (e.g., low latency,
real-time consumption) and log aggregators such as Flume (high
throughput). Kafka can be used to collect a large amount of log
data and load it into Hadoop or a data warehouse for offline
consumption. It can also serve the same data for real-time
consumption.
JAXenter: How is Kafka unique in the world of
messaging systems?
Jun: Kafka differs from traditional messaging
systems in two ways. First, it’s designed to be a distributed system
from the beginning. One can distribute messages across multiple
brokers in Kafka. Multiple consumers can jointly consume messages
in a topic in parallel. Consumers detect and recover from broker
failures automatically. Second, Kafka is designed for high
throughput. We have a batch API for both the producer and the
consumer, a simple and efficient on-disk storage format, and
efficient data transfer over the socket.
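To make the batching idea concrete, here is a minimal sketch in Java of a producer-side buffer that groups messages and hands them to a sender in one call. It is an illustration only and not Kafka's producer API: the class name `BatchingSketch`, the `batchSize` parameter, and the `sender` callback (standing in for one network request to a broker) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration of producer-side batching; not Kafka's actual API.
final class BatchingSketch {
    private final int batchSize;
    private final Consumer<List<String>> sender; // stands in for one network send
    private final List<String> buffer = new ArrayList<>();

    BatchingSketch(int batchSize, Consumer<List<String>> sender) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sender = sender;
    }

    // Buffer a message; hand the whole batch to the sender once it is full.
    void send(String message) {
        buffer.add(message);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Send whatever is buffered, even if the batch is not yet full.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sender.accept(new ArrayList<>(buffer)); // copy so the batch stays stable
        buffer.clear();
    }

    public static void main(String[] args) {
        List<List<String>> sent = new ArrayList<>();
        BatchingSketch producer = new BatchingSketch(3, sent::add);
        for (int i = 0; i < 7; i++) {
            producer.send("event-" + i);
        }
        producer.flush(); // push out the final partial batch
        System.out.println(sent.size() + " sends for 7 messages");
    }
}
```

With a batch size of 3, seven messages result in three sends instead of seven, which is the kind of saving a batch API is meant to provide.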
JAXenter: Kafka was developed internally at
LinkedIn, to meet the company’s particular use cases. How do you
think Kafka could be useful to the wider Apache ecosystem?
Jun: At LinkedIn, we are using Kafka primarily
for collecting and consuming all kinds of log data (user activity
and software logs), both offline and online. In the future, we also
plan to use Kafka for traditional queueing (as a replacement
for ActiveMQ). These are problems common to a lot of companies. So,
we hope that by becoming an Apache project, Kafka can be more widely
adopted in the industry, which will in turn help improve the
quality of Kafka over time.
JAXenter: What technologies are at work in
Kafka?
Jun: Kafka exploits Apache Zookeeper for
failure detection and coordination among producers, brokers and
consumers. It also uses the sendfile API to transfer bytes from a
file in the broker to the consumer directly, without extra copying
in the application space.
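On the JVM, the usual way to reach this kind of transfer is `FileChannel.transferTo`, which many operating systems can carry out without copying the bytes through application buffers. The sketch below is an illustration under that assumption and is not taken from Kafka's source: the class `ZeroCopySketch` and its methods are hypothetical, and the demo writes to an in-memory channel, where the JVM may fall back to ordinary copying. With a socket channel as the target, the same call can be mapped onto sendfile on platforms that support it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration of file-to-channel transfer; not Kafka's broker code.
final class ZeroCopySketch {

    // Sends up to `count` bytes of `file`, starting at `offset`, to `target`.
    // Returns the number of bytes actually transferred.
    static long transfer(Path file, long offset, long count, WritableByteChannel target) {
        try (FileChannel source = FileChannel.open(file, StandardOpenOption.READ)) {
            long sent = 0;
            while (sent < count) {
                // transferTo may move fewer bytes than requested, so loop.
                long n = source.transferTo(offset + sent, count - sent, target);
                if (n <= 0) {
                    break; // end of file reached, or target accepted nothing
                }
                sent += n;
            }
            return sent;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: writes `data` to a temp file, transfers a slice of it into
    // memory, and returns that slice as text.
    static String roundTrip(byte[] data, long offset, long count) {
        try {
            Path file = Files.createTempFile("segment", ".log");
            try {
                Files.write(file, data);
                ByteArrayOutputStream sink = new ByteArrayOutputStream();
                transfer(file, offset, count, Channels.newChannel(sink));
                return new String(sink.toByteArray(), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] log = "hello kafka".getBytes(StandardCharsets.UTF_8);
        System.out.println(roundTrip(log, 6, 5));
    }
}
```

The consumer's requested offset and length translate directly into the `offset` and `count` arguments, so the broker in this sketch never needs to inspect or re-buffer the message bytes it serves.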
JAXenter: What are the next steps for the
project?
Jun: We are working on two major features in the
near future. The first one is to enable compression of messages to
reduce both the amount of data stored and transferred. The second
one is built-in replication so that each message is stored
redundantly in multiple brokers. This improves both the durability
and the availability of the system.
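As a rough illustration of why compressing messages helps with log data, the following Java sketch compresses a batch of similar log lines and restores it. The interview does not say which codec Kafka will use, so gzip (via the standard `java.util.zip` classes) is an assumption here, and the class `CompressionSketch` and the sample log lines are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration of batch compression; gzip is an assumed codec.
final class CompressionSketch {

    // Compresses a batch of message bytes with gzip.
    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray(); // stream is closed, so the gzip trailer is written
    }

    // Restores the original bytes from a gzip-compressed batch.
    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Builds a batch of made-up, repetitive log lines, typical of activity logs.
    static byte[] sampleBatch(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("INFO page_view member=").append(i % 50).append(" page=/home\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleBatch(1000);
        byte[] packed = compress(raw);
        System.out.println("raw=" + raw.length + " bytes, compressed=" + packed.length + " bytes");
    }
}
```

Because log lines within a batch tend to repeat the same fields, compressing a whole batch at once usually pays off more than compressing each message separately, and the smaller payload benefits both disk usage and network transfer.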