LinkedIn’s Kafka Project: Moving to Apache?
We speak to Jun Rao, Principal Software Engineer at LinkedIn, about Kafka’s proposed move.
After starting life at LinkedIn and being open sourced earlier this year, the Kafka messaging system has now been proposed as a new Apache Incubator project. But what makes Kafka different to other messaging systems? And how relevant is Kafka beyond LinkedIn? JAXenter spoke to Principal Software Engineer at LinkedIn, Jun Rao, to find out.
JAXenter: Kafka has been proposed as a new Apache Incubator project. Can you give us an introduction to Kafka?
Jun Rao: Kafka is a distributed, persistent, high-throughput messaging system. It combines the benefits of traditional messaging systems such as ActiveMQ (e.g., low latency, real-time consumption) and log aggregators such as Flume (high throughput). Kafka can be used to collect a large amount of log data and load it into Hadoop or a data warehouse for offline consumption. It can also serve the same data for real-time consumption.
JAXenter: How is Kafka unique in the world of stream processing?
Jun: Kafka differs from traditional messaging systems in two ways. First, it’s designed to be a distributed system from the beginning. One can distribute messages across multiple brokers in Kafka. Multiple consumers can jointly consume messages in a topic in parallel. Consumers detect and recover from broker failures automatically. Second, Kafka is designed for high throughput. We have a batch API for both the producer and the consumer, a simple and efficient on-disk storage format, and efficient data transfer over the socket.
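To make the idea of a batching producer that spreads messages across brokers more concrete, here is a minimal sketch in Python. It is not Kafka's API: the class name `BatchingProducer`, the `send` callback, and the choice of a CRC32 hash of the key to pick a broker are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict


class BatchingProducer:
    """Hypothetical producer: buffers messages per broker and sends them in batches."""

    def __init__(self, num_brokers, batch_size, send):
        self.num_brokers = num_brokers
        self.batch_size = batch_size
        self.send = send  # callback taking (broker_id, list_of_messages)
        self.buffers = defaultdict(list)

    def publish(self, key, message):
        # Assumption: a stable hash of the key selects the broker.
        broker = zlib.crc32(key) % self.num_brokers
        buffer = self.buffers[broker]
        buffer.append(message)
        if len(buffer) >= self.batch_size:
            self.flush(broker)

    def flush(self, broker):
        # Remove the pending batch for this broker and send it in one call.
        batch = self.buffers.pop(broker, [])
        if batch:
            self.send(broker, batch)

    def flush_all(self):
        for broker in list(self.buffers):
            self.flush(broker)


def demo_batching():
    sent = []
    producer = BatchingProducer(
        num_brokers=3,
        batch_size=2,
        send=lambda broker, batch: sent.append((broker, batch)),
    )
    for i in range(5):
        producer.publish(b"user-42", b"event-%d" % i)
    producer.flush_all()
    return sent
```

In this sketch, five messages with the same key go to the same broker in three network calls instead of five: two full batches and one final flush.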
JAXenter: Kafka was developed internally at LinkedIn to meet the company’s particular use cases. How do you think Kafka could be useful to the wider Apache ecosystem?
Jun: At LinkedIn, we are using Kafka primarily for collecting and consuming all kinds of log data (user activity and software logs), both offline and online. In the future, we also plan to use Kafka for traditional queueing usage (as a replacement for ActiveMQ). These are problems common to a lot of companies. So we hope that, by becoming an Apache project, Kafka can be more widely adopted in the industry, which will in turn help improve the quality of Kafka over time.
JAXenter: What technologies are at work in Kafka?
Jun: Kafka uses Apache ZooKeeper for failure detection and coordination among producers, brokers and consumers. It also uses the sendfile API to transfer bytes from a file in the broker to the consumer directly, without extra copying in the application space.
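The sendfile idea can be sketched in a few lines of Python. This is an illustration, not Kafka's broker code: the function name `send_log_segment` and the use of a local socket pair in place of a real broker-to-consumer connection are assumptions. Python's `socket.sendfile` uses the operating system's sendfile call where it is available and falls back to ordinary sends otherwise.

```python
import os
import socket
import tempfile


def send_log_segment(path, sock, offset=0, count=None):
    """Send bytes from a file straight to a connected socket; return bytes sent."""
    with open(path, "rb") as segment:
        # The file contents are handed to the socket without being read
        # into a buffer in this program first (where the OS supports it).
        return sock.sendfile(segment, offset=offset, count=count)


def demo_sendfile():
    payload = b"message-1\nmessage-2\nmessage-3\n"
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name

    # A connected socket pair stands in for broker and consumer.
    broker_side, consumer_side = socket.socketpair()
    try:
        sent = send_log_segment(path, broker_side)
        broker_side.close()  # signal end of stream to the reader
        received = b""
        while True:
            chunk = consumer_side.recv(4096)
            if not chunk:
                break
            received += chunk
    finally:
        broker_side.close()
        consumer_side.close()
        os.unlink(path)
    return sent, received
```

The `offset` and `count` parameters let a caller serve only part of a file, which is how a consumer could ask for messages starting at a given position.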
JAXenter: What are the next steps for the Kafka project?
Jun: We are working on two major features in the near future. The first one is to enable compression of messages to reduce both the amount of data stored and transferred. The second one is built-in replication so that each message is stored redundantly on multiple brokers. This improves both the durability and the availability of the system.
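As a rough illustration of why compressing messages reduces both storage and transfer, the sketch below packs a batch of messages into one gzip-compressed blob and unpacks it again. The length-prefixed framing and the function names `pack_batch` and `unpack_batch` are assumptions for this example and do not describe the format Kafka will use.

```python
import gzip
import struct


def pack_batch(messages):
    """Frame each message with a 4-byte big-endian length, then gzip the batch."""
    framed = b"".join(struct.pack(">I", len(m)) + m for m in messages)
    return gzip.compress(framed)


def unpack_batch(blob):
    """Reverse pack_batch: decompress and split back into individual messages."""
    framed = gzip.decompress(blob)
    messages = []
    pos = 0
    while pos < len(framed):
        (size,) = struct.unpack_from(">I", framed, pos)
        pos += 4
        messages.append(framed[pos:pos + size])
        pos += size
    return messages


def demo_compression():
    # Log lines tend to be repetitive, which is where compression pays off.
    batch = [b"GET /profile/view?id=%d HTTP/1.1 200" % i for i in range(100)]
    blob = pack_batch(batch)
    raw_size = sum(len(m) for m in batch)
    return batch, unpack_batch(blob), raw_size, len(blob)
```

Compressing a whole batch rather than each message on its own lets the compressor exploit repetition across messages, which is typical of log data.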