Meet LinkedIn's Messaging System

LinkedIn’s Kafka Project: Moving to Apache?

Jessica Thornsby

We speak to Jun Rao, Principal Software Engineer at LinkedIn, about Kafka’s proposed move.

  • Jun Rao

    Jun Rao has conducted research on MapReduce, content-oriented scalable databases, query processing in relational databases, in-memory databases, and data warehouses. He developed and delivered the partitioning advisor in DB2 V8.2 (parallel version). He has extensive experience with Hadoop and HBase.

After starting life at LinkedIn and being open sourced earlier this year, the Kafka messaging system has now been proposed as a new Apache Incubator project. But what makes Kafka different from other messaging systems? And how relevant is Kafka beyond LinkedIn? JAXenter spoke to Jun Rao, Principal Software Engineer at LinkedIn, to find out.

JAXenter: Kafka has been proposed as a new Apache Incubator project. Can you give us an introduction to Kafka?

Jun Rao: Kafka is a distributed, persistent, high-throughput messaging system. It combines the benefits of traditional messaging systems such as ActiveMQ (low latency, real-time consumption) with those of log aggregators such as Flume (high throughput). Kafka can be used to collect a large amount of log data and load it into Hadoop or a data warehouse for offline consumption. It can also serve the same data for real-time consumption.
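
To make the publishing side concrete, here is a minimal sketch of sending a log event to a Kafka topic. It uses the Java producer client that Kafka later shipped (org.apache.kafka.clients.producer), which postdates this interview; the broker address and topic name are placeholders, not LinkedIn’s actual setup.

    // Minimal sketch: publish a log event to a Kafka topic using the later Java producer client.
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class LogEventProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // hypothetical broker address
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Each user-activity or software-log event becomes one message on a topic.
                producer.send(new ProducerRecord<>("user-activity", "pageview", "{\"page\":\"/home\"}"));
            }
        }
    }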

JAXenter: How is Kafka unique in the world of stream processing?

Jun: Kafka differs from traditional messaging systems in two ways. First, it’s designed to be a distributed system from the beginning. Messages can be distributed across multiple brokers in Kafka, and multiple consumers can jointly consume the messages in a topic in parallel. Consumers detect and recover from broker failures automatically. Second, Kafka is designed for high throughput. We have a batch API for both the producer and the consumer, a simple and efficient on-disk storage format, and efficient data transfer over the socket.
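
The parallel, failure-tolerant consumption Jun describes can be sketched with the Java consumer client Kafka later shipped: several processes running the code below with the same group.id split the partitions of a topic between them, and the client reassigns partitions if one of them fails. The group id, topic name and broker address are placeholders.

    // Sketch: consumers sharing a group.id jointly consume a topic's partitions in parallel.
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class LogEventConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // hypothetical broker address
            props.put("group.id", "activity-loaders");           // consumers with this id share the work
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("user-activity"));
                while (true) {
                    // poll() returns messages in batches, matching the batch-oriented consumption
                    // described above; partition rebalancing on failure is handled by the client.
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("%s -> %s%n", record.key(), record.value());
                    }
                }
            }
        }
    }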

JAXenter: Kafka was developed internally at LinkedIn, to meet the company’s particular use cases. How do you think Kafka could be useful to the wider Apache ecosystem?

Jun: At LinkedIn, we are using Kafka primarily for collecting and consuming all kinds of log data (user activity and software logs), both offline and online. In the future, we also plan to use Kafka for traditional queueing (as a replacement for ActiveMQ). These are problems common to a lot of companies, so we hope that by becoming an Apache project, Kafka can be more widely adopted in the industry, which will in turn help improve the quality of Kafka over time.

JAXenter: What technologies are at work in Kafka?

Jun: Kafka exploits Apache ZooKeeper for failure detection and coordination among producers, brokers and consumers. It also uses the sendfile API to transfer bytes from a file on the broker to the consumer directly, without extra copying in application space.
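
The sendfile mechanism Jun mentions is exposed in Java as FileChannel.transferTo, which delegates to sendfile(2) on Linux so that bytes move from the page cache straight to the socket without passing through application buffers. The sketch below only illustrates that zero-copy idea in isolation; it is not Kafka’s internal broker code, and the file name and port are placeholders.

    // Illustration of zero-copy transfer: file bytes go to the socket without a user-space copy.
    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class ZeroCopySend {
        public static void main(String[] args) throws IOException {
            try (FileChannel log = FileChannel.open(Paths.get("segment.log"), StandardOpenOption.READ);
                 SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9999))) {
                long position = 0;
                long remaining = log.size();
                while (remaining > 0) {
                    // Kernel-level copy from file to socket; no intermediate byte[] in user space.
                    long sent = log.transferTo(position, remaining, socket);
                    position += sent;
                    remaining -= sent;
                }
            }
        }
    }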

JAXenter: What are the next steps for the Kafka project?

Jun: We are working on two major features in the near future. The first is compression of messages, to reduce the amount of data both stored and transferred. The second is built-in replication, so that each message is stored redundantly on multiple brokers. This improves both the durability and the availability of the system.
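
Both features described here later shipped in Kafka. As a sketch of how they surface to users in the later Java clients and tooling, compression is a producer-side setting and the replication factor is chosen when a topic is created; the broker address, topic name, partition count and replication factor below are placeholders.

    // Sketch: enabling compression on the producer and creating a replicated topic.
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CompressionAndReplication {
        public static void main(String[] args) throws Exception {
            // Producer-side compression of message batches; these properties would be
            // passed to a KafkaProducer as in the earlier producer sketch.
            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "localhost:9092");
            producerProps.put("compression.type", "gzip");  // reduces data stored and transferred

            // Create a topic whose partitions are each stored redundantly on three brokers.
            Properties adminProps = new Properties();
            adminProps.put("bootstrap.servers", "localhost:9092");
            try (AdminClient admin = AdminClient.create(adminProps)) {
                NewTopic topic = new NewTopic("user-activity", 8, (short) 3); // 8 partitions, replication factor 3
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }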
