“In a way, Apache Beam is the glue that connects many big data systems together”
Apache Beam has successfully graduated from incubation, becoming a new Top-Level Project at the Apache Software Foundation. We invited the Apache Software Foundation’s Davor Bonaci and Jean-Baptiste Onofré to talk about the project’s journey to becoming a Top-Level Project and concrete plans for its future.
JAXenter: What is the idea behind Apache Beam?
Davor Bonaci and Jean-Baptiste Onofré: Apache Beam is a data processing system that runs at any scale. Its unified programming model and software development kits (SDKs) enable users to define their batch and streaming data processing pipelines. Beam’s runners enable users to execute those pipelines on many processing engines, providing portability and future-proofing, as well as avoiding engine, vendor or cloud “lock-in”.
The goal of Apache Beam is to raise the level of abstraction further than any existing system — to decouple the user’s business logic from the considerations of the underlying engine. This, in turn, enables users’ logic to run on any engine.
Well… this has been done before! Two decades ago, the dominant programming languages were C and C++, but a compiled executable would run on only a single operating system. Then Java came along — it raised the level of abstraction, introduced byte code, and made the compiled application portable across operating systems. It became successful and was closely followed by C#, Python, Scala, and many others. What these systems have achieved in a general sense, Beam aims to achieve in a very narrow domain — we focus on embarrassingly parallel, distributed data processing only.
JAXenter: Tell us more about what’s under this project’s hood: How does Beam work?
Davor Bonaci and Jean-Baptiste Onofré: At the top level, Beam offers multiple SDKs. Users leverage those to construct their own data processing pipeline. Beam then takes that pipeline, breaks it down into individual pieces and transforms it into an engine-independent and (mostly) language-independent form. That pipeline representation is then passed onto one of Beam’s runners, which further adapts the pipeline for execution on the given processing engine. In a way, Beam is the glue that connects many big data systems together.
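To make the idea concrete, here is a minimal toy sketch of that architecture — not Beam’s actual API, just hypothetical classes illustrating the separation described above: the user builds an engine-independent pipeline once, and interchangeable “runners” adapt it to different execution back-ends.

```python
from concurrent.futures import ThreadPoolExecutor

class Pipeline:
    """Engine-independent description of a data processing job."""
    def __init__(self):
        self.transforms = []  # ordered list of element-wise steps

    def apply(self, fn):
        self.transforms.append(fn)
        return self

class DirectRunner:
    """Executes the pipeline sequentially in the local process."""
    def run(self, pipeline, data):
        for fn in pipeline.transforms:
            data = [fn(x) for x in data]
        return data

class ThreadedRunner:
    """Executes the same pipeline, element-parallel, on a thread pool."""
    def run(self, pipeline, data):
        with ThreadPoolExecutor() as pool:
            for fn in pipeline.transforms:
                data = list(pool.map(fn, data))
        return data

# The same pipeline definition runs unchanged on either runner.
p = Pipeline().apply(lambda x: x * 2).apply(lambda x: x + 1)
print(DirectRunner().run(p, [1, 2, 3]))    # [3, 5, 7]
print(ThreadedRunner().run(p, [1, 2, 3]))  # [3, 5, 7]
```

In real Beam, the pipeline is far richer (windowing, triggers, event time) and the runners target engines such as Apache Flink, Apache Spark, or Google Cloud Dataflow, but the portability principle is the same: the pipeline description never mentions the engine.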
What Java, C#, Scala and Python have achieved in a general sense, Beam aims to achieve in a very narrow domain.
JAXenter: Can you describe a typical use case where the benefits of Beam shine through?
Davor Bonaci and Jean-Baptiste Onofré: You can use Apache Beam for any data processing needs, covering anything from a simple batch ETL pipeline to a complex event-time-based streaming pipeline. Some examples include:
- Finding patterns in data.
- Analyzing genomes.
- Fraud detection.
- Real-time gaming.
… really, anything that needs any kind of data processing.
Beam’s clean abstractions make it easy to develop such data processing pipelines. What’s more, executing such pipelines on a local machine, an on-premise cluster, or in the cloud is as simple as running one command, without any code changes.
JAXenter: Tell us about the history of Beam. How did the project begin?
Davor Bonaci and Jean-Baptiste Onofré: Apache Beam traces its roots back to the original MapReduce system, which Google published in a 2004 paper, and that fundamentally changed the way we do distributed data processing.
From this point early on, the stories diverge a little. Inside Google, engineers kept innovating and refining the core methodology, and those ideas were shared with the wider community in further scientific papers. Outside of Google, the open source community created its own MapReduce implementation in Apache Hadoop. An entire ecosystem developed around it, which we all know and love, the vast majority of it within the Apache Software Foundation. Over time, amazing innovation happened in both of these sibling branches, each occasionally influencing the other.
Apache Beam traces its roots back to the original MapReduce system.
In 2014, Google launched Google Cloud Dataflow, which was based on technology that evolved from MapReduce but included newer ideas like FlumeJava’s improved abstractions and MillWheel’s focus on streaming and real-time execution. Google Cloud Dataflow included from the start both a new programming model for writing data processing pipelines, as well as a fully managed service for executing them.
In 2016, Google, along with a handful of partners, donated the programming model to the Apache Software Foundation, as the incubating project Apache Beam. Since then, Apache Beam has graduated from incubation, becoming a new top-level project at the foundation.
JAXenter: What are your future plans for the project?
Davor Bonaci and Jean-Baptiste Onofré: We are working on improving the end-to-end user experience to create truly frictionless portability that “just works”. The Java community coined the phrase “write once, run anywhere”, but it took some time for that promise to truly materialize. Similar considerations may apply to Beam too.
Besides polishing the user experience, the next major milestone for the project is the availability of the first release that guarantees backward compatibility, which would make the project ready for enterprise deployments.
In terms of functionality, we plan to keep improving the core model and distill even more complex data processing patterns into simple and portable abstractions. Finally, we are focused on extending our ability to interconnect with additional related systems and user communities. Some examples include:
- New SDKs, such as Python.
- New DSLs, such as Scala or SQL.
- New IOs, connectors that improve our ability to read and write data across many storage and messaging systems.
- New runners, such as Apache Gearpump (incubating).
With this, Beam will come close to its vision of running any data processing logic on any engine.
Thank you very much!