Apache Bahir against a “wild west” of Big Data projects
The Apache Software Foundation recently announced Apache Bahir as a Top-Level Project. We asked Chris Mattmann, a Principal Data Scientist and the Chief Architect in the Instrument and Data Systems section at the Jet Propulsion Laboratory (JPL) in Pasadena, California, to comment on this project and reveal what’s next for Apache Bahir.
Apache Bahir bolsters Big Data processing by serving as a home for existing connectors that originated under Apache Spark, as well as providing additional extensions/plugins for related distributed systems, storage, and query execution systems.
JAXenter: What is the idea behind Apache Bahir? Was there a gap to be filled?
Chris Mattmann: Apache Spark by itself is a framework and system for performing fast, large-scale, in-memory analytics.
There are plugins around Spark, and a set of “Spark Connectors” developed alongside it was originally going to be put on GitHub. Instead, it made more sense to keep them closely knit to the Apache community, and thus Apache Bahir was born.
JAXenter: According to the press release, Bahir is “a place to curate extensions related to distributed analytic platforms following the Apache Governance” – how exactly does this work?
Chris Mattmann: The idea is that Apache Bahir is an Apache community project and it abides by Apache’s principles of meritocracy. Work done should earn rights to commit and help guide and steward the project. You cannot guarantee that type of thing at GitHub.
At the ASF, the communities follow these principles that ensure long-lasting ASF projects.
It made more sense to keep the “Spark Connectors” closely knit to the Apache community, and thus Apache Bahir was born.
JAXenter: Can you give us an example of a typical use case?
Chris Mattmann: One use case is Spark streaming and the connector for streaming-mqtt. Apache Spark is natively optimized for low-latency environments, and the streaming-mqtt plugin extends Spark’s capabilities through a message queue, which is better suited for high-latency and unreliable networks.
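As a rough illustration of the pattern he describes, a Spark Streaming job can subscribe to an MQTT topic via Bahir’s `MQTTUtils.createStream`. This is a minimal sketch, assuming the `spark-streaming-mqtt` artifact from Bahir is on the classpath; the broker URL and topic name are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.mqtt.MQTTUtils

object MqttStreamSketch {
  def main(args: Array[String]): Unit = {
    // Streaming context with a 10-second batch interval
    val conf = new SparkConf().setAppName("BahirMqttExample")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Placeholder broker URL and topic; the MQTT broker buffers
    // messages when the network drops, and Spark picks them up
    // once connectivity returns
    val brokerUrl = "tcp://localhost:1883"
    val topic     = "sensors/example"

    // DStream of message payloads arriving on the topic
    val messages = MQTTUtils.createStream(ssc, brokerUrl, topic)

    // Simple per-batch count, as a stand-in for real analytics
    messages.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The broker acts as the decoupling layer: producers publish whenever they have connectivity, and the Spark job consumes at its own pace, which is what makes the combination attractive for unreliable networks.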
JAXenter: As the Chief Architect at NASA’s Instrument and Science Data Systems Section, do you use Apache Bahir in your work?
Chris Mattmann: We are extremely interested in and exploring streaming-mqtt from Apache Bahir in the context of the JPL Airborne Snow Observatory (ASO) project, which must operate in the Mammoth Lakes area, where cell phone and Internet connectivity is poor and unreliable.
Apache Bahir and streaming-mqtt seem to be a perfect fit for this environment, in which we are processing terabytes of snow depth and snow melt rate data from our LiDAR and spectrometer instruments on the JPL ASO.
A “wild west” of Big Data projects is very hard to curate.
JAXenter: Big Data is becoming more and more important — what are the next steps? What does the future look like in your view?
Chris Mattmann: Having projects operate in a meritocratic, community-oriented fashion, especially in the Big Data space, is something that to me is imperative. A “wild west” of Big Data projects is very hard to curate – having projects at the ASF, readily available and governed by community-oriented processes, is what is required, in my mind, to take the next steps for Big Data.
Thank you very much!