Open-source streaming analytics

Big Data with Apache Apex

Desmond Chan

It’s touted as the industry’s only open-source, enterprise-grade unified stream and batch processing platform. Apache Apex community manager Desmond Chan shows us exactly what that means and how this open-source engine handles big data.

The promise of big data is fuelling huge industry growth. For more than a decade, Apache Hadoop has promised the power of big data without the costs that normally come with expensive storage and powerful processing systems. But batch processes, like MapReduce, and early attempts at real time have done little to bring streaming analytics to Hadoop. Without accessible streaming analytics, valuable insights cannot be acted upon. As Forrester points out in The Forrester Wave for Big Data Streaming Analytics Platforms, exploiting perishable insights is a huge, untapped opportunity.

With the sheer velocity of data, companies struggle to make decisions in a timely manner. To combat this problem and help democratise streaming analytics, DataTorrent has introduced Apache Apex, an open-source, enterprise-grade unified stream and fast batch processing platform. Apache Apex enables enterprises to act on data in real time, making more effective use of data as it arrives.

Apache Apex moved quickly from code to acceptance into the Apache Software Foundation. Following the posting of the source code to GitHub on July 30, Project Apex was submitted for incubation and accepted as Apache Apex on August 17. With more than 500 members in the Apache Apex meet-up group and affiliations from individuals at companies like Apple, Barclays, Capital One, DirecTV, Hortonworks, and MapR, the project has been welcomed as a solution to longstanding challenges in the Hadoop ecosystem.

Enterprises now have free access to the DataTorrent RTS 3 core engine and the Malhar library of operators for common business logic and easy integration of Hadoop with other systems. Apache Apex was architected to fulfil the 10-year promise of Hadoop with a platform that spurs innovation and speeds development.

Apache Apex: architected differently

Apache Apex was made possible by the introduction of YARN. YARN brought the capability of exploring how distributed resources handling big data could perform “a lot of things”, thus going beyond the early MapReduce paradigm, and in a way beyond batch or even compute-going-to-data paradigms. YARN presented the capability to allow big data to not just become big in size, but broader in use cases. With its enabling capability as a Hadoop facilitator, YARN has pushed Hadoop towards realising its true potential and opened the doors for Apache Apex.

Apache Apex is the first ever YARN native engine. This native architecture allows Apache Apex to fulfil the decade-old promise of productizing Hadoop. Because big data is tough to envisage in its entirety, the platform had to be created to become the basis for driving big data processing needs, in a batch paradigm, streaming paradigm, or both. Apache Apex is the industry’s only open-source enterprise-grade engine capable of handling batch data as well as streaming data needs. Apache Apex is groomed to drive the highest value for businesses operating in highly data-intensive environments.

Here is why Apache Apex is a go-to solution for bringing big data projects to success.

Simplicity and expertise

Programming against the Apache Apex API is simple, as developers can use Java or Scala. An application is a directed acyclic graph (DAG) of multiple operators. The operator developer only needs to implement a simple process() call. This API allows users to plug in any function (or UDF) for processing incoming events. Single-threaded execution and the need for only application-level Java expertise are the top reasons why Apex enables big data teams to develop applications within weeks, and allows them to go live in as little as three months. Not only is Apex simple to deploy and customise, but the expertise required to leverage its full capabilities is easily available.
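To make the operator idea concrete, here is a minimal plain-Java sketch of operators that each implement a single process() call and are wired into a linear pipeline, the simplest kind of DAG. It does not use the Apex API: the Operator interface and the map, filter, connect and run helpers are hypothetical names invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

/** Plain-Java model of an operator pipeline. Hypothetical names, NOT the Apex API. */
class OperatorSketch {

    /** A hypothetical operator: receives one event at a time and emits results downstream. */
    interface Operator<I, O> {
        void process(I event, Consumer<O> emit);
    }

    /** Wraps a user-defined function (UDF) as an operator that emits one result per event. */
    static <I, O> Operator<I, O> map(Function<I, O> udf) {
        return (event, emit) -> emit.accept(udf.apply(event));
    }

    /** Forwards only the events that match the predicate. */
    static <T> Operator<T, T> filter(Predicate<T> keep) {
        return (event, emit) -> {
            if (keep.test(event)) {
                emit.accept(event);
            }
        };
    }

    /** Connects two operators with a stream: the output of the first feeds the second. */
    static <A, B, C> Operator<A, C> connect(Operator<A, B> first, Operator<B, C> second) {
        return (event, emit) -> first.process(event, mid -> second.process(mid, emit));
    }

    /** Runs the pipeline on a single thread, one event at a time. */
    static <I, O> List<O> run(Operator<I, O> pipeline, List<I> events) {
        List<O> out = new ArrayList<>();
        for (I event : events) {
            pipeline.process(event, out::add);
        }
        return out;
    }

    /** Builds a three-operator pipeline (parse, keep evens, format) and runs it. */
    static List<String> demo() {
        Operator<String, Integer> parse = map(s -> Integer.parseInt(s));
        Operator<Integer, Integer> evens = filter(n -> n % 2 == 0);
        Operator<Integer, String> format = map(n -> "even:" + n);
        Operator<String, String> pipeline = connect(connect(parse, evens), format);
        return run(pipeline, List.of("1", "2", "3", "4"));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [even:2, even:4]
    }
}
```

Each operator holds only its own logic, and the wiring is kept separate from it, which is the property the DAG model relies on. A real Apex application would use the platform's own operator and port classes instead of these stand-ins.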

Code reuse

Developers require minimal training to build big data applications on Apache Apex. What’s more, they do not need to make significant changes to their business logic; minimal tweaking of their existing code suffices. A complete separation of functional specification from operational specification greatly enhances reuse of existing code. Additionally, Apex enables defining reusable modules, and it allows the same business logic to be used for stream as well as batch processing.

Apex is a data-in-motion platform that unifies the processing of never-ending streams of unbounded data (a streaming job) and of bounded data in files (a batch job). Organisations can build applications to suit their business logic, and extend the applications across both batch and streaming jobs. The Apache Apex architecture can handle reading from and writing to message buses, file systems, databases or any other sources. As long as these sources have client code that can be run within a JVM, the integration works seamlessly.
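The idea of writing business logic once and running it over both bounded and unbounded input can be sketched in plain Java. This is an illustration under stated assumptions, not Apex code: WordCounter, runBatch and runStream are hypothetical names, a List stands in for a file, and an endless Iterator stands in for a message bus.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

/** Same logic over bounded and unbounded input. Hypothetical names, NOT the Apex API. */
class UnifiedSketch {

    /** Business logic, written once: counts words in each incoming line. */
    static class WordCounter {
        final Map<String, Integer> counts = new TreeMap<>();

        void process(String line) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
    }

    /** Batch job: the source is bounded, so it is drained completely. */
    static Map<String, Integer> runBatch(List<String> lines) {
        WordCounter logic = new WordCounter();
        lines.forEach(logic::process);
        return logic.counts;
    }

    /** Streaming job: the source may never end; this demo stops after maxEvents. */
    static Map<String, Integer> runStream(Iterator<String> source, int maxEvents) {
        WordCounter logic = new WordCounter();
        for (int i = 0; i < maxEvents && source.hasNext(); i++) {
            logic.process(source.next());
        }
        return logic.counts;
    }

    public static void main(String[] args) {
        List<String> file = List.of("apex on yarn", "yarn on hadoop");
        // Unbounded source: cycles through the same two lines forever.
        Iterator<String> bus = Stream.iterate(0, i -> i + 1)
                .map(i -> file.get(i % 2))
                .iterator();
        System.out.println(runBatch(file));     // prints {apex=1, hadoop=1, on=2, yarn=2}
        System.out.println(runStream(bus, 2));  // prints {apex=1, hadoop=1, on=2, yarn=2}
    }
}
```

Only the driver differs between the two jobs; the WordCounter class is untouched, which is the kind of reuse described above.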


Apex is built for enhanced operability, such that applications built on Apex need only worry about the business logic. For native fault tolerance, the Apex platform ensures that data is not lost, that the master (metadata) is backed up and, equally important, that the application state is retained in HDFS (a persistent store). Businesses can choose other persistent stores with a DFS interface if need be. With Apex, fault tolerance is native to Hadoop and requires neither an additional system outside of Hadoop to hold state nor any extra code written by users. Apex applications can recover from Hadoop outages, resuming from their last-known state persisted in HDFS.
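Recovery from a last-known persisted state can be illustrated with a small plain-Java sketch. It is not how Apex implements checkpointing: the State and CheckpointStore classes are hypothetical, an in-memory object stands in for HDFS, and the checkpoint interval and failure point are arbitrary demo values.

```java
import java.util.List;

/** Checkpoint-and-resume sketch. Hypothetical names, NOT the Apex implementation. */
class CheckpointSketch {

    /** Operator state: a running sum plus the index of the next event to read. */
    static final class State {
        final long sum;
        final int nextOffset;

        State(long sum, int nextOffset) {
            this.sum = sum;
            this.nextOffset = nextOffset;
        }
    }

    /** Stand-in for a persistent store such as HDFS; holds the last saved state. */
    static final class CheckpointStore {
        private State saved = new State(0, 0);

        void save(State s) { saved = s; }

        State load() { return saved; }
    }

    /**
     * Processes events starting from the last checkpoint and saves state every
     * `interval` events. Throws at index `failAt` to simulate a crash (-1 = never).
     */
    static long run(List<Integer> events, CheckpointStore store, int interval, int failAt) {
        State start = store.load();
        long sum = start.sum;
        for (int i = start.nextOffset; i < events.size(); i++) {
            if (i == failAt) {
                throw new IllegalStateException("simulated container failure");
            }
            sum += events.get(i);
            if ((i + 1) % interval == 0) {
                store.save(new State(sum, i + 1));
            }
        }
        return sum;
    }

    /** Crashes once mid-run, then resumes from the checkpoint and finishes. */
    static long demo() {
        List<Integer> events = List.of(1, 2, 3, 4, 5, 6, 7);
        CheckpointStore store = new CheckpointStore();
        try {
            run(events, store, 2, 5); // crashes before processing index 5
        } catch (IllegalStateException e) {
            // In-memory progress is lost; only the checkpointed state survives.
        }
        return run(events, store, 2, -1); // resumes from the checkpoint
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 28
    }
}
```

Because the saved state records both the sum and the offset it corresponds to, the restarted run replays only the events after the checkpoint, so nothing is lost and nothing is counted twice.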

It is easy to run two Apex applications of different versions in the same Hadoop cluster, or even to change applications dynamically, and it is easy to operate and upgrade applications. Once the application is running, there is no new scheduling overhead unless a change in resources is requested by the application or by YARN. Apex has built-in data-in-motion constructs that enable data to flow at millions of events per second on a single core. Apex is thus an easy-to-leverage, highly scalable platform that is built on the same security standards as Hadoop.

Integration and ease of use

The Apex platform comes with support for web services and metrics. This enables ease of use and easy integration with current data pipeline components. DevOps teams can monitor data in action using existing systems and dashboards with minimal changes, thereby easily integrating with the current setup. With different connectors and the ease of adding more connectors, Apex easily integrates with an existing dataflow.

Leverage investments in Hadoop

Apex is a native YARN big data-in-motion platform with a vision of leveraging existing distributed operating systems, not building another. Just like MapReduce, it does not have its own resource scheduler and management, distributed file system, security setup, or the other common utilities available within a distributed operating system.

Apex leverages all YARN features without overlapping with YARN, while using HDFS as the default persistent state store. All investments by enterprises in Hadoop in terms of expertise, hardware, and integration are leveraged. As YARN matures, so will the platform, becoming capable of handling gigantic volumes of data while keeping the cost of operations under control.


Apex includes a sub-project, Apache Malhar – a library of pre-built Java operators. These pre-built operators for data sources and destinations of popular message buses, file systems, and databases, enable organizations to experience accelerated development of business logic. This reduces the time-to-market significantly, allowing for greater success in developing and launching big data projects.

Looking forward…

Looking ahead to where the Hadoop ecosystem is going, there will be even more companies that invest in real-time stream processing in order to gain actionable insights as data happens and is most valuable. While many companies have already accepted the fact that real-time streaming is valuable, we’ll see users looking to take it a step further and quantify their streaming use cases.

In the next year, customers using data streaming tools will reach a new level of sophistication and demand the ability to quantify ROI for streaming analytics. Technology companies are often the early adopters, but as stream processing becomes more and more recognised for creating greater value from Hadoop, we can expect a diverse range of enterprises to invest.


Desmond Chan

Desmond Chan is director of product marketing at DataTorrent and Apache Apex community manager.
