Machine learning with Oryx: Run wild with real-time ML
The open source machine learning frameworks just keep on coming. Oryx 2 is focused on real-time, large-scale machine learning and harnesses the power of three tiers. Grab it by the horns and create custom applications.
Add another machine learning framework to your radar. Meet Oryx 2: an open source framework created “for real-time large scale machine learning”.
Large scale machine learning
From their website, Oryx is a “realization of the lambda architecture built on Apache Spark and Apache Kafka”. Apache Spark has the benefit of being incredibly fast in-memory, beating out Hadoop in the race. (Of course, Apache Spark and Hadoop can be used for different things, so your preference between the two may depend on more than speed alone. As you will see under the requirements, Oryx requires both.) Meanwhile, Apache Kafka is a distributed streaming platform for building real-time streaming applications and data pipelines.
If it sounds familiar to you, it should! Oryx 2 is the sequel to the original Oryx project. The updated framework uses a new architecture consisting of three tiers that can be implemented together or independently of one another.
The project’s GitHub invites programmers to “deploy ready-made, end-to-end applications for collaborative filtering, classification, regression and clustering”.
Oryx’s main focus is real-time large scale machine learning. It is a workhorse of a framework. (A workoryx?)
The three-tier system is the bread and butter of Oryx, and knowing how to use the tiers is the key to making great apps.
From the documentation:
A generic lambda architecture tier, providing batch/speed/serving layers, which is not specific to machine learning
A specialization on top providing ML abstractions for hyperparameter selection, etc.
An end-to-end implementation of the same standard ML algorithms as an application (ALS, random decision forests, k-means) on top
It’s all about mixing and matching the layers. While you don’t have to use them all, they can work together. Again, let’s take it straight from the oryx’s mouth and learn more about each layer:
A Batch Layer, which computes a new “result” (think model, but it could be anything) as a function of all historical data and the previous result. This may be a long-running operation which takes hours, and runs a few times a day, for example.
A Speed Layer, which produces and publishes incremental model updates from a stream of new data. These updates are intended to happen on the order of seconds.
A Serving Layer, which receives models and updates and implements a synchronous API exposing query operations on the result.
A data transport layer, which moves data between layers and receives input from external sources
The Batch and Speed Layers are implemented as Spark Streaming processes running on a Hadoop cluster. Meanwhile, the data transport layer is an Apache Kafka topic, and the Serving Layer maintains the model state in memory.
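To make the Serving Layer concrete: in Oryx’s ready-made end-to-end applications, such as the ALS recommender, results are exposed through an HTTP API that clients can query. The sketch below is a minimal, hypothetical client — the host, port, and `/recommend/{userID}` endpoint shape are assumptions for illustration, not a definitive description of the API, so check the Oryx docs for the application you actually deploy.

```java
import java.net.URI;

// Minimal sketch of a client for an Oryx Serving Layer instance.
// The endpoint shape (GET /recommend/{userID}) and the port are
// assumptions based on the end-to-end ALS recommender application.
public class ServingLayerClient {
    private final String host;
    private final int port;

    public ServingLayerClient(String host, int port) {
        this.host = host;
        this.port = port;
    }

    // Build the query URI for a user's recommendations.
    public URI recommendUri(String userId) {
        return URI.create("http://" + host + ":" + port + "/recommend/" + userId);
    }

    public static void main(String[] args) {
        ServingLayerClient client = new ServingLayerClient("localhost", 8080);
        // An HTTP GET against this URI would return the current model's
        // recommendations, reflecting both the Batch Layer's last full
        // result and any incremental Speed Layer updates since then.
        System.out.println(client.recommendUri("user42"));
        // prints http://localhost:8080/recommend/user42
    }
}
```

The point of the synchronous API is that queries always hit the in-memory model state, so reads stay fast even while new models and updates stream in behind the scenes.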
The project’s GitHub also provides a helpful architecture diagram of this system.
Adding Oryx to your zoo
Do you have any burning use cases for Oryx? Check out their page about making an app with the framework.
In order to build an app with Oryx version 2.7.0, you will need:
- Git (or an IDE with Git support)
- Apache Maven 3.2.5 or later
- Java JDK 8 or later (Not so subtle reminder: With the release of Java 11, there is no better time to upgrade your JDK)
- An up-to-date Apache Hadoop cluster including the following components: Apache Hadoop, Apache ZooKeeper, Apache Kafka, and Apache Spark
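With those prerequisites in place, fetching and building the source follows the usual Git-plus-Maven workflow. This is a sketch of the typical steps, assuming the repository location from the project’s GitHub page and default build settings; it is a one-time setup, not something to run in production as-is.

```shell
# Fetch the Oryx 2 source (repository path assumed from the project's GitHub)
git clone https://github.com/OryxProject/oryx.git
cd oryx

# Build all modules with Maven 3.2.5+ on JDK 8+;
# -DskipTests shortens the first build considerably
mvn -DskipTests package
```

From there, the generated binaries are what you deploy alongside your Hadoop, ZooKeeper, Kafka, and Spark cluster.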
See the Javadoc for the nitty-gritty details.
Is this the big data framework for you? Give it a go and maybe you’ll be adding a new member to your machine learning zoo.
What kind of applications will you build with Oryx?