Apache Mahout — People familiar with R will have a leg up
Machine learning may sound futuristic, but it's not. Speech recognition systems such as Cortana and search in e-commerce systems have already shown us the benefits and challenges that go hand in hand with these systems. In our machine learning series we will introduce you to several tools that make all this possible. Fourth stop: Apache Mahout.
This article is part of a Machine Learning series. Our fourth expert is Pat Ferrel, a committer to Apache Mahout and principal consultant at FinderBots. In this article, he talks about Apache Mahout, a project which aims to build an environment for quickly creating scalable, performant machine learning applications.
What is Apache Mahout?
Roll-your-own math in an interactive environment that actually executes on a big data platform, then move exactly the same code into your app and deploy. Mahout Samsara provides a performant, distributed linear algebra and stats engine, along with an interactive shell (now inside Apache Zeppelin) and the library to link into your application in production. No translation of code from R or Python to Java, just one integrated Scala-based math lib with an R-like DSL.
Scala is the basis of all the new (Samsara) code and it runs on Spark (most mature) or Flink (prototype implementation). Mahout uses a set of common core abstractions called the DSL (domain-specific language), which is loosely based on R, so people familiar with R will have a leg up. It comes with a BLAS-style optimizer to streamline and pick the fastest code to execute DSL commands. It also binds the DSL to the base compute platform, like Spark or Flink, so that much of your code can be written to run on either. But you also have access to the raw API of the compute platform, so nothing is lost and you can mix Mahout Samsara with MLlib, for example.
One area of interest for me is recommenders. An algorithm developed inside of Mahout makes heavy use of linear algebra. A core data type in Mahout is a distributed row matrix. The first step in Mahout’s Correlated Cross-Occurrence multimodal recommender algorithm is a type of matrix multiply. For example, let P be the history of all user purchases, with rows being users and columns being products, and let V be the history of all product detail pages viewed by a user. Most algorithms would not deal with these two types of data. The math we need is P’P and P’V, that is, the transpose of P times P and the transpose of P times V. As far as I know, Mahout Samsara is the only engine that will do both of these in an optimized way with just a few lines of code:
val drmP = IndexedDataset(pairRDD)
val drmV = IndexedDataset(pairRDD)
// the optimizer performs the best version of this based on several factors like data size and knowledge that this is a self-join
val modelPtP = drmP.t %*% drmP
// ditto but with the knowledge that this is not a self-join
val modelPtV = drmP.t %*% drmV
There is a lot more to the algorithm, but this core part, a simple construct in linear algebra, is not supported by other ML libraries because they tend to support only end-to-end algorithms, whereas Mahout Samsara supports the generalized math itself. That’s not to say that many common algorithms aren’t also implemented (like ALS, SSVD, PCA, etc.), but the difference is that Mahout concentrates on the math and so can provide the steps needed to build many of your own algorithms or data prep pipelines.
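To make the P’P and P’V products concrete, here is a minimal sketch in plain Scala on a toy, in-memory data set. It is only an illustration of the underlying linear algebra: it does not use Mahout's distributed DSL or optimizer, and the object name CrossOccurrenceSketch, the helper multiply, and the sample matrices are all hypothetical. It assumes both matrices have users as rows and products as columns, with 1 marking an interaction.

```scala
// Plain-Scala illustration of the products P'P and P'V on toy data.
// Assumption: everything here is hypothetical sample code, not Mahout's API.
object CrossOccurrenceSketch {
  type Matrix = Vector[Vector[Int]]

  // Naive dense matrix multiply: a (n x k) times b (k x m).
  def multiply(a: Matrix, b: Matrix): Matrix = {
    val bCols = b.transpose
    a.map(row => bCols.map(col => row.zip(col).map { case (x, y) => x * y }.sum))
  }

  // Hypothetical purchases: 3 users (rows) x 3 products (columns).
  val p: Matrix = Vector(
    Vector(1, 1, 0),
    Vector(0, 1, 1),
    Vector(1, 1, 0)
  )

  // Hypothetical detail-page views for the same users and products.
  val v: Matrix = Vector(
    Vector(1, 1, 1),
    Vector(0, 1, 1),
    Vector(1, 0, 0)
  )

  // P'P: entry (i, j) counts users who purchased both product i and product j.
  val ptp: Matrix = multiply(p.transpose, p)

  // P'V: entry (i, j) counts users who purchased product i and viewed product j.
  val ptv: Matrix = multiply(p.transpose, v)

  def main(args: Array[String]): Unit = {
    println("P'P:")
    ptp.foreach(row => println(row.mkString(" ")))
    println("P'V:")
    ptv.foreach(row => println(row.mkString(" ")))
  }
}
```

With this sample data, entry (0, 1) of P’P is 2 because two users bought both product 0 and product 1, and entry (0, 1) of P’V is 1 because one user who bought product 0 also viewed product 1. The rest of the recommender algorithm is not shown here.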
Plans for Apache Mahout
Through the Zeppelin integration we will get robust ML-type visualizations, not just stats. Any visualization built into ggplot or matplotlib is available, with native Mahout helper functions in the works. Further down the road, we are working on native code speedups from C and C++ libs and GPU acceleration.
What is so fascinating about machine learning?
The math, the often massive size of data, being on the bleeding edge of doing things that could never be done before. Some applications in recommenders are pretty close to reading a person’s mind, knowing what they will do before they do it. It doesn’t always work, but when you get it right it’s a bit spooky. Then there’s the satisfaction of being in an area that is where search was 20 years ago. Recommenders are becoming as much a part of discovery as search is.
Who would imagine using Amazon without all their convenient recommendations? Even their search is personalized, in what they call “behavioral search”. Haven’t you noticed? Now the rest of the world is catching up. If brick-and-mortar businesses are going to survive, they have to see the move to online as a move to being a technology company, not just a web company. I do consulting (ActionML.com) that helps give these technology noobs a leg up. They have to accelerate or get lost in the transition.
Do I think machines will someday take over the world? Of course they will. I can’t wait for Skynet and the Cylons, because we are the gods that will make them.
Would you like to learn more about machine learning?
Two key Mahout committers wrote a book about designing distributed algorithms that is a great primer—Apache Mahout: Beyond MapReduce, Dmitriy Lyubimov and Andrew Palumbo (https://www.amazon.com/Apache-Mahout-MapReduce-Dmitriy-Lyubimov/dp/1523775785).
For a more informal, intuitive discussion of single topics, look for Ted Dunning’s mini-books in the Practical Machine Learning series. For instance, he wrote one called Innovations in Recommendation (http://shop.oreilly.com/product/0636920033172.do).
We asked Pat Ferrel to finish the following sentences:
In 50 years’ time machine learning will be embedded in every thing, application, and service that uses a CPU.
If machines become more intelligent than humans I’ll have them fix my bugs.
Compared to a human being, a machine will never know the joy of a nice healthy bowel movement.
Without the help of machine learning, mankind would never transcend its organic limitations by augmenting organic learning with machine learning.