Apache MLlib — Making practical machine learning easy and scalable
Machine learning may sound futuristic, but it’s not. Speech recognition systems such as Cortana and search in e-commerce systems have already shown us the benefits and challenges that go hand in hand with these systems. In our machine learning series we will introduce you to several tools that make all this possible. Second stop: MLlib, Apache Spark’s scalable machine learning library.
This interview is part of a Machine Learning series. Our second expert is Xiangrui Meng, Apache Spark PMC member and software engineer at Databricks. In this interview, he talks about MLlib, Apache Spark’s scalable machine learning library.
JAXenter: What is the idea behind Apache MLlib?
Xiangrui Meng: MLlib’s mission is to make practical machine learning easy and scalable. We want to make it easy for data scientists and machine learning engineers to build real-world machine learning (ML) pipelines. A pipeline includes not only fitting models but also stages such as data collection and labelling, feature extraction and transformation, model tuning and evaluation, model deployment, etc. This becomes a very hard problem when people try to solve each stage using a different library and then chain the stages together in production (think of different languages, different tuning tips, different data formats, different resource requirements, etc.). MLlib, combined with other components of Apache Spark, provides a unified solution under the same framework. For example, one can use Spark SQL to generate training data from different sources and then pass it directly to MLlib for feature engineering and model tuning, instead of using Hive/Pig for the first half and then downloading the data to a single machine to train models in R. The latter is actually very common in practice but painful to maintain. Spark MLlib makes life easier for data scientists and machine learning engineers so that they can focus on building better ML models and applications.
We also want MLlib to be capable of learning from large-scale datasets. There’s an enormous amount of data being collected every day. Having more data creates more potential to extract value. However, that potential is often limited by the scalability of implementations, which can be slow on big datasets or unable to handle them at all. MLlib provides scalable implementations of popular machine learning algorithms, which let users train models on big datasets and iterate fast.
ML applications form a long tail. We do not expect MLlib to cover all of them. Instead, we provide standard feature transformers and ML algorithms as the basic building blocks and expose high-level APIs so that developers and users can extend them to build their own applications. We maintain an index to help users find the resulting packages.
JAXenter: Tell us more about what’s under this project’s hood: What language(s) do you use?
Xiangrui Meng: We use Scala to implement core algorithms and utilities in MLlib and expose them in Scala as well as Java, Python, and R. Most users can pick their favorite language and get started with MLlib.
JAXenter: Can you give us an example?
JAXenter: What does the future hold for MLlib?
Xiangrui Meng: We’ve made significant progress towards our mission but we’re not fully there yet. Considering the keywords “practical”, “easy”, and “scalable”, I think there are plenty of things that can be done along those directions and I feel excited about those possibilities. For example, we can design high-level APIs to address more complex real-world pipelines, improve feature parity across languages, or make MLlib’s implementations run faster and work with billions of features.
I’m also looking forward to better integration with other components of Spark, in particular, integrating with Project Tungsten to approach bare-metal performance and with Structured Streaming to do real-time machine learning.
As an open-source project, the future of MLlib lies in the community. With approximately 300 contributors and a very active community of users, I’m confident that MLlib will expand quickly and gain more and more contributors and users in the future.
JAXenter: What is so fascinating about machine learning?
Xiangrui Meng: Many believe that there are hidden gems in big data. Machine learning lets us see them via reflection.
JAXenter: Do you think machines will someday take over the world? Are those fears well founded?
Xiangrui Meng: Maybe, but perhaps not something I need to worry about in my life.
JAXenter: What are the top three blogs/movies/books that come to your mind when someone says they would like to know more about machine learning?
Thank you very much!