Skymind’s Deeplearning4j, the Eclipse Foundation, and scientific computing in the JVM
Why did Skymind join the Eclipse Foundation last month? Chris Nicholson, CEO of Skymind and creator of Deeplearning4j, explains why contributing its open-source libraries to the foundation shows developers and enterprises that Deeplearning4j is mature, secure, and a safe bet for deep learning.
Last month, Skymind joined the Eclipse Foundation and contributed its open-source libraries. We did that to send a signal to developers and to their employers that Deeplearning4j is mature, secure, properly licensed, and responsibly governed. That tells them we’re a safe bet.
The Eclipse Foundation is where a lot of major Java projects (like Java EE) have gone recently. Eclipse is also a community of people and companies coming together to tackle large technical challenges. They’re very supportive, and they’re doing interesting work, so that was a community we wanted to be a part of.
Finally, they gave us a lot of freedom, which you don’t always get with other foundations. We’ve seen some projects at other foundations become highly political, and that has slowed down their development. In tech, when you slow down, you die. Most tech companies die in their sleep. So we couldn’t afford to choose a foundation that posed a risk to moving Deeplearning4j forward.
How Deeplearning4j became the #1 DL tool for Java
Timing was a big part of our growth and community building. That is, we started working on Deeplearning4j at the right time. Machine learning, and deep learning in particular, have changed a lot in the last decade. Big data, GPUs, and a wave of great machine-learning research were all necessary to create the powerful algorithms we have today, and none of that really existed in the 1990s or even the early 2000s.
Deeplearning4j was conceived in late 2013, when Adam and I were living in a hacker house in San Francisco. I think the first pull request came in the spring of 2014. The project was built to work on GPUs. It was built to incorporate recent research, because the new algorithms were producing very high accuracy. And it was built to integrate with the big data ecosystem that has emerged on the JVM in the last 10 years: Hadoop, Spark, Kafka, Elasticsearch, Cassandra. Deep learning needs big data to train on, and a lot of data science fails at the point when you need to integrate. We wanted to make sure this worked on the big data stack.
Adam’s 27. He learned software engineering in the 2000s by working with open-source software and those communities, on projects like Spring, for example. And Josh Patterson, our VP of field who has been working on this from the beginning, was an early employee at Cloudera working on Hadoop.
So from the start, we had a pretty good idea what would make an open-source project succeed: a kernel of hard-core engineers creating something powerful and being responsive to the community. Smart people attract smart people. Once you get a small but devoted community, it starts to snowball, and all those people help each other. Pretty soon they start to build cool projects on top of your code, and they blog about it.
And I promoted Deeplearning4j like mad. My background is journalism. When Adam and I started talking about Skymind, I was working as head of comms at a Sequoia-backed startup called FutureAdvisor. So I went to some reporters I knew and said: Look, there’s this kid genius who’s taking the cool technology we hear about at Google and he’s implementing it open source in Java. This was 2014 and most people were just beginning to intuit how powerful deep learning was. That helped us gain momentum at the start. It’s not always enough to build something great. You need to make sure people know about it. So we worked hard to do that.
What we see now is a lot of deep learning tools that have been built for research, and very few that are suited to solving enterprise problems and deploying in production. We built Deeplearning4j for that, and we find a lot of the folks who have prototyped a deep neural net in Python get stuck when they deploy to production. Their tools don’t integrate with the rest of the big data stack. They run into a wall. DevOps doesn’t want to support a blob of Python and C code, because it’s insecure and hard to debug. So we see those people asking for help deploying their models on a JVM compute environment. One way to think about Deeplearning4j is as a complete deep learning toolkit for the JVM; another is as a bridge between Python and the JVM, which a lot of companies need now.
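One concrete form of that bridge is Deeplearning4j's Keras model import, which loads a network trained in Python for serving on the JVM. The snippet below is an illustrative sketch, not official sample code: it assumes the deeplearning4j-modelimport module is on the classpath, and the file name `model.h5` and the input width of 10 are hypothetical placeholders.

```java
import org.deeplearning4j.nn.modelimport.keras.KerasModelImport;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class KerasBridgeSketch {
    // Load a Keras Sequential model saved in Python with model.save("model.h5").
    // The path is a placeholder for illustration.
    public static MultiLayerNetwork loadModel(String path) throws Exception {
        return KerasModelImport.importKerasSequentialModelAndWeights(path);
    }

    public static void main(String[] args) throws Exception {
        MultiLayerNetwork model = loadModel("model.h5");

        // Score a single example from the JVM side; the input shape is hypothetical.
        INDArray features = Nd4j.rand(1, 10);
        INDArray prediction = model.output(features);
        System.out.println(prediction);
    }
}
```

The model is trained wherever the data scientists are comfortable and served wherever DevOps is comfortable, which is the whole point of the bridge.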
It’s been interesting to see data science develop as a field. Python seems to have overtaken R as the main language, but a lot of data science projects struggle to produce value for the companies that fund them. And at the same time, there’s a ton of interesting work happening in Java, Scala, and Clojure. If you don’t choose your tools strategically, a gap can emerge between the worlds of data science and enterprise software, and that has real consequences, because it means that the data scientists you hire are unable to get their models running in production. Things fall apart on the last mile. We’ve done a lot of work to make sure we can take people from notebooks to production.
Beyond Deeplearning4j itself, we have a bunch of other libraries that help make deep learning happen on the JVM. Here’s the breakdown:
- Deeplearning4j: Neural network DSL (facilitates building neural networks integrated with data pipelines and Spark)
- ND4J: N-dimensional arrays for Java, a tensor library. The goal is to provide tensor operations and optimized support for various hardware platforms. Numpy for Java.
- DataVec: An ETL library that vectorizes and “tensorizes” data. Extract/transform/load with support for connecting to various data sources and outputting n-dimensional arrays via a series of data transformations.
- libnd4j: Pure C++ library for tensor operations, which works closely with the open-source library JavaCPP (JavaCPP was created and is maintained by a Skymind engineer, but it is not part of this project).
- RL4J: Reinforcement learning on the JVM, integrated with Deeplearning4j. Includes Deep-Q learning and A3C, the algorithms DeepMind used to master Atari games.
- Jumpy: A Python interface to the ND4J library that integrates with Numpy.
- Arbiter: Automatic tuning of neural networks via hyperparameter search. Hyperparameter optimization using grid search, random search and Bayesian methods.
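To make the division of labor concrete, here is a minimal sketch of ND4J and Deeplearning4j working together, written against the public builder API as I understand it; the layer sizes and random input data are arbitrary choices for illustration.

```java
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class StackSketch {
    public static INDArray run() {
        // ND4J: n-dimensional arrays, "Numpy for Java"
        INDArray features = Nd4j.rand(8, 4);   // 8 examples, 4 features each

        // Deeplearning4j: a tiny two-layer network over those tensors
        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                .list()
                .layer(new DenseLayer.Builder().nIn(4).nOut(16)
                        .activation(Activation.RELU).build())
                .layer(new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                        .nIn(16).nOut(3)
                        .activation(Activation.SOFTMAX).build())
                .build();

        MultiLayerNetwork net = new MultiLayerNetwork(conf);
        net.init();

        // Forward pass: one row of 3 class probabilities per input example
        return net.output(features);
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In practice, the INDArray that feeds the network would come out of a DataVec pipeline rather than `Nd4j.rand`, and the same configuration can be wrapped for distributed training on Spark.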
The Quiet Impact of ND4J
The whole of these libraries is greater than the sum of the parts, because each one solves problems that make the others work better. Taken together, they are the basis of a general-purpose scientific computing and machine learning ecosystem for enterprise.
In particular, ND4J’s numerical computing operations and its connection to fast C++ code through JavaCPP make machine learning performant on the JVM big data stack. DataVec makes building data pipelines easier, and as anyone who has worked on AI knows, data pipelines are 80% of the work. The ability of Java and Scala to handle multi-threading and concurrency makes them compelling alternatives to Python and R for the large-scale machine learning tasks faced by enterprises.
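A sketch of what that pipeline work looks like on the DataVec side, assuming DataVec's Schema/TransformProcess API and entirely made-up column names: you declare the shape of the raw data, then a series of transformations that prepare it for vectorization.

```java
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.schema.Schema;

public class PipelineSketch {
    public static TransformProcess build() {
        // Declare the shape of the raw data (column names are hypothetical)
        Schema schema = new Schema.Builder()
                .addColumnString("timestamp")
                .addColumnDouble("temperature")
                .addColumnCategorical("status", "OK", "FAIL")
                .build();

        // Declare the transformations; DataVec can execute them locally or on Spark
        return new TransformProcess.Builder(schema)
                .removeColumns("timestamp")
                .categoricalToInteger("status")
                .build();
    }

    public static void main(String[] args) {
        // The final schema is what gets vectorized into ND4J arrays
        System.out.println(build().getFinalSchema());
    }
}
```

Because the transform process is declared rather than hand-coded, the same pipeline definition runs unchanged on a laptop or across a Spark cluster, which is where the "80% of the work" starts to pay for itself.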