Studio.ML bridges the gap between data scientists & DevOps engineers
There are a number of issues that arise when data scientists and ML researchers meet DevOps to deploy, audit, and maintain state-of-the-art AI models in a production, commercial environment. Peter Zhokhov and Arshak Navruzyan discuss a new open-source tool called Studio.ML, which offers solutions to a number of these problems.
Machine learning (ML) and artificial intelligence (AI) technologies permeate most industries these days. Much like software in the early computer era, ML and AI have stopped being a toy for researchers and have become a serious revenue-generating instrument for businesses. And just as software development is driven by business goals, there is a need to make the development-to-production cycle of ML models shorter, more robust, and more easily reproducible. This need often collides with the “new and shiny thing” syndrome, a known anti-pattern in software engineering. But for ML/AI-driven enterprises, the ability to use the results of the latest research often means a competitive edge and tangible profit.
In other words, if your company’s data scientists and platform engineers make each other’s lives miserable, don’t be alarmed — it is not unusual. In fact, that is exactly the situation we found ourselves in at Sentient Technologies. When we tried to build a production-robust deep learning and distributed computing framework to work with Java, we found that all of our newly hired data scientists preferred building the models using Python instead.
To elaborate on the origin of the problem, consider the typical data science and DevOps pipelines in Fig. 1. To be clear, we are not claiming to cover all possible use cases with this simplified picture, only those most frequent and relevant to productizing AI/ML models. In the data science world, shown in row A, the typical process starts with gathering data, iterating on it, and debugging the code. Then heavy compute usually kicks in: optimizing hyper-parameters and the neural network architecture, either by hand or using modern automatic methods such as neuro-evolution. The process finally yields the best trained model. After that, data scientists usually do a nebulous thing called “pushing the model to production,” which usually comes down to handing the best model over to the DevOps or engineering team.
At the same time, in the DevOps world shown in row B, the process starts with a container. The container has to pass a bunch of unit and regression tests, get promoted to staging, pass soak tests, and then go to production. The model container then becomes part of the microservice architecture and starts interacting with other production system components before it gets autoscaled and load-balanced. With these two worlds in mind, one can see the cracks things tend to fall through: when the model is not tied tightly enough to the container, it bounces around when the container is shipped over rough seas.
DevOps tends to test container functionality nominally, without checking whether the model itself functions correctly, whereas data scientists’ tests do not take into account the fact that the model will run inside a container: the pre-processing functions, the locations of weight files, and so on may end up in different folders. Our experience tells us it is exactly these small things that end up being overlooked.
This DevOps and data science dichotomy has pushed us to create the open-source project Studio.ML. Studio.ML aims to bridge the gap between research-y data science that has to chase the new and shiny thing as often as needed and the kind of good software engineering or DevOps practices that make results fully reproducible and avoid “runs on my machine” problems. Studio.ML also automates menial data science tasks such as hyper-parameter optimization and leverages cloud computing resources in a seamless way without any extra cognitive load on the data scientist to learn about containers or instance AMIs.
The core idea of the project is to be as non-invasive to the data scientist’s or researcher’s code as possible, while storing experiment data centrally to make experiments reproducible, shareable, and comparable. In fact, Studio.ML can usually provide substantial value with no code changes whatsoever. The only requirement is that the code is written in Python. Since the early days when we were building deep learning frameworks in Java and experimenting with Lua, Python has become the de facto standard of the machine learning community, thanks to its vast set of data analysis and deep learning libraries. So if you are a researcher working on the cutting edge of ML/AI, the requirement of writing Python code will likely be fulfilled naturally.
From a researcher’s perspective, if the following command line
python my_awesome_training_script.py arg1 arg2
trains the model, then
studio run my_awesome_training_script.py arg1 arg2
runs the training and stores the information necessary to reproduce the experiment: the set of Python packages with their versions, the command line, and the state of the current working directory. The logs of the experiment (i.e. the stdout and stderr streams) are stored and displayed in a simple UI.
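To make this concrete, here is a sketch of what such a training script might look like. The file name and the toy “training” loop are hypothetical stand-ins for a real model; the point is that it is plain Python with ordinary command-line arguments and nothing Studio.ML-specific in it:

```python
# my_awesome_training_script.py -- a hypothetical stand-in for a real
# training script; note there is nothing Studio.ML-specific in it.
import argparse

def train(lr, epochs):
    # Toy "training": gradient descent on f(w) = (w - 3)^2,
    # standing in for fitting a real model.
    w = 0.0
    for _ in range(epochs):
        grad = 2 * (w - 3)
        w -= lr * grad
    return w

parser = argparse.ArgumentParser()
parser.add_argument("lr", type=float)
parser.add_argument("epochs", type=int)

# Simulate "python my_awesome_training_script.py 0.1 100":
args = parser.parse_args(["0.1", "100"])
print("final weight:", train(args.lr, args.epochs))
```

Because the script takes its inputs from the command line, prefixing the usual invocation with `studio run` is all the instrumentation it needs.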
If another researcher would like to re-run the same experiment, that can be done via
studio rerun <experiment_key>
Packaging experiments to be reproducible turns out to be an incredibly powerful idea. If the experiment can be reproduced on a machine of another researcher, it can also be run on a powerful datacenter machine with many GPUs or on a cloud machine. For example,
studio run --cloud=ec2 --gpus=1 my_awesome_training_script.py arg1 arg2
will run our experiment on Amazon EC2 using an instance with one GPU, just as if it were run locally. Note that to get the same result otherwise, the researcher would have to either cooperate with a DevOps engineer or learn about EC2 AMIs, instance and tenancy types, GPU driver installation, and more.
Add to that other features, such as hyper-parameter search, the use of cheap spot/preemptible cloud instances, and integration with Jupyter notebooks, and it looks like Studio.ML is out to make data scientists’ lives easier. But what about the other side of the barricades, the DevOps engineers? Studio.ML comes with serving capabilities as well, so a built model can be served with a single command line:
studio serve <experiment_key>
On the one hand, this allows for simple containerization and deployment of the built models. More importantly, simple serving enables unit and regression tests to be run by the data scientists themselves, eliminating a frequent failure mode: the model behaves well in training and validation, but serving uses slightly different preprocessing code that was never tested. Another Studio.ML feature that makes DevOps engineers’ lives easier is built-in fault-tolerant data pipelines that can run batch inference on GPUs at high rates while weeding out bad data.
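This training/serving preprocessing mismatch can often be caught with an ordinary “golden prediction” unit test. The sketch below is purely illustrative, with hypothetical function names and a toy model; in practice the prediction call would be pointed at the served model rather than a local stub:

```python
import numpy as np

def preprocess(x):
    # The single shared preprocessing function; using one implementation
    # for both training and serving is what prevents the mismatch.
    x = np.asarray(x, dtype=np.float64)
    return (x - x.mean()) / (x.std() + 1e-8)

def model_predict(features):
    # Hypothetical stand-in for the trained model (a fixed linear layer).
    weights = np.array([0.5, -0.25, 0.1, 0.05])
    return float(features @ weights)

def test_golden_prediction():
    # A "golden" input/output pair recorded at training time; re-running
    # the full preprocess + predict path must reproduce the recorded value.
    golden_input = [1.0, 2.0, 3.0, 4.0]
    golden_output = -0.4472135955  # recorded when the model was trained
    assert abs(model_predict(preprocess(golden_input)) - golden_output) < 1e-6
```

If the serving container accidentally ships a copy-pasted variant of `preprocess`, a test like this fails before the model ever reaches production.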
Under the hood, Studio.ML consists of several loosely coupled components (experiment packer, worker, queueing system, metadata and artifact storage) that can be swapped out to tailor Studio.ML to the individual needs of a project, such as custom compute farms or in-house storage of sensitive experiment data.
Studio.ML is still a fairly early-stage project, and it has been shaped substantially by the machine learning community. Even at this stage, it provides reproducible experiments with much less friction, in terms of code changes and cognitive load on data scientists, than mature services such as Google Cloud ML or Amazon SageMaker. If you are interested in the data behind this claim, see our blog post about reproducing state-of-the-art AI models using SageMaker and Studio.ML.
In summary, the modern AI/ML-driven enterprise requires a combination of industry-grade reliability and reproducibility of its ML models and the research agility to leverage and contribute to state-of-the-art data science. Studio.ML addresses these demands in a concise and non-intrusive manner. In our vision, it will continue to bridge the gap between data scientists and DevOps engineers by introducing more and more advanced ML automation features.