Netflix open sources Metaflow, a Python library with AWS integration
Netflix often releases its internal tools to the public as open source code. The latest project to join the fray is Metaflow, a “deceptively simple” Python library for data scientists. Metaflow features integration with Amazon Web Services and includes a built-in capability to snapshot all code and data into Amazon Simple Storage Service.
Netflix open sourced Metaflow, a Python library for building and managing data science projects. Its source code is available on GitHub and it is ready for public use. Metaflow was originally developed internally by the Netflix Machine Learning Infrastructure team to improve the productivity of data scientists.
It is powered by the AWS cloud and uses built-in integrations for storage and machine learning services.
Metaflow shares a number of similarities and concepts with projects such as Apache Airflow and Luigi, which developers may have experience with. For instance, with Metaflow, you can create and execute DAGs (directed acyclic graphs).
One of the key features is Metaflow’s ability to snapshot all of your code, data, and dependencies into Amazon S3 automatically.
With this, users can pick up and inspect previous workflows exactly where they left off. Using the resume command, you can resume an execution of a past run at a failed step, add new steps, and alter code.
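To make the snapshot-and-resume idea concrete, here is a minimal plain-Python sketch. It is not Metaflow's implementation or API: the function run_steps, the in-memory snapshots dictionary (standing in for a persistent store such as S3), and the step names are all hypothetical, chosen only to show why saving state after every step lets a later execution restart at the step that failed.

```python
from typing import Callable, Dict, List, Optional, Tuple

# All names here are hypothetical and for illustration only;
# this is not how Metaflow itself is implemented.
State = Dict[str, int]
Step = Tuple[str, Callable[[State], State]]


def run_steps(
    steps: List[Step],
    snapshots: Dict[str, State],
    resume_from: Optional[str] = None,
) -> State:
    """Run steps in order, snapshotting the state after each one.

    `snapshots` stands in for a persistent store such as S3.
    With `resume_from`, earlier steps are skipped and execution
    restarts from the snapshot taken just before that step.
    """
    names = [name for name, _ in steps]
    first = names.index(resume_from) if resume_from else 0
    # Restore the state produced by the step before the restart point.
    state: State = dict(snapshots[names[first - 1]]) if first > 0 else {}
    for name, func in steps[first:]:
        state = func(dict(state))
        snapshots[name] = dict(state)  # snapshot after each successful step
    return state


def load(state: State) -> State:
    return {**state, "rows": 10}


def broken_double(state: State) -> State:
    raise RuntimeError("simulated bug in this step")


def double(state: State) -> State:
    return {**state, "doubled": state["rows"] * 2}


store: Dict[str, State] = {}
try:
    # First attempt: "load" succeeds and is snapshotted, "double" fails.
    run_steps([("load", load), ("double", broken_double)], store)
except RuntimeError:
    pass

# Second attempt: fix the code and restart at the failed step only.
result = run_steps([("load", load), ("double", double)], store, resume_from="double")
print(result)  # {'rows': 10, 'doubled': 20}
```

In the second call the load step is never executed again; its output comes from the stored snapshot, which is the behavior the resume command is described as providing.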
Let’s have a look at the code structure itself. From the documentation:
Metaflow follows the dataflow paradigm which models a program as a directed graph of operations. This is a natural paradigm for expressing data processing pipelines, machine learning in particular.
We call the graph of operations a flow. You define the operations, called steps, which are nodes of the graph and contain transitions to the next steps, which serve as edges.
External Python libraries
Python has become one of the standard languages for data science, and so it has a number of commonly used libraries. Metaflow allows importing external libraries, and its @conda decorator lets a flow declare the libraries it depends on, so all common machine learning frameworks are supported.
You can use all of the data science libraries that you are already familiar with, such as PyTorch and TensorFlow. Simply write your models using idiomatic Python code.
Over the years, Netflix has been one of the biggest powerhouses using AWS, directing most member traffic to AWS-supported software. Thus, it is no surprise that Metaflow supports Amazon Web Services as its remote backend, relying on its tried and true services.
From the documentation, these are the current integrations (and two that will come in the future):
With its powerful built-in S3 client, Metaflow can load data at up to 10 Gbps. No code changes are required to use it with AWS.
Read the documentation regarding Metaflow on AWS.
If you do not have an AWS account, Metaflow provides a hosted sandbox environment for data testing. You can use this sandbox for testing out tutorials and evaluating computation with AWS Batch. Request access to the sandbox mode here. Only a limited number of sandboxes can exist at the same time, so you will be added to a waitlist after signing in with your GitHub account.
Test it out with introductory tutorials. Go over its local machine capabilities and graduate to learning about AWS configuration and monitoring flows.
You can install Metaflow from PyPI using:
pip install metaflow
A community chat exists on Gitter for all your questions and discussions.