LinkedIn’s new open source framework offers native support for TensorFlow on Hadoop
When readily available tools won’t cut it, build your own! That’s exactly what LinkedIn did to natively run TensorFlow on Apache Hadoop, and the result, TonY, is now open source. Let’s have a look at what’s under this framework’s hood!
How can you bridge the gap between the analytic powers of distributed TensorFlow and Apache Hadoop’s scaling powers? The answer is simple: TonY.
TonY: TensorFlow on YARN
The story goes like this: at one point, LinkedIn decided that running TensorFlow on “small and unmanaged ‘bare metal’ clusters” wouldn’t cut it anymore, so they needed “a scalable way” to process all their data. Even though the framework designed by Google supports distributed training, orchestrating distributed TensorFlow is a real challenge, according to a recent blog post by LinkedIn’s Jonathan Hung, Keqiu Hu, and Anthony Hsu. Not to mention that this had to be done manually.
We wanted a flexible and sustainable way to bridge the gap between the analytic powers of distributed TensorFlow and the scaling powers of Hadoop.
Enter TonY, a framework to natively run TensorFlow on Apache Hadoop. TonY enables running either single node or distributed TensorFlow training as a Hadoop application. If it looks familiar, that’s because it draws inspiration from Yahoo’s TensorFlow on Spark, Intel’s TensorFlowOnYARN project, and this pull request in the TensorFlow ecosystem repo.
In case you’re asking yourself why LinkedIn didn’t use TensorFlow on Spark or TensorFlowOnYARN, that’s because both solutions have their flaws, not to mention that the latter is no longer maintained. Unlike these options, TonY gives them “complete control over the resources in our Hadoop clusters”:
Similar to how MapReduce provides the engine for running Pig/Hive scripts on Hadoop, and Spark provides the engine for running Scala code that uses Spark APIs, TonY aims to provide the same first-class support for running TensorFlow jobs on Hadoop by handling tasks such as resource negotiation and container environment setup.
TonY has three main components: Client, ApplicationMaster, and TaskExecutor. It is built using Gradle.
To build TonY, run the Gradle build from the project root. The jar required to run TonY will be located in the build output directory.
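Since TonY is a Gradle project, the build invocation is most likely the standard Gradle wrapper command; this is a sketch based on Gradle conventions rather than the project’s own documentation, so check the TonY README for the exact command and jar path.

```
# From the TonY project root (assumes the standard Gradle wrapper is checked in)
./gradlew build

# Gradle places built jars under a module's build/libs directory by convention,
# e.g. <module>/build/libs/ — the exact module and jar name depend on the project
```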
Getting started with TonY
- The user submits TensorFlow model training code, submission arguments, and their Python virtual environment (containing the TensorFlow dependency) to Client.
- Client sets up the ApplicationMaster (AM) and submits it to the YARN cluster.
- AM does resource negotiation with YARN’s Resource Manager based on the user’s resource requirements (number of parameter servers and workers, memory, and GPUs).
- Once AM receives allocations, it spawns TaskExecutors on the allocated nodes.
- TaskExecutors launch the user’s training code and wait for its completion.
- The user’s training code starts, and TonY heartbeats periodically between the TaskExecutors and the AM to check liveness.
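Once a TaskExecutor launches the training code, the TensorFlow process needs to know its role in the cluster (parameter server or worker) and the addresses of its peers. TensorFlow’s documented convention for this is the `TF_CONFIG` environment variable; the sketch below shows how a training script could read it. Whether TonY uses `TF_CONFIG` specifically or another mechanism to pass cluster information is an assumption here, and the host names are made up for illustration.

```python
import json
import os

# Hypothetical cluster spec, in the shape a YARN-side launcher could hand
# to the training process via TensorFlow's standard TF_CONFIG variable.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "ps": ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    },
    "task": {"type": "worker", "index": 1},
})

def parse_tf_config():
    """Return (cluster_spec, task_type, task_index) from TF_CONFIG."""
    config = json.loads(os.environ["TF_CONFIG"])
    task = config["task"]
    return config["cluster"], task["type"], task["index"]

cluster, task_type, task_index = parse_tf_config()
print(task_type, task_index)   # which role this container plays
print(len(cluster["worker"]))  # number of workers in the job
```

A distributed TensorFlow script uses exactly this information to decide whether to run the parameter-server loop or the training loop, which is why the AM negotiates the parameter-server and worker counts up front.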
TonY also implements various features to improve the experience of running large-scale training, such as GPU scheduling, TensorBoard support, fine-grained resource requests, and fault tolerance.
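Fine-grained resource requests in the Hadoop ecosystem are typically expressed through XML property files, so a TonY job configuration plausibly looks something like the fragment below. The property names (`tony.worker.instances`, `tony.worker.gpus`, and so on) are assumptions modeled on Hadoop-style configuration; consult the TonY repository for the actual keys.

```
<configuration>
  <!-- Hypothetical example: request two workers, each with one GPU -->
  <property>
    <name>tony.worker.instances</name>
    <value>2</value>
  </property>
  <property>
    <name>tony.worker.gpus</name>
    <value>1</value>
  </property>
  <!-- and one parameter server -->
  <property>
    <name>tony.ps.instances</name>
    <value>1</value>
  </property>
</configuration>
```

Per-task-type requests like these are what let the AM translate a training job’s needs into YARN container allocations, including GPU scheduling.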