TensorFlow 1.7 boasts TensorRT integration for optimal speed
TensorFlow 1.7 has just arrived. We take a look at one of the cool new features in the latest release: full integration with TensorRT! What does that mean for our favorite machine learning project? Faster performance, for one thing.
It hasn’t even been a month since the previous TensorFlow release, but the internet’s most popular machine learning project has a lot going on. The TensorFlow team has announced full integration with TensorRT as of TensorFlow 1.7, which dropped yesterday.
For those who haven’t used it before, TensorRT is a library that optimizes deep learning models for inference and creates a runtime for deployment on GPUs in production environments. TensorFlow will be able to take advantage of many of TensorRT’s FP16 and INT8 optimizations. In particular, TensorRT is especially efficient at maximizing throughput and minimizing latency during inference on GPUs by automatically selecting the most capable platform-specific kernels.
Developers should see substantial speedups from the integrated workflow. When the TensorFlow team tested it, they found that ResNet-50 performed 8x faster at under 7 ms latency with the TensorFlow-TensorRT integration on NVIDIA Volta Tensor Cores, as compared with running TensorFlow alone.
So, how does it work? TensorRT is a C++ library from NVIDIA that helps facilitate high-performance inference on NVIDIA GPUs. TensorRT optimizes network definitions by merging tensors and layers, transforming weights, choosing efficient intermediate data formats, and selecting from a large kernel catalog based on layer parameters and measured performance.
It also includes a number of import methods out of the box that help developers bring trained deep learning models into TensorRT to optimize and run. Helpfully, there is an optimization tool that determines the fastest implementation of a particular model based on graph optimization and layer fusion.
There’s also a runtime that developers can use to execute a TensorRT network in an inference context. As an optional optimization, it also lets developers leverage the high-speed reduced-precision capabilities of Pascal and Volta GPUs.
In practice, TensorRT optimizes compatible sub-graphs and lets TensorFlow execute the rest. Developers are able to use the extensive TensorFlow feature set to rapidly develop their models while still getting optimal performance from TensorRT.
Previously, users who tried to integrate the two sometimes ran into unsupported TensorFlow layers, which had to be imported manually. Now, all developers need to do is ask TensorRT to optimize TensorFlow’s sub-graphs; each supported sub-graph is replaced with a TensorRT-optimized node.
In TensorFlow 1.7, TensorFlow will execute the graph for all supported areas before calling on TensorRT to execute the TensorRT optimized nodes. Let’s say your model has 3 parts (A, B, and C) and Part B is optimized by TensorRT and replaced with a single node. When TensorFlow runs inference on this model, TensorFlow executes A, then calls on TensorRT to execute B, and then TensorFlow executes C.
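The hand-off described above can be sketched in plain Python. This is a toy illustration of the execution order only, not the actual TensorFlow/TensorRT API; the function names and operations are hypothetical stand-ins for the three parts of the model.

```python
# Toy sketch of partitioned execution: a model split into parts A, B, C,
# where only part B was compatible with TensorRT and was replaced by a
# single optimized node. TensorFlow runs A, hands off to TensorRT for B,
# then runs C.

def part_a(x):
    # executed by TensorFlow
    return x + 1

def trt_optimized_b(x):
    # stands in for the single TensorRT-optimized node that replaced sub-graph B
    return x * 2

def part_c(x):
    # executed by TensorFlow
    return x - 3

def run_inference(x):
    # Execution order mirrors the description: A -> (TensorRT) B -> C
    return part_c(trt_optimized_b(part_a(x)))

print(run_inference(5))  # ((5 + 1) * 2) - 3 = 9
```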
Additionally, the newly added TensorFlow API for TensorRT optimization can take the frozen TensorFlow graph, apply optimizations to its sub-graphs, and send the graph back to TensorFlow with all the changes and optimizations applied.
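In TensorFlow 1.7 this API lives in `tf.contrib.tensorrt`; a minimal sketch of optimizing a frozen graph might look like the following. Note that `frozen_graph_def` and the output node names are placeholders you would supply from your own model, and running this requires a GPU build of TensorFlow 1.7 with TensorRT installed.

```python
import tensorflow.contrib.tensorrt as trt

# Optimize the TensorRT-compatible sub-graphs of a frozen TensorFlow graph.
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph_def,   # your frozen GraphDef (placeholder here)
    outputs=["logits"],                 # hypothetical output node name
    max_batch_size=16,                  # largest batch size used at inference
    max_workspace_size_bytes=1 << 30,   # GPU memory TensorRT may use
    precision_mode="FP16")              # or "FP32" / "INT8"
```

The returned `trt_graph` is a regular GraphDef in which each supported sub-graph has been replaced by a single TensorRT-optimized node, so it can be loaded and run with the usual TensorFlow session machinery.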
TensorRT is a part of the TensorFlow 1.7 release. To get the new solution, you can use the standard pip install process:
pip install tensorflow-gpu==1.7.0