Custom Apache TVM Backend for NVIDIA Triton
One of the aspects we love about Triton is its extensibility. Out of the box, Triton supports several backends for executing inference against a model, including PyTorch, TensorFlow, TensorRT, ONNX Runtime, and OpenVINO. To enable additional performance gains and cloud savings on Triton, we developed a new Apache TVM backend. Apache TVM, created by the founders of OctoML, is one of the core engines in the OctoML SaaS platform. With the new TVM backend, Triton users can apply a deep learning compiler stack to explore model optimizations and deploy TVM-optimized models with Triton. Unlocking this new acceleration engine gives Triton adopters access to entirely new acceleration methods, with direct implications for reduced latency and cost.
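To give a sense of how a Triton backend is wired up, a model in the Triton model repository declares its backend in a `config.pbtxt` file. The sketch below is illustrative only: the model name, tensor names, shapes, and the `tvm` backend identifier are assumptions, not the exact configuration used by the OctoML backend.

```
# Illustrative config.pbtxt -- backend name and tensor details are assumptions.
name: "style_transfer_tvm"
backend: "tvm"            # select the custom TVM backend (name is hypothetical)
max_batch_size: 8
input [
  { name: "input",  data_type: TYPE_FP32, dims: [ 3, 224, 224 ] }
]
output [
  { name: "output", data_type: TYPE_FP32, dims: [ 3, 224, 224 ] }
]
```

Aside from the `backend` field, this looks like any other Triton model configuration, which is exactly the point: clients talk to the same Triton endpoints regardless of which engine executes the model underneath.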
OctoML CLI and Triton Compose Generate Slimmer Container Images
A major benefit of using the OctoML CLI with Triton to generate container images is that the results are significantly smaller than the standard, unoptimized Triton container, which is 12 GB. The OctoML CLI packaging workflow leverages Triton’s Compose feature to intelligently include only the backend and dependencies (such as CUDA) needed to execute your accelerated model, reducing the footprint of the container you deploy and manage. This has several advantages, including reduced infrastructure costs and a smaller attack surface. Our CPU-only images are as small as 440 MB!
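For readers who want to try Compose directly, Triton ships a `compose.py` script in the triton-inference-server/server repository that assembles a minimal image from a chosen set of backends. The invocation below is a sketch: flag names can vary between Triton releases, so check `python3 compose.py --help` in your checkout before relying on them.

```shell
# Sketch: build a slimmed Triton image containing only one backend.
# Run from a clone of the triton-inference-server/server repository;
# verify flag names against your Triton version first.
python3 compose.py \
    --backend onnxruntime \
    --output-name tritonserver_slim \
    --dry-run   # print the generated Dockerfile for inspection instead of building
```

Dropping `--dry-run` performs the actual Docker build, producing an image that carries only the requested backend and its dependencies rather than the full multi-backend stack.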
For example, here are the container sizes for a single style transfer model packaged into custom Triton containers with the OctoML CLI and Triton Compose in express mode (see the end of this blog for a reference OctoML YAML file used in express mode), across various hardware targets:
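As a rough orientation before the table, an express-mode input file for the OctoML CLI is a short YAML document that points at the model to package. The field names in this sketch are assumptions for illustration; the reference YAML at the end of the blog shows the exact schema.

```
# Illustrative express-mode input -- field names are assumptions,
# see the reference OctoML YAML file at the end of the blog.
models:
  style_transfer:
    path: ./style_transfer.onnx
```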