Fast-track to deploying machine learning models with OctoML CLI and NVIDIA Triton Inference Server

Sameer Farooqui
André Kang-Moeller, Staff Software Engineer, OctoML

Jun 21, 2022

In machine learning, a model is trained once but inference is run millions of times. The past decade has seen impressive progress in architecting and training deep learning models, mostly driven by research laboratories and a few large tech companies. However, successful production deployment of these state-of-the-art models for inference has remained elusive for most companies. Converting trained models into hardware-optimized code with live inference endpoints requires skills outside the domain of traditional data scientists, machine learning engineers, or software engineers.

Today we are excited to introduce the OctoML CLI, a Command Line Interface that automates the hardest parts of deploying deep learning models: model containerization and acceleration. The benefits of being unencumbered from these two challenges are significant. Containers let you run your model in a portable way on any Docker or Kubernetes infrastructure in any cloud or data center, and hardware-specific model acceleration makes running deep learning models at scale feasible by massively reducing inference costs. One of the key technologies that ties our containerization and acceleration together is NVIDIA Triton Inference Server.

What is Triton Inference Server?

Triton Inference Server Diagram (courtesy of NVIDIA)

In this blog, we discuss a core component of OctoML containers: NVIDIA Triton Inference Server. Triton is open-source software for running inference on models created in any framework, on GPU or CPU hardware, in the cloud or on edge devices. Triton lets remote clients request inference over gRPC and HTTP/REST protocols through Python, Java, and C++ client libraries. As a universal inference server, Triton helps users avoid lock-in to a specific framework by providing a consistent inference interface for client applications regardless of training framework or target hardware. This allows data science teams to choose their own tools without worrying about how their models will be deployed, or how intelligent applications will infer against them.
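For example, a client application can request a prediction from a running Triton endpoint with the Python HTTP client. The sketch below is illustrative only: the model name, input names, and shapes are assumptions borrowed from the style transfer example at the end of this post, and the server is assumed to be listening on Triton's default HTTP port 8000.

import numpy as np
import tritonclient.http as httpclient

# Connect to the Triton endpoint (default HTTP/REST port is 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Prepare the two FP32 image inputs the style transfer model expects.
content = httpclient.InferInput("placeholder:0", [1, 256, 256, 3], "FP32")
style = httpclient.InferInput("placeholder_1:0", [1, 256, 256, 3], "FP32")
content.set_data_from_numpy(np.random.rand(1, 256, 256, 3).astype(np.float32))
style.set_data_from_numpy(np.random.rand(1, 256, 256, 3).astype(np.float32))

# Run inference and read back the first output tensor by name.
result = client.infer(model_name="magenta_image_stylization", inputs=[content, style])
output_name = result.get_response()["outputs"][0]["name"]
print(result.as_numpy(output_name).shape)

The gRPC client in tritonclient.grpc follows a very similar pattern, against Triton's default gRPC port 8001.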

NVIDIA Triton is the top choice for AI inference and model deployment for workloads of any size, across all major industries worldwide. Its portability, versatility and flexibility make it an ideal companion for the OctoML platform.

Shankar Chandrasekaran, Product Marketing Manager, NVIDIA

How OctoML CLI and Triton work together to fast track your ML deployments

Last year, as the OctoML engineering team began evaluating containerization and inference server solutions to integrate into our platform, Triton quickly emerged as a perfect fit. Both OctoML and Triton share a singular goal: enabling the rapid deployment of AI models into production. Triton offers a mature, hardware-agnostic inference engine with enterprise features such as Prometheus metrics. OctoML CLI makes it easy to deploy Triton containers with acceleration; regardless of the model format you start from, OctoML CLI gives you the most optimized model runtime container ready for production deployment. We do this by bringing four key benefits to your ML deployment workflow: automated containerization, model acceleration for cloud spend reduction, a custom Apache TVM backend for Triton, and automation of advanced Triton features like Compose for slimmer containers. Together, OctoML CLI and NVIDIA Triton eliminate weeks of engineering effort by packaging trained models into deployable containers that run on any cloud; managed Kubernetes services such as Amazon EKS, Azure AKS, and Google GKE are all supported.

Automated Workflow for Acceleration with Intelligent Packaging

Why should you use OctoML CLI to generate your Triton containers? The key benefit is an automated process that does the heavy lifting of accelerating your trained model for your specific hardware and packaging it into the most efficient container possible, without requiring you to become a Triton power user. Once you select the hardware targets you're interested in as deployment candidates and submit your model to OctoML CLI, OctoML kicks off a series of steps:

  • First, we’ll optimize your model for each hardware target using multiple accuracy-preserving acceleration approaches in parallel, such as Apache TVM, ONNX Runtime (with multiple execution providers including TensorRT and OpenVINO), different data layouts, and more.
  • Next, OctoML compares the benchmark performance of the candidate models discovered in the managed sweep above and chooses the best-performing model and its partner acceleration engine.
  • Finally, OctoML generates a custom, slim Triton container with the fastest model, an auto-generated model configuration (config.pbtxt), and only the one runtime backend needed to execute your model on the target hardware (a sketch of such a configuration follows this list).
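To make the last step concrete, here is a hand-written sketch of what a config.pbtxt for the style transfer model could look like if ONNX Runtime were the winning engine. The field names are standard Triton model configuration; the output name and the chosen backend are illustrative assumptions, not necessarily what OctoML CLI generates for this model.

name: "magenta_image_stylization"
backend: "onnxruntime"
max_batch_size: 0
input [
  {
    name: "placeholder:0"
    data_type: TYPE_FP32
    dims: [ 1, 256, 256, 3 ]
  },
  {
    name: "placeholder_1:0"
    data_type: TYPE_FP32
    dims: [ 1, 256, 256, 3 ]
  }
]
output [
  {
    # Illustrative output name; the real name comes from the exported model.
    name: "output_0"
    data_type: TYPE_FP32
    dims: [ 1, 256, 256, 3 ]
  }
]

With max_batch_size set to 0, the dims above describe the full tensor shapes, including the batch dimension.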

OctoML CLI takes the guesswork out of picking the best acceleration engine and hardware target for your model. By sharing insights on inference latency, throughput, and cost, we empower you to make the tradeoffs that fit your business requirements and SLAs. We also fully automate the process of generating an optimized Triton container serving an accelerated model. All that's left is deploying the container to a Kubernetes cluster, which is easy with the scripts we provide.
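To give a sense of what that deployment amounts to, a minimal Kubernetes Deployment for a generated Triton container might look like the sketch below. The image name and replica count are placeholders, and the manifests our scripts produce may differ; the ports are Triton's defaults for HTTP, gRPC, and Prometheus metrics.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: octoml-triton
spec:
  replicas: 1
  selector:
    matchLabels:
      app: octoml-triton
  template:
    metadata:
      labels:
        app: octoml-triton
    spec:
      containers:
        - name: triton
          # Placeholder image name; use the container OctoML CLI built for you.
          image: your-registry/magenta-image-stylization:latest
          ports:
            - containerPort: 8000   # HTTP/REST
            - containerPort: 8001   # gRPC
            - containerPort: 8002   # Prometheus metrics

A Service (or an ingress/load balancer on EKS, AKS, or GKE) in front of port 8000 or 8001 then gives client applications a stable inference endpoint.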

Custom Apache TVM Backend for NVIDIA Triton

One of the aspects we love about Triton is its extensibility. Out of the box, Triton supports several backends to execute inference against a model, such as PyTorch, TensorFlow, TensorRT, ONNX Runtime, and OpenVINO. To enable additional performance gains and cloud savings on Triton, we developed a new Apache TVM backend. Apache TVM, which was created by the founders of OctoML, is one of the core engines in the OctoML SaaS platform. With the new TVM backend, Triton users can also use a deep learning compiler stack to explore model optimizations and deploy TVM-optimized models with Triton. This new acceleration engine gives Triton adopters access to entirely new acceleration methods, with direct implications for reduced latency and cost.
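In practice, a model packaged to use this backend is selected the same way any other Triton backend is: through the backend field of its config.pbtxt. The snippet below is a sketch only, and the backend identifier shown is an assumption for illustration rather than a documented value.

name: "magenta_image_stylization"
# Assumed identifier for the Apache TVM backend; other fields omitted for brevity.
backend: "tvm"

Everything else about serving the model (client protocols, metrics, model repository layout) stays the same, which is exactly the portability benefit of Triton's backend abstraction.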

OctoML CLI and Triton Compose Generate Slimmer Container Images

A major benefit of using OctoML CLI and Triton to generate container images is significantly smaller image sizes than standard, unoptimized Triton containers, which weigh in at 12 GB. The OctoML CLI packaging workflow leverages Triton’s Compose feature to intelligently include only the backend and dependencies (like CUDA) needed to execute your accelerated model, reducing the footprint of the container you deploy and manage. This has several advantages, such as reduced infrastructure costs and a smaller attack surface. Our CPU-only images are as small as 440 MB!
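Triton's Compose feature is exposed through the compose.py utility in the Triton server repository, which assembles a minimal image containing only the backends you ask for. Run by hand, an invocation looks roughly like the line below; the backend choice and output image name are illustrative, and OctoML CLI performs this step for you with the single backend your accelerated model actually needs.

python3 compose.py --backend onnxruntime --output-name triton-onnxruntime-slim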

For example, here are the container sizes for one style transfer model packaged into custom Triton containers with OctoML CLI and Triton Compose using express mode, for various hardware targets (the reference octoml.yaml file used in express mode appears at the end of this blog):

[Image: container image sizes by hardware target]

Learn more about OctoML CLI and Triton Inference Server

Attend or watch our session this week at Amazon re:MARS 2022 titled “OctoML: Accelerated machine learning deployment on AWS (AUT203)”. If you’re attending re:MARS June 21 - 24 in Las Vegas, stop by our booth and say hi!

Get hands on with OctoML CLI using our tutorials on GitHub. Test drive deploying computer vision and NLP models to containers in Amazon EKS, Azure AKS, or Google GKE and run inference against Triton Inference Server.


Reference octoml.yaml used for the express mode packaging above

---
# Express mode packaging manifest for the style transfer model.
models:
  magenta_image_stylization:
    # Archive containing the TensorFlow SavedModel to package.
    path: magenta_arbitrary-image-stylization-v1-256_2.tar.gz
    inputs:
      # First image input: 1 x 256 x 256 x 3, float32.
      "placeholder:0":
        shape:
          - 1
          - 256
          - 256
          - 3
        dtype: fp32
      # Second image input: 1 x 256 x 256 x 3, float32.
      "placeholder_1:0":
        shape:
          - 1
          - 256
          - 256
          - 3
        dtype: fp32
    type: tensorflowsavedmodel
