10 tips for OctoML CLI power users to fast-track your model deployments

Jul 29, 2022

Last month, we released the OctoML CLI, a free command line interface to package your deep learning models into Docker containers bundled with NVIDIA Triton Inference Server. The OctoML CLI makes it easy and fast to deploy your trained models into production on CPUs or GPUs in any cloud environment. For cost- or speed-sensitive deployments, you can request an OctoML SaaS Platform token to accelerate your models with a variety of acceleration engines, such as Apache TVM, ONNX Runtime, TensorRT, OpenVINO and more.

If you’re just getting started with OctoML CLI, check out our short 8-minute developer demo and operations demo to see OctoML CLI in action. Then explore our end-to-end examples on GitHub to start building your first intelligent applications with OctoML.

The OctoML CLI is a sophisticated and powerful tool for fast-tracking your machine learning deployments. As we’ve worked with our early customers, we’ve collected our favorite tips for getting the most out of your accelerated ML containers.

Here are ten tips for power users to get even more out of the OctoML CLI:

1. Keep track of images built by OctoML

The OctoML CLI names your Docker images [model name]-[target hardware]:latest (this follows the Docker convention of naming images as repository:tag). For example, if your model is named my_model and you are building and deploying on a GCP Skylake instance (using the optional SaaS acceleration feature), then your built image will be named my_model-skylake:latest.

If you’re building and deploying locally without requesting acceleration, then this will look like my_model-local:latest.
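
To list only the images the CLI has built for a given model, you can filter on this naming convention. A quick sketch, assuming the my_model example above:

# List every image whose repository name starts with "my_model-"
docker images --filter "reference=my_model-*"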

2. Leverage OctoML’s automatic Docker hygiene

Let’s say you changed your underlying model or some other parameters – which will result in a new image being built – but you kept the same model name in your workspace configuration file. The image that will be built will have the same name as your previous version, which can lead to poor Docker hygiene.

To address this, the OctoML CLI will clean up any old images that have conflicting names. If you’d like to keep those old versions around, the CLI will prompt you for a different tag to use on the old image. Set the environment variable OCTOML_SKIP_TAGGING=1 to skip the prompt and clean up conflicting images automatically.

$ docker images
REPOSITORY                    TAG                     IMAGE ID       CREATED          SIZE
my_model-local                latest                  fe3b8dbb205a   12 minutes ago   xxxGB

$ octoml package
<...snip...>
Please enter a version tag if you would like to retain the previous image, otherwise enter `skip`:  newtag
<...snip...>

$ docker images
REPOSITORY                    TAG                     IMAGE ID       CREATED          SIZE
my_model-local                latest                  05ed31e431cf   2 minutes ago    xxxGB
my_model-local                newtag                  fe3b8dbb205a   13 minutes ago   xxxGB

Notice in the example above that the old image ID is now tagged with the requested newtag, while the newly built image ID is tagged with latest.

3. Get the most out of OctoML’s caching feature

The OctoML CLI caches any downloaded models, packages, and base images for your convenience and to speed up future OctoML CLI runs. However, if there are naming conflicts with new or updated models, newly built packages or images may not always get updated in the cache. If this occurs, one way to fix it is to clear out the cached items:

docker rmi <image ID>
octoml clean -a

The docker rmi command un-tags and removes the image from Docker’s cache, while octoml clean clears the CLI cache.

If you need the image ID, first run docker images to list all images.
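
For example, the two steps can be combined into a single cleanup command. A sketch, assuming the my_model-local image from Tip #1 exists locally:

# Remove all cached image IDs for this model, then clear the OctoML CLI cache
docker rmi $(docker images -q my_model-local) && octoml clean -a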

4. Do cool things with magic environment variables

For automation or other neat tricks, these environment variables may be useful (see the sketch after this list):

  • OCTOML_AGREE_TO_TERMS=1 Automatically agree to OctoML terms of service and privacy policy without being prompted.

  • OCTOML_TELEMETRY=false Opt out of telemetry, which we collect on CLI usage to improve the tool for all users.

  • OCTOML_KILL_EXISTING_DOCKER_CONTAINER=1 If you have a container running that was previously deployed by the OctoML CLI, we prompt you to confirm whether you want that container killed before deploying a new container. This flag lets you deploy any newly-built images without being prompted.

  • OCTOML_SKIP_TAGGING=1 If you have old images that will run into a naming conflict from a previous run of the OctoML CLI, we prompt you to tag that old image if you’d like to keep it around. Use this flag to automatically clean up (delete) any old images that have conflicting names without being prompted.

  • OCTOML_ACCESS_TOKEN=<your_token_here> Use your OctoML SaaS Platform token to accelerate your model.
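
For example, a non-interactive CI job might export several of these before invoking the CLI. A minimal sketch (the token value is a placeholder):

# Run the OctoML CLI non-interactively in a CI pipeline
export OCTOML_AGREE_TO_TERMS=1
export OCTOML_TELEMETRY=false
export OCTOML_SKIP_TAGGING=1
export OCTOML_ACCESS_TOKEN=<your_token_here>

octoml package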

5. Interact with your model and Triton on the deployed container

Once you have a container running via octoml deploy (or via octoml package followed by docker run), you may want to inspect the packaged model inside that container. The docker exec command lets you invoke commands in the running container. By default, the packaged model is located in the container under octoml/models/.

$ docker exec -it <container id> /bin/bash 
root@272a53012852:/opt/tritonserver# cd octoml/models/
root@272a53012852:/opt/tritonserver/octoml/models# ls
my_model

The -i option enables interactive mode to keep STDIN open even if not attached, and -t allocates a pseudo-TTY.

If you need to find your running container ID (which is different from the image ID), you can run docker container ls:

$ docker container ls
CONTAINER ID   IMAGE          COMMAND                  CREATED       STATUS       PORTS                              NAMES
5c7ba68823e4   50196694baee   "/opt/nvidia/nvidia_…"   4 hours ago   Up 4 hours   0.0.0.0:8000-8002->8000-8002/tcp   tedious-brake

A common and useful command to explore from within the container is tritonserver, which lets you set dozens of configuration options and settings for Triton Inference Server (note: run this command after you’ve already used docker exec to enter the container). Run this command to learn more about the available configurations:

$ /opt/tritonserver/bin/tritonserver --help
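
You can also run one-off commands without opening an interactive shell. A quick sketch, using the same container ID placeholder as above:

# List the packaged model(s) non-interactively
docker exec <container id> ls /opt/tritonserver/octoml/models

# Print Triton's available options directly
docker exec <container id> /opt/tritonserver/bin/tritonserver --help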

6. Get set up for acceleration like a pro

octoml init and octoml setup acceleration launch interactive wizards that help you set up the project configuration YAML file for acceleration, including submitting your access token, selecting hardware targets of interest, and specifying inputs for dynamic models where applicable. These wizards automatically generate well-formed project configuration files on your behalf.

Your project configuration is primarily defined in the octoml.yaml file (found in the same folder where you ran the octoml init command). Here, you define the model you’d like to accelerate and package, as well as any hardware targets, inputs, etc. Read on if you’d like to understand how to write your own config from scratch, which you may need if you plan to fully automate your model deployment for CI/CD pipelines.

First, a bit about the terms we’re using: a dynamic model is a model where some input dimensions (e.g. batch size, input channels) are not specified until runtime. For a static model, those dimensions are known ahead of time or otherwise specified within the model. An example of a dynamic model is YOLOv3 because you have to specify image size and batch size. ResNet50 is an example of a static model since all input dimensions, including batch size, are predefined.

When you request acceleration on a dynamic model, you are required to define your input dimensions. Input dimensions that need to be defined vary for each model. Examples of input dimensions include batch size, image size, number of channels, number of tokens, vocabulary size, embedding length, sequence length, etc.

You do not need to define input dimensions for static models, unless you want to override the default shapes and dimensions. There is one exception: TorchScript models. In that case, you must manually specify inputs for both static and dynamic models, and ensure that every input name is suffixed with __#, where the number represents the order in which you defined the inputs in your PyTorch model, indexed starting from 0. Output shapes must be specified in the same way. Here is an example of a configuration file for a TorchScript model that is being packaged without acceleration:

---
models:
  torchscript_mnist:
    url: https://storage.googleapis.com/prod-public-models/torchscript_mnist.pt
    inputs:
      x__0:
        shape:
          - 1
          - 1
          - 28
          - 28
        dtype: fp32
    outputs:
      y__0:
        shape:
          - 1
        dtype: fp32
    type: torchscript

You are welcome to use this as a template to write your own configuration. However, we strongly recommend starting with the setup wizards (i.e. octoml init and octoml setup acceleration) because it means you will not have to worry about any of these formatting details for any models, TorchScript or otherwise. After the setup wizard has created your custom YAML file, you can review it to understand how to generate a similarly formatted file programmatically.
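
If you do need a fully scripted pipeline, the configuration file can also be written programmatically. A minimal sketch that simply reproduces the TorchScript example above with a shell heredoc:

# Write the project configuration from a script (contents mirror the example above)
cat > octoml.yaml <<'EOF'
---
models:
  torchscript_mnist:
    url: https://storage.googleapis.com/prod-public-models/torchscript_mnist.pt
    inputs:
      x__0:
        shape:
          - 1
          - 1
          - 28
          - 28
        dtype: fp32
    outputs:
      y__0:
        shape:
          - 1
        dtype: fp32
    type: torchscript
EOF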

7. Use cURL to inspect and test the OctoML container

Triton Inference Server’s tritonserver process is the inference server executable; by default it listens for HTTP requests on port 8000 and for gRPC requests on port 8001.

You can use cURL to make HTTP requests to port 8000 and jq to process the JSON responses.
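
Before querying model metadata, it can be useful to confirm the server is up. A quick sketch using Triton's standard health endpoints (assuming the default port mapping):

# Both endpoints return HTTP 200 when the server is live and ready to serve requests
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/live
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready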

Get names and versions of all loaded models:

$ curl -s -X POST localhost:8000/v2/repository/index | jq '.'

[
  {
    "name": "fmnist",
    "version": "1",
    "state": "READY"
  },
  {
    "name": "simple_int8"
  },
  {
    "name": "simple_string",
    "version": "1",
    "state": "READY"
  }
]

Get configuration for a loaded model, including input shapes:

$ curl -s localhost:8000/v2/models/fmnist | jq '.'
{
  "name": "fmnist",
  "versions": [
    "1"
  ],
  "platform": "tensorflow_savedmodel",
  "inputs": [
    {
      "name": "conv2d_13_input",
      "datatype": "FP32",
      "shape": [
        -1,
        28,
        28,
        1
      ]
    }
  ],
  "outputs": [
    {
      "name": "dense_21",
      "datatype": "FP32",
      "shape": [
        -1,
        10
      ]
    }
  ]
}

Get statistics of a loaded model:

$ curl -s localhost:8000/v2/models/fmnist/stats | jq '.'
{
  "model_stats": [
    {
      "name": "fmnist",
      "version": "1",
      "last_inference": 0,
      "inference_count": 0,
      "execution_count": 0,
      "inference_stats": {
        "success": {
          "count": 0,
          "ns": 0
        },
        "fail": {
          "count": 0,
          "ns": 0
        },
        "queue": {
          "count": 0,
          "ns": 0
        },
        "compute_input": {
          "count": 0,
          "ns": 0
        },
        "compute_infer": {
          "count": 0,
          "ns": 0
        },
        "compute_output": {
          "count": 0,
          "ns": 0
        },
        "cache_hit": {
          "count": 0,
          "ns": 0
        },
        "cache_miss": {
          "count": 0,
          "ns": 0
        }
      },
      "batch_stats": []
    }
  ]
}

Get configuration of the triton server:

$ curl -s localhost:8000/v2 | jq '.' 
{
  "name": "triton",
  "version": "2.20.0",
  "extensions": [
    "classification",
    "sequence",
    "model_repository",
    "model_repository(unload_dependents)",
    "schedule_policy",
    "model_configuration",
    "system_shared_memory",
    "cuda_shared_memory",
    "binary_tensor_data",
    "statistics",
    "trace"
  ]
}

8. Try advanced shell tricks with the Triton cURL interface

You can chain jq and shell commands together to build useful tools. The following is a one-line command to get the queue time (in nanoseconds) per request for a model (substitute your model’s name for <your_model_name>).

$ curl -s localhost:8000/v2/models/<your_model_name>/stats | jq '.model_stats[].inference_stats.queue' | jq -c '{queue_time_ns_per_request: (.ns / .count), count: .count }'

Note: to pull out these statistics, you must first run inference against the model. This populates the count field; otherwise, you may get a divide-by-zero error from jq.

You can even define a shell alias for commonly used tasks:

alias tritonindex="curl -s -X POST localhost:8000/v2/repository/index | jq '.'"

Then it’s possible to run it directly:

$ tritonindex 
{
  "name": "simple_string",
  "version": "1",
  "state": "READY"
}
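
You can take this a step further and wrap the queue-time one-liner in a shell function that takes the model name as an argument. A sketch (the function name and sample model name are our own):

# Report queue time per request for the model passed as the first argument
tritonqueue() {
  curl -s "localhost:8000/v2/models/$1/stats" \
    | jq '.model_stats[].inference_stats.queue' \
    | jq -c '{queue_time_ns_per_request: (.ns / .count), count: .count}'
}

# Usage:
# tritonqueue my_model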

9. Manage ports used by gRPC and Prometheus via the OctoML CLI

Sometimes, when developing locally with lots of microservices, or if you are running two instances of Triton, you may end up with port conflicts. It can be useful to disable gRPC and Prometheus in these cases.

Note: in the example below, we move HTTP to port 8010 so that two Triton instances can run on the same host. The same can be done for the gRPC and metrics servers, but is not shown in the following example.

Add the following arguments to the tritonserver command:

tritonserver --http-port 8010 --allow-grpc false --allow-metrics false --allow-gpu-metrics false

Refer to Tip #5, “Interact with your model and Triton on the deployed container,” for details on executing the tritonserver command inside the deployed container.
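
Alternatively, if you started the container yourself with docker run, you can leave Triton’s internal ports alone and remap them on the host instead. A sketch, assuming the my_model-local:latest image from Tip #1 and Triton’s default ports inside the container:

# Map host ports 8010-8012 to the container's default 8000-8002 (HTTP, gRPC, metrics)
docker run -d -p 8010:8000 -p 8011:8001 -p 8012:8002 my_model-local:latest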

10. Bind a container to a graphics card with Kubernetes

Triton is designed to work with Kubernetes. Both images created with the OctoML CLI tool and default images from NVIDIA can be deployed to Kubernetes directly. However, if you want to ensure that your Triton instance is running with a GPU, there are some non-standard configuration options you need to set.

Set a node selector (on the pod) to force it onto a specific node known to have a graphics card:

    nodeSelector:
      node.kubernetes.io/instance-type: "g4dn.2xlarge"

Then modify the pod template spec to add a resource request to request a GPU:

        resources:
          limits:
            nvidia.com/gpu: 1 # requesting 1 GPU

There is a complete example manifest in the Transparent AI repository: https://github.com/octoml/TransparentAI/blob/main/deploy/helm/templates/deployment.yaml#L126
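
Once the pod is scheduled, you can verify that it landed on the GPU node and can actually see the card. A quick sketch (the pod name is a placeholder):

# Confirm which node the pod was scheduled onto
kubectl get pod <pod-name> -o wide

# Confirm the GPU is visible from inside the container
kubectl exec -it <pod-name> -- nvidia-smi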
