10 tips for OctoML CLI power users to fast-track your model deployments
In this article
1. Keep track of images built by OctoML
2. Leverage OctoML’s automatic Docker hygiene
3. Get the most out of OctoML’s caching feature
4. Do cool things with magic environment variables
5. Interact with your model and Triton on the deployed container
6. Get set up for acceleration like a pro
7. Use cURL to inspect and test the OctoML container
8. Try advanced shell tricks with the Triton cURL interface
9. Manage ports used by gRPC and Prometheus via the OctoML CLI
10. Bind a container to a graphics card with Kubernetes
Last month, we released the OctoML CLI, a free command line interface that packages your deep learning models into Docker containers bundled with NVIDIA Triton Inference Server. The OctoML CLI makes it easy and fast to deploy your trained models into production on CPUs or GPUs in any cloud environment. For cost- or speed-sensitive deployments, you can request a consultation with our engineers for onboarding help, so you can accelerate your models with a variety of acceleration engines, such as Apache TVM, ONNX Runtime, TensorRT, OpenVINO, and more.
If you’re just getting started with the OctoML CLI, check out our short 8-minute developer demo and operations demo to see the OctoML CLI in action. Then explore our end-to-end examples on GitHub to start building your first intelligent applications with OctoML.
The OctoML CLI is a sophisticated and powerful tool for fast-tracking your machine learning deployments. As we’ve worked with our early customers, we’ve gathered our favorite tips for getting the most out of your accelerated ML containers.
Here are ten tips for power users to get even more out of the OctoML CLI:
1. Keep track of images built by OctoML
The OctoML CLI names your Docker images [model name]-[target hardware]:latest (this follows the Docker convention of naming images as repository:tag). For example, if your model is named my_model and you are building and deploying on a GCP Skylake instance (using the optional SaaS acceleration feature), then your built image will be named my_model-skylake:latest.
If you’re building and deploying locally without requesting acceleration, the image will be named my_model-local:latest.
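To see what the CLI has built so far, you can list your local images and filter by your model’s name. A minimal sketch, assuming the my_model example above:
$ docker images
$ docker images my_model-local        # filter to images from this model
$ docker images | grep my_model       # alternative: grep across all repositories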
2. Leverage OctoML’s automatic Docker hygiene
Let’s say you changed your underlying model or some other parameters – which will result in a new image being built – but you kept the same model name in your workspace configuration file. The image that will be built will have the same name as your previous version, which can lead to poor Docker hygiene.
To address this, the OctoML CLI detects old images with conflicting names and cleans them up. If you’d like to keep those old versions around, the CLI will prompt you for a different tag to apply to the old image. Set the environment variable OCTOML_SKIP_TAGGING=1 to skip the prompt and automatically trigger this cleanup.
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
my_model-local latest fe3b8dbb205a 12 minutes ago xxxGB
$ octoml package
<...snip...>
Please enter a version tag if you would like to retain the previous image, otherwise enter `skip`: newtag
<...snip...>
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
my_model-local latest 05ed31e431cf 2 minutes ago xxxGB
my_model-local newtag fe3b8dbb205a 13 minutes ago xxxGB
Notice in the example above that the old image ID is now tagged with the requested newtag, while the most recently built image ID is tagged with latest.
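If you’d rather skip the prompt entirely and discard conflicting old images, you can set the OCTOML_SKIP_TAGGING variable (covered in Tip #4) for a single run. A minimal sketch:
$ OCTOML_SKIP_TAGGING=1 octoml package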
3. Get the most out of OctoML’s caching feature
The OctoML CLI caches any downloaded models, packages, and base images for your convenience and to speed up future OctoML CLI runs. However, if there are naming conflicts with new or updated models, newly built packages or images may not always get updated in the cache. If this occurs, one way to fix it is to clear out the cached items:
docker rmi <image ID>
octoml clean -a
The docker rmi command un-tags and removes the image from Docker’s cache, while octoml clean clears the CLI cache. If you need the image ID, first run docker images to list all images.
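For example, assuming the my_model-local image built earlier in this post, clearing both caches might look like the following sketch (your image names and IDs will differ):
$ docker images                      # find the image or its ID
$ docker rmi my_model-local:latest   # remove by repository:tag (or by image ID)
$ octoml clean -a                    # clear the OctoML CLI cache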
4. Do cool things with magic environment variables
For automation or other neat tricks, the following environment variables may be useful (a combined example follows the list):
OCTOML_AGREE_TO_TERMS=1: Automatically agree to the OctoML terms of service and privacy policy without being prompted.
OCTOML_TELEMETRY=false: Opt out of telemetry, which we collect on CLI usage to improve the tool for all users.
OCTOML_KILL_EXISTING_DOCKER_CONTAINER=1: If you have a container running that was previously deployed by the OctoML CLI, we prompt you to confirm whether you want that container killed before deploying a new container. This flag lets you deploy any newly built images without being prompted.
OCTOML_SKIP_TAGGING=1: If you have old images that will run into a naming conflict with a new run of the OctoML CLI, we prompt you to tag the old image if you’d like to keep it around. Use this flag to automatically clean up (delete) any old images that have conflicting names without being prompted.
OCTOML_ACCESS_TOKEN=<your_token_here>: Use your OctoML SaaS Platform token to accelerate your model.
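For instance, a non-interactive CI job might export these variables before invoking the CLI. This is a minimal sketch; how you inject the access token will depend on your pipeline’s secret management:
$ export OCTOML_AGREE_TO_TERMS=1
$ export OCTOML_TELEMETRY=false
$ export OCTOML_KILL_EXISTING_DOCKER_CONTAINER=1
$ export OCTOML_SKIP_TAGGING=1
$ export OCTOML_ACCESS_TOKEN=<your_token_here>   # placeholder; only needed when requesting acceleration
$ octoml package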
5. Interact with your model and Triton on the deployed container
Once you have a container running via octoml deploy (or via octoml package followed by docker run), you may want to check the packaged model in that container. The docker exec command lets you invoke commands in the running container. By default, the packaged model will be located in the container under octoml/models/.
$ docker exec -it <container id> /bin/bash
root@272a53012852:/opt/tritonserver# cd octoml/models/
root@272a53012852:/opt/tritonserver/octoml/models# ls
my_model
The -i option enables interactive mode to keep STDIN open even if not attached, and -t allocates a pseudo-TTY.
If you need to find your running container ID (which is different from the image ID), you can run docker container ls:
$ docker container ls
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5c7ba68823e4 50196694baee "/opt/nvidia/nvidia_…" 4 hours ago Up 4 hours 0.0.0.0:8000-8002->8000-8002/tcp tedious-brake
A common and useful command to explore from within the container is the tritonserver command, which lets you set dozens of configuration options for Triton Inference Server (note: this command should be run after you’ve already used docker exec to run commands within the container). Run this command to learn more about the available configurations:
$ /opt/tritonserver/bin/tritonserver --help
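You can also run one-off commands without opening an interactive shell by passing the command directly to docker exec. A minimal sketch, reusing the container ID and paths from above:
$ docker exec <container id> ls /opt/tritonserver/octoml/models
$ docker exec <container id> /opt/tritonserver/bin/tritonserver --help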
6. Get set up for acceleration like a pro
octoml init and octoml setup acceleration launch interactive wizards that help you set up the project configuration YAML file for acceleration, including submitting your access token, selecting the hardware targets of interest, and specifying inputs for dynamic models if applicable. These wizards automatically generate well-formed project configuration files on your behalf.
Your project configuration is primarily defined in the octoml.yaml file (which you can find in the same folder where you ran the octoml init command). Here, you define the model you’d like to accelerate and package, as well as any hardware targets, inputs, etc. Read on if you’d like to understand how to write your own config from scratch, which you may need if you plan on fully automating your model deployment for CI/CD pipelines.
First, a bit about the terms we’re using: a dynamic model is a model where some input dimensions (e.g. batch size, input channels) are not specified until runtime. For a static model, those dimensions are known ahead of time or otherwise specified within the model. An example of a dynamic model is YOLOv3 because you have to specify image size and batch size. ResNet50 is an example of a static model since all input dimensions, including batch size, are predefined.
When you request acceleration on a dynamic model, you are required to define your input dimensions. Input dimensions that need to be defined vary for each model. Examples of input dimensions include batch size, image size, number of channels, number of tokens, vocabulary size, embedding length, sequence length, etc.
You do not need to define input dimensions for static models unless you want to override the default shapes and dimensions. There is one exception: when you’re packaging a TorchScript model, you need to manually specify inputs for both static and dynamic models. In doing so, ensure that all your input shape names are suffixed with __#, where the number represents the order in which you defined your inputs in your PyTorch model, indexed starting from 0. You will also need to specify your output shapes in the same way. Here is an example of a configuration file for a TorchScript model that is being packaged without acceleration:
---
models:
  torchscript_mnist:
    url: https://storage.googleapis.com/prod-public-models/torchscript_mnist.pt
    inputs:
      x__0:
        shape:
          - 1
          - 1
          - 28
          - 28
        dtype: fp32
    outputs:
      y__0:
        shape:
          - 1
        dtype: fp32
    type: torchscript
You are welcome to use this as a template to write your own configuration. However, we strongly recommend starting with the setup wizards (i.e. octoml init and octoml setup acceleration) because it means you will not have to worry about any of these formatting details for any models, TorchScript or otherwise. After the setup wizard has created your custom YAML file, you can review it to understand how to generate a similarly formatted file programmatically.
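Putting it together, a typical first-time flow using the commands covered in this post might look like the following sketch (prompts and flags may differ by CLI version):
$ octoml init                  # interactive wizard; writes octoml.yaml
$ octoml setup acceleration    # optional: add your access token and hardware targets
$ octoml package               # build the Docker image for your model
$ octoml deploy                # run the packaged container locally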
7. Use cURL to inspect and test the OctoML container
Triton Inference Server’s tritonserver process is the inference server executable; by default it listens for HTTP requests on port 8000 and for gRPC requests on port 8001.
cURL can make HTTP requests to port 8000, and jq is a JSON processing utility that makes the responses easier to read.
Get names and versions of all loaded models:
$ curl -s -X POST localhost:8000/v2/repository/index | jq '.'
[
  {
    "name": "fmnist",
    "version": "1",
    "state": "READY"
  },
  {
    "name": "simple_int8"
  },
  {
    "name": "simple_string",
    "version": "1",
    "state": "READY"
  }
]
Get configuration for a loaded model, including input shapes:
$ curl -s localhost:8000/v2/models/fmnist | jq '.'
{
  "name": "fmnist",
  "versions": [
    "1"
  ],
  "platform": "tensorflow_savedmodel",
  "inputs": [
    {
      "name": "conv2d_13_input",
      "datatype": "FP32",
      "shape": [
        -1,
        28,
        28,
        1
      ]
    }
  ],
  "outputs": [
    {
      "name": "dense_21",
      "datatype": "FP32",
      "shape": [
        -1,
        10
      ]
    }
  ]
}
Get statistics of a loaded model:
$ curl -s localhost:8000/v2/models/fmnist/stats | jq '.'
{
  "model_stats": [
    {
      "name": "fmnist",
      "version": "1",
      "last_inference": 0,
      "inference_count": 0,
      "execution_count": 0,
      "inference_stats": {
        "success": {
          "count": 0,
          "ns": 0
        },
        "fail": {
          "count": 0,
          "ns": 0
        },
        "queue": {
          "count": 0,
          "ns": 0
        },
        "compute_input": {
          "count": 0,
          "ns": 0
        },
        "compute_infer": {
          "count": 0,
          "ns": 0
        },
        "compute_output": {
          "count": 0,
          "ns": 0
        },
        "cache_hit": {
          "count": 0,
          "ns": 0
        },
        "cache_miss": {
          "count": 0,
          "ns": 0
        }
      },
      "batch_stats": []
    }
  ]
}
Get the configuration of the Triton server:
$ curl -s localhost:8000/v2 | jq '.'
{
  "name": "triton",
  "version": "2.20.0",
  "extensions": [
    "classification",
    "sequence",
    "model_repository",
    "model_repository(unload_dependents)",
    "schedule_policy",
    "model_configuration",
    "system_shared_memory",
    "cuda_shared_memory",
    "binary_tensor_data",
    "statistics",
    "trace"
  ]
}
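Triton also exposes health endpoints on the same HTTP port, which are handy for readiness probes and quick smoke tests. A minimal sketch (the -w option prints the HTTP status code):
$ curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/live
$ curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready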
8. Try advanced shell tricks with the Triton cURL interface
You can chain jq and shell commands together to build useful tools. The following one-line command gets the queue time (in nanoseconds) per request for a model; substitute your model’s name (for example, py_model) for <your_model_name>.
$ curl -s localhost:8000/v2/models/<your_model_name>/stats | jq '.model_stats[].inference_stats.queue' | jq -c '{queue_time_ns_per_request: (.ns / .count), count: .count }'
You can even define a shell alias for commonly used tasks:
alias tritonindex="curl -s -X POST localhost:8000/v2/repository/index | jq '.'"
Then it’s possible to run it directly:
$ tritonindex
{
  "name": "simple_string",
  "version": "1",
  "state": "READY"
}
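Another handy pattern is to poll the stats endpoint while you send traffic, so you can watch inference counts climb. A sketch, assuming the watch utility is installed and substituting your model name:
$ watch -n 5 "curl -s localhost:8000/v2/models/<your_model_name>/stats | jq '.model_stats[].inference_count'"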
9. Manage ports used by gRPC and Prometheus via the OctoML CLI
Sometimes, when developing locally with lots of microservices, or if you are running two instances of Triton, you may end up with port conflicts. In these cases it can be useful to disable gRPC and Prometheus, or to move the HTTP server off the default port 8000 to another port, such as 8010, to enable two Triton instances on the same host. The same can be done for the gRPC and metrics servers, but that is not shown in the following example. Add the following arguments to the tritonserver command:
tritonserver --http-port 8010 --allow-grpc false --allow-metrics false --allow-gpu-metrics false
Refer to Tip #5, “Interact with your model and Triton on the deployed container,” for details on executing the tritonserver command inside the deployed container.
10. Bind a container to a graphics card with Kubernetes
Triton is designed to work with Kubernetes. Both images created with the OctoML CLI and default images from NVIDIA can be deployed to Kubernetes directly. However, if you want to ensure that your Triton instance is running with a GPU, there are some non-standard configuration options you need to set.
First, set a node selector on the pod to force it onto a specific node type known to have a graphics card:
nodeSelector:
  node.kubernetes.io/instance-type: "g4dn.2xlarge"
Then modify the pod template spec to add a resource request to request a GPU:
resources:
  limits:
    nvidia.com/gpu: 1 # requesting 1 GPU
There is a complete example manifest in the Transparent AI repository: https://github.com/octoml/TransparentAI/blob/main/deploy/helm/templates/deployment.yaml#L126
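Once the pod is scheduled, a quick way to confirm that the GPU is actually visible inside the container is to run nvidia-smi in the pod. A sketch, assuming kubectl access and substituting your pod name:
$ kubectl get pods -o wide            # find the pod and the node it landed on
$ kubectl exec -it <pod-name> -- nvidia-smi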