Quickly deploy YOLOv5 to your choice of hardware in any cloud using the new OctoML CLI

Vanessa Yan
Sameer Farooqui

Nov 25, 2022

Follow along with our new YOLOv5 deployment tutorial to power your next object detection application. Or, watch this tutorial video by Smitha Kolan on how to deploy YOLOv5 in under 15 minutes using the OctoML CLI.

Today, we are excited to announce the results of our collaboration with Ultralytics to deploy YOLOv5 models to over 100 CPU and GPU hardware targets in AWS, Azure and GCP. Our engineering work with Ultralytics unlocks the ability to deploy YOLOv5 models on hardware from Intel, NVIDIA, Arm and AWS with minimal effort and cost. In this blog, we'll show you how simple it is to achieve hardware independence and cost savings across multiple clouds.

YOLOv5 models can be used for three tasks: object detection, image classification, and image segmentation. When used for object detection, the most popular use case, YOLOv5 adds bounding boxes around one or more objects detected in an image. By default, YOLOv5 models are pretrained on the COCO 2017 dataset, which contains 80 object categories (such as person or bicycle) across more than 200,000 images. For example, pass in an image of Einstein riding a bike, and YOLOv5 identifies the following objects:

Object detection via YOLOv5s detects a person and 2 bikes
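
If you want to reproduce this locally, Ultralytics' detect.py runs a pretrained checkpoint against any image with a single command (the image path below is a placeholder):

>> python detect.py --weights yolov5s.pt --source path/to/einstein-bike.jpg

The annotated output image is written under the runs/detect/ directory by default.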

The YOLOv5 architecture is effective for vision tasks, and there are several model variants to choose from to find the ideal balance of accuracy vs. inference speed. Most importantly, Ultralytics' investment in supporting the model (helper inference scripts, documentation, pretraining, etc.) makes it easy for users to get started.

However, deploying a model to production brings new challenges. Once an ML model candidate, such as YOLOv5, has been identified, there are typically three phases in an engineer's journey to deploy that model: quick local prototyping, ideal hardware identification, and production deployment. Over the past few months, OctoML engineers have been working with Ultralytics to address the challenges in each of these phases.

For local prototyping, the OctoML CLI can ingest any of the 10 YOLOv5 model variants and, within minutes, create an immediately deployable Docker container. The container is packaged with the model and a Triton Inference Server. OctoML has ensured that Ultralytics' inference script (detect.py) works well with OctoML containers, so you only need to run a simple command for detection, classification or segmentation. For more details, check out the tutorial we're launching today on deploying YOLOv5 to a container.

Second, hardware selection is a critical, but often overlooked, part of productionizing a model. The hardware you choose ultimately determines your long-term ML production costs and user experience. For any hardware target you seek to explore, OctoML delivers maximal cost reduction, latency improvement, and/or throughput improvement. We do so by creating an automated search over state-of-the-art acceleration engines like Apache TVM, ONNX Runtime and TensorRT. Now, after just a handful of clicks in the OctoML UI or via an automated workflow using OctoML CLI, you can accelerate and benchmark any YOLOv5 model on over a hundred hardware targets in AWS, Azure and GCP. Furthermore, OctoML’s web UI and CLI emit Docker container artifacts that can quickly be deployed to managed Kubernetes services in the cloud. If you later decide to change your instance type or even cloud provider, OctoML gives you the portability to create a new container artifact to quickly switch over to new hardware platforms.

Finally, production deployment of a model presents a different set of challenges, such as dependency management, CI/CD, Kubernetes integration, logging and monitoring. Once you upload your model to OctoML and select your hardware target, OctoML does the hard work for you. We accelerate and package your model into a production-ready container that can be deployed to any managed Kubernetes service in the cloud, such as Amazon EKS or Azure AKS, with our provided Helm charts. Additionally, OctoML containers ship with Prometheus metrics endpoints that are useful for monitoring in production. Want to update your production model every week? The OctoML CLI can automate the entire process to make your CI/CD integration as seamless as possible.
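
As a rough sketch of what that can look like in practice, assuming you already have a cluster and have pushed the OctoML-built image to a registry (the chart path, release name, and image repository below are placeholders for illustration; Triton serves Prometheus metrics on port 8002 by default):

>> helm install yolov5-octoml ./octoml-chart --set image.repository=my-registry/yolov5-octoml
>> kubectl port-forward svc/yolov5-octoml 8002:8002
>> curl http://localhost:8002/metrics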

Let’s take a closer look at how OctoML works to fast-track your YOLOv5 deployment workflow.

Local prototyping with OctoML CLI and Docker Desktop

Running a new ML model locally on your laptop typically involves installing over a dozen dependencies such as numpy, torch, and opencv, and YOLOv5 is no different. Installing new libraries can be arduous once version and environment management enter the picture. Now, with the OctoML CLI, you can leverage the power of Docker containers to keep your core laptop environment clean and quickly get a container running locally with your model and all of its inference requirements pre-installed.
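
For comparison, the conventional bare-metal setup means cloning the repository and installing its full requirements directly into your local environment:

>> git clone https://github.com/ultralytics/yolov5
>> cd yolov5 && pip install -r requirements.txt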

Using the OctoML CLI, deploy YOLOv5 to a Docker Desktop container and run inference within minutes

Here’s how it works:

  1. Select any of the YOLOv5 pre-trained checkpoints that you’d like to prototype with
  2. Create a client code container with Ultralytics’ detect.py script
  3. Run OctoML CLI to package and containerize your model with Triton Inference Server
  4. Run inference!

Follow along with our new step-by-step tutorial to create your first object detection application.

It only takes 3 commands to package and deploy YOLOv5 to a container in Docker Desktop
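
In outline, the flow follows the CLI's init, package, and deploy pattern (see the tutorial for the canonical commands and flags for your CLI version):

>> octoml init      # generates an octoml.yaml describing your model and its inputs
>> octoml package   # builds a Docker image bundling the model with Triton Inference Server
>> octoml deploy    # starts the container locally, visible in Docker Desktop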


By following the tutorial, you will have a running Docker container with your selected YOLOv5 model and NVIDIA's Triton Inference Server:

A running Docker container with your selected YOLOv5 model and NVIDIA's Triton Inference Server
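
You can also confirm the container is up from the terminal; Triton listens on port 8000 for HTTP, 8001 for gRPC and 8002 for metrics:

>> docker ps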

The final step in the local prototyping phase is to send an image to YOLOv5 for inference. Through our collaboration with Ultralytics, detect.py has been updated to make running inference against the Triton Inference Server (listening at localhost:8000) as easy as a single command:

>> python detect.py --source /workspace/docker-mount/images-einstein.jpeg \
   --weights http://localhost:8000

The updated detect.py makes running inference against the Triton Inference Server simple
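
Before sending images, you can optionally verify that Triton is ready to accept requests via its standard health endpoint, which returns HTTP 200 once the server is ready:

>> curl -i http://localhost:8000/v2/health/ready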

Achieve hardware independence with automated acceleration and benchmarking

When you deploy YOLOv5 with OctoML, you can choose from over a hundred hardware types in AWS, Azure or Google Cloud. Ideally, before you choose your final production hardware target, you'll want to find the best possible performance for your model on each hardware option by exploring multiple acceleration engines, potentially across multiple clouds. But this process is extremely time consuming and requires skills outside the realm of a typical data scientist's or ML engineer's training. So most teams skip it and choose a cloud instance based on assumptions or on benchmarks of other, related models, leaving cost savings on the table.

With OctoML you can upload your model, click through to select which hardware candidates you wish to explore, and let us do the hard work for you. Our platform will accelerate your model for each hardware target and show you inference times for each. It takes the guesswork out of hardware selection. Deployment-related tradeoffs between cost and inference speed can now be made intelligently.

We ran the most popular YOLOv5 variant, YOLOv5s, which offers a good balance of accuracy and speed, through OctoML's SaaS product to understand how inference speeds vary across hardware targets. You can see the wide range of inference times below, from 4.20 ms on NVIDIA V100 (with an on-demand hourly rate of $3.06) to 42.93 ms on Graviton3 (with an on-demand hourly rate of $0.578) across AWS and Azure targets (prices accurate as of Sept 2022):

OctoML’s automated benchmarking helps you identify the ideal hardware and deployment configs that give you the most cost savings while still meeting the latency SLAs of your application


OctoML’s automated benchmarking helps you explore the tradeoffs between latency and throughput for your model, so that you can intelligently pick a hardware target and deployment configuration that meets your SLAs


OctoML offers inference benchmarks and deployable packages on over 100 cloud instances

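To make the cost tradeoff concrete with a back-of-the-envelope calculation (assuming a single stream of back-to-back inferences and no concurrency): at 4.20 ms per inference, the $3.06/hour V100 serves roughly 857,000 inferences per hour, or about $3.57 per million inferences, while at 42.93 ms the $0.578/hour Graviton3 serves roughly 84,000 per hour, or about $6.89 per million. The instance with the lowest hourly rate is not necessarily the cheapest per inference.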

Productionize YOLOv5 with OctoML

We're here to help you get your object detection, image segmentation, and image classification projects over the finish line and into production! Request a consultation with our engineering staff today to get your toughest model deployment questions answered. Our world-class experts can guide you through hardware selection, answer your technical questions about machine learning compilation or ONNX Runtime, and help integrate your ML application with NVIDIA's Triton Inference Server for a robust and successful deployment.
