More models into production, faster
We convert your deep learning computer vision models into inference packages, automatically producing a Docker image that is deployable into production.

Reduce costs while meeting business and customer needs
The OctoML Platform accelerates and benchmarks your computer vision models for your chosen targets, allowing you to meet technical or business SLAs and improve your users' experience.


Video content moderation
Analyze images, videos, live streams, and other media content in real time or via offline batch processing. When model updates are inevitably required, deploy your retrained model quickly with our automated tooling and workflows.


Medical imaging
Analyze medical images like X-rays, MRIs, and CT scans with speed and accuracy, without uploading private data to the cloud.


Smart cameras and computational photography
Optimize and deploy your ML models in the cloud to achieve high-frame-rate object detection and lower energy usage.

Upload your model for automated optimization
OctoML has certified vision models for successful ingestion and acceleration, including popular architectures such as YOLOv5, MobileNetV2, and ResNet. You can automatically optimize your model across 5 acceleration engines and choose from over 80 cloud targets.
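For illustration, here is a minimal sketch of preparing a model for upload, assuming a PyTorch MobileNetV2 exported to ONNX; the file name, input shape, and opset version are illustrative, not platform requirements:

```python
# Minimal sketch: exporting a vision model to ONNX before uploading it
# for optimization. Assumes PyTorch and torchvision are installed; the
# file name and input shape below are illustrative choices.
import torch
import torchvision

model = torchvision.models.mobilenet_v2(weights="DEFAULT").eval()
dummy_input = torch.randn(1, 3, 224, 224)  # NCHW batch of one RGB image

torch.onnx.export(
    model,
    dummy_input,
    "mobilenet_v2.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},  # allow variable batch size
    opset_version=13,
)
```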

With Apache TVM, Microsoft Research develops and serves the latest computer vision algorithms on live streams

Read about our work

OctoML drives down production AI inference costs at Microsoft through new integration with ONNX Runtime ecosystem
Today, we’re excited to announce the results from our second phase of partnership with Microsoft to continue driving down AI inference compute costs by reducing inference latencies. Over the past year, OctoML engineers worked closely with Watch For to design and implement the TVM Execution Provider (EP) for ONNX Runtime, bringing the model optimization potential of Apache TVM to all ONNX Runtime users. This builds upon the collaboration we began in 2021 to bring the benefits of TVM’s code generation and flexible quantization support to production scale at Microsoft.
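For readers who want to try the integration, here is a minimal sketch of selecting the TVM EP, assuming an ONNX Runtime build compiled with TVM support; the model path and input name are illustrative:

```python
# Minimal sketch: running an ONNX model through the TVM Execution
# Provider. Assumes an ONNX Runtime build with the TVM EP enabled;
# falls back to the default CPU EP if TVM is unavailable.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "mobilenet_v2.onnx",  # illustrative model path
    providers=["TvmExecutionProvider", "CPUExecutionProvider"],
)

batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {"input": batch})  # "input" matches the export above
print(outputs[0].shape)
```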

DIY Gen AI: Evaluating GPT-J Deployment Options
Model optimizations can save you millions of dollars over the life of your application. Additional savings can be realized by finding the lowest cost hardware to run AI/ML workloads. Companies building AI-powered apps will need to do both if they want a fighting chance at building a sustainable business.

Faster machine learning everywhere

Maximize Performance
Accelerate your models through 5 engines and package them for 100+ hardware targets.

Comprehensive Benchmarking
Get the best performance and lowest cost for running models in production.

Portable Deployment
Deploy in minutes using the OctoML CLI, which outputs a Docker image package.
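As a rough sketch of what a deployment check can look like, assuming the generated Docker image serves the standard KServe v2 HTTP inference protocol on port 8000 (the image tag and model name below are hypothetical, not OctoML-defined values):

```python
# Minimal sketch: probing a locally running inference container that
# exposes the KServe v2 HTTP protocol. The image tag, port mapping,
# and model name are hypothetical; adjust to what the CLI produced.
# Start the container first, e.g.: docker run -p 8000:8000 my-model:latest
import requests

BASE = "http://localhost:8000"

# Readiness probe defined by the KServe v2 protocol.
ready = requests.get(f"{BASE}/v2/health/ready", timeout=5)
print("server ready:", ready.status_code == 200)

# Model metadata (name, inputs, outputs) for a hypothetical model name.
meta = requests.get(f"{BASE}/v2/models/my-model", timeout=5)
print(meta.json())
```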
