OctoML joins the community effort to democratize MLPerf inference benchmarking

Grigori Fursin

Sep 23, 2021

Authors: Grigori Fursin, Thierry Moreau

Among the MLPerf Inference: Datacenter results released this week were the first two submissions accelerated via Apache TVM and automated with MLCommons' Collective Knowledge framework. By entering the machine learning inference benchmarking competition with our powerful and flexible open-source software stack, we’re creating a blueprint for future submissions that will be completely vendor-, hardware- and ML-framework-agnostic while still delivering world-class results.

Why is ML benchmarking difficult?

The past few years have seen an explosion of hardware devices that can run machine learning inference, from server-grade GPUs and CPUs in the cloud to low-power accelerators and embedded systems at the edge. Over 100 organizations are now building ML inference chips, with hardware that spans three orders of magnitude in power consumption and five orders of magnitude in performance. Throw in a couple dozen ML software frameworks and libraries competing to power this wide range of hardware, and you can see how challenging it becomes to pick the ideal hardware/software stack for each use case.

MLCommons set out to address this challenge by establishing the MLPerf Inference Benchmark in May 2020, providing a way to consistently measure the accuracy, speed and power usage of machine learning models across a myriad of hardware and software combinations. Ultimately, the MLPerf Inference Benchmark measures, in a standardized way, how fast systems can process inputs and produce inference results from a trained model.
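To make "how fast systems can process inputs and produce inference results" concrete, here is a minimal sketch of timing a model one query at a time. It is not how MLPerf measures performance (MLPerf uses its own LoadGen tool and defined scenarios); the `benchmark` and `toy_model` names, the warm-up count and the choice of a 90th-percentile latency are all assumptions made for illustration.

```python
import math
import statistics
import time
from typing import Callable, Sequence


def benchmark(predict: Callable[[object], object],
              samples: Sequence[object],
              warmup: int = 5) -> dict:
    """Time `predict` on each sample, issuing one query at a time.

    Hypothetical helper for illustration; not part of MLPerf, TVM or CK.
    """
    if not samples:
        raise ValueError("need at least one sample")

    # Warm-up runs are excluded from timing so one-off start-up
    # costs do not skew the measured latencies.
    for sample in samples[:warmup]:
        predict(sample)

    latencies = []
    start = time.perf_counter()
    for sample in samples:
        t0 = time.perf_counter()
        predict(sample)
        latencies.append(time.perf_counter() - t0)
    total = time.perf_counter() - start

    latencies.sort()
    # Nearest-rank 90th percentile (an assumed choice of tail metric).
    p90_index = max(0, math.ceil(0.9 * len(latencies)) - 1)
    return {
        "queries": len(latencies),
        "throughput_qps": len(latencies) / total,
        "mean_latency_s": statistics.mean(latencies),
        "p90_latency_s": latencies[p90_index],
    }


def toy_model(x):
    # Stand-in for a real trained model: any callable works here.
    return sum(v * v for v in x)


if __name__ == "__main__":
    data = [[float(i)] * 1000 for i in range(50)]
    print(benchmark(toy_model, data))
```

Even this toy version shows why a standard matters: the reported numbers depend on warm-up policy, how queries are issued and which latency percentile is chosen, which is what a shared benchmark pins down.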

So far, about 20 organizations have submitted results to the MLPerf Inference Benchmark. With hundreds of organizations active in the ML inference software and hardware industry, we wondered why more of them have not entered the benchmark to showcase how well ML models run on their hardware or when tuned by their software stacks. We realized that the barrier to entry for MLPerf is quite high.

The first challenge for new participants is that getting an ML model to run efficiently on a new hardware platform requires manually building a complex pipeline that bridges various layers of the ML stack. Each layer adds more options and configuration parameters. This process often requires several months of effort from a dedicated ML systems engineering team.

The second challenge lies in optimizing the chosen full stack and fine-tuning it with nascent optimizations to achieve world-class performance, an exceedingly difficult task that is currently only achievable with the resources available to the largest organizations.

How Apache TVM and Collective Knowledge simplify ML inference benchmarking

At OctoML, we were particularly interested in these challenges because our academic and industry backgrounds center on making this process easier. Five years ago we created the Apache TVM deep learning compiler to automate the process of bringing machine learning to a wide diversity of hardware devices, without relying on narrow, vendor-specific libraries. In parallel, we designed the Collective Knowledge (CK) framework to enable collaborative and reproducible ML systems benchmarking. By combining Apache TVM with CK, we’re proud to demonstrate a new way to bring ML models to life, faster than ever before.

Our flexible open-source stack lowers the barrier to entry for future participants by providing a powerful framework that is agnostic to both hardware and training framework and can optimize almost any deep learning model on almost any deployment hardware.

Our two pioneering submissions to this year’s MLPerf Inference benchmark for data centers demonstrate what TVM and CK together can achieve when optimizing an image classification model across multiple clouds. We optimized the ResNet50 model on an Amazon EC2 node with an Intel Xeon Platinum CPU and on a Google Cloud node with an Intel Cascade Lake CPU. Most importantly, we achieved our results in a matter of days, instead of the months that previous manual techniques would take.

With the recent donation of Collective Knowledge to the MLCommons foundation, we are excited to work with the community to continue lowering the barrier to entry for new MLPerf submitters.

Help us build the future of ML inference benchmarking and automation by participating in the Apache TVM and MLCommons communities.

We’d like to acknowledge and thank Alexander Peskov and Ilya Slavutin from Deelvin, Thomas Zhu from Oxford University, and the whole MLCommons community for their contributions, suggestions and feedback.
