Today we’re excited to announce our partnership with Arm, which highlights our collaboration across a broad array of hardware and embedded systems. This relationship - spanning both the Apache TVM open source project and our commercial OctoML platform - showcases the expanding ecosystem of partners OctoML is engaging to help customers succeed with their machine learning needs. Today, customers face a complex, largely manual effort to accelerate ML models as they move their trained models into production. Our Machine Learning Deployment Platform gives customers choice, automation, and performance, whether they are cost-optimizing inference or building new ML services that weren’t previously possible on IoT and embedded systems.
With that as the broader context, we’re eager to share world-class performance results for Arm Cortex-A72 processors: TVM achieves an average latency speedup of 1.9x over ONNX Runtime and 2.18x over TensorFlow Lite across a wide range of vision models. These results are a fantastic first step as OctoML adds Arm-based hardware support to our commercial platform. Anyone can now apply the same acceleration to their custom ML models with a few clicks or API calls in the OctoML platform.
What is the Cortex-A72?
The Cortex-A72 is a high-performance, low-power 64-bit processor that launched in April 2015 with a focus on efficiency. Arm designed the microarchitecture for the mobile market as a synthesizable IP core, sold to other semiconductor companies for implementation in their own chips, such as Broadcom’s BCM2711 (used in the Raspberry Pi 4), Qualcomm’s Snapdragon 650/652/653, Texas Instruments’ Jacinto 7 family of automotive and industrial SoCs, and NXP’s i.MX8/Layerscape chips. With this broad adoption, the Cortex-A class backs some of the largest edge device categories in the world, from mobile phones and security cameras to autonomous vehicles, IoT devices, wireless and networking infrastructure, and other industry-specific machines.
Machine Learning at the edge is challenging
Running inference for ML models on edge devices comes with many benefits, such as significantly lower cloud computing costs and a faster user experience, because data doesn’t have to travel to the cloud and back. But running production-level deep neural network applications at the edge also imposes new constraints: power consumption and compute and memory budgets are far more restricted than in traditional cloud environments. Models trained in frameworks such as TensorFlow or PyTorch are typically built for cloud-based applications, where they can be scaled horizontally by adding more servers to meet growing user demand. Edge devices, on the other hand, must make better use of the underlying hardware and meet efficiency requirements by optimizing the model itself for inference.
This is exactly the challenge the OctoML team has been focused on solving through open source collaboration with companies like Arm via the Apache TVM ML stack, which targets both performance and portability. Now, with the addition of the Cortex-A72 to the OctoML Machine Learning Deployment Platform, ML engineers get the power of TVM’s cutting-edge acceleration techniques with the ease of OctoML’s unified platform, unlocking the ability to accelerate the latest model architectures.
Apache TVM performance on the Cortex-A72
To understand how Apache TVM’s model acceleration compares to TensorFlow Lite (TF Lite) and ONNX Runtime, we pulled all 25 hosted floating-point sample models from TF Lite’s pre-trained model zoo. We then optimized these FP32 models with three software stacks, following each one’s best practices: TVM 0.7 (within OctoML), default TF Lite 2.5.0.post1, and ONNX Runtime 1.8.1.
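At its core, each per-framework measurement repeatedly invokes the model’s inference call and summarizes the observed latencies. A minimal, framework-agnostic sketch of such a harness is below; the `run_once` callable is a hypothetical stand-in for a real inference call (in practice it would wrap a TVM runtime module, a TF Lite interpreter, or an ONNX Runtime session), and the warmup/run counts are illustrative, not the ones used in our benchmarks.

```python
import time
import statistics

def benchmark(run_once, warmup=10, runs=100):
    """Time a zero-argument inference callable; return latency stats in ms."""
    for _ in range(warmup):
        run_once()  # warm caches and any lazy initialization before timing
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        latencies.append((time.perf_counter() - start) * 1e3)
    latencies.sort()
    return {
        "mean_ms": statistics.mean(latencies),
        "median_ms": statistics.median(latencies),
        "p95_ms": latencies[int(0.95 * runs) - 1],
    }

# Dummy workload standing in for model inference.
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

Reporting a median (or percentile) rather than a single run matters on small boards like the Raspberry Pi 4, where thermal throttling and background processes add run-to-run noise.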
Our performance benchmarks show that TVM achieves best-in-class performance across this wide range of deep learning vision models, with an average latency speedup of 1.9x over ONNX Runtime and 2.18x over TF Lite.
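Speedup here means the ratio of the baseline framework’s latency to TVM’s latency on the same model, averaged across the model suite. The arithmetic, using hypothetical latency numbers rather than our actual benchmark data:

```python
def speedup(baseline_ms, optimized_ms):
    """Speedup = baseline latency / optimized latency (higher is better)."""
    return baseline_ms / optimized_ms

# Hypothetical per-model latencies in ms, for illustration only.
tflite_ms, tvm_ms = 42.0, 20.0
print(f"{speedup(tflite_ms, tvm_ms):.2f}x")  # → 2.10x
```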