Bringing ML to the Azure Sphere with TVM

Thierry Moreau

Jul 8, 2020

Last time, we talked about how TVM brings needed machine learning capabilities to bare-metal embedded devices by laying an open-source foundation that allows anyone to optimize and deploy deep learning workloads on a variety of hardware devices. But how would you use TVM in practice to deploy machine learning capabilities on secure cloud-connected IoT devices like Microsoft Azure Sphere?

Apache TVM empowers programmers to deploy customized deep learning models from popular frameworks (PyTorch, TensorFlow, MXNet, etc.) onto various cloud and edge CPU and GPU backends. OctoML is committed to making TVM an industry-leading open-source deep learning compiler and is investing in further automating model optimization, compilation, and packaging for all hardware backends, including cloud-connected IoT devices, with its OctoML SaaS platform.

In light of the development of µTVM (pronounced: “micro-TVM”), TVM’s microcontroller backend, we wanted to show how one would leverage Apache TVM to seamlessly optimize, compile, and deploy any tiny deep learning model on a cloud-connected IoT device, the Azure Sphere MT3620.

In this post, we demonstrate:

  • TVM’s ability to ingest an off-the-shelf pre-trained wake-word detection model (e.g. TensorFlow), and apply quantization to reduce the parameter footprint of the model by 2.3x.
  • ML-guided automated performance optimizations to get the network layers highly optimized on the Azure Sphere’s A7 core and reduce inference latency by 32%.
  • A complete application deployment on Azure Sphere that performs speech pre-processing on the Cortex-M4F core, and network inference on the Cortex-A7 core, along with code to reproduce the example.

And stay tuned for more info on how we'll soon automate this entire process for you as part of our OctoML product, now in private beta.

Deep Learning Challenges on Azure Sphere

Microsoft’s Azure Sphere is a high-level application platform for the Internet of Things that supports secure high-level applications along with real-time I/O processing and cloud connectivity. Our study targets the MT3620, a recently released Azure Sphere platform that features an ARM Cortex-A7, two ARM Cortex-M4F cores, 4MB of SRAM, and 1MB of flash memory. The Azure Sphere Cortex-A7 core runs a custom Linux kernel with limited functionality. Some of the limitations that make this platform challenging for running deep learning algorithms are:

  • Memory: the Cortex-A7 core on Azure Sphere exposes only 256KB of SRAM for user applications despite the platform's total of 4MB of SRAM. The Cortex-M4F cores each have 192KB of TCM and 64KB of SRAM, for a total of 256KB of RAM per core. In addition, since the Azure Sphere design focuses on security, the platform limits memory operations such as dynamic library loading.
  • Linux kernel: the Azure Sphere operating system supports only basic functionality and provides no bash environment. In addition, C++ is not supported by the provided GCC compiler.

The TVM compiler stack brings several tools to tackle the challenges of the Azure Sphere platform. TVM provides a lightweight MISRA-C runtime with a pure C implementation for standalone deployment on devices like Azure Sphere. In addition, TVM provides quantization tools, which add the flexibility to deploy larger models without the need for retraining.
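To illustrate the idea behind these quantization tools (this is a minimal NumPy sketch of symmetric global-scale int8 quantization, not TVM's actual quantization pass; the tensor contents are made up):

```python
import numpy as np

def quantize_int8(weights, global_scale=4.0):
    """Map float32 weights to int8 with a single global scale.

    Values are scaled so that +/- global_scale maps onto the int8 range,
    then rounded and clipped. Storage shrinks 4x (32-bit -> 8-bit).
    """
    scale = global_scale / 127.0          # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.uniform(-1, 1, size=(64, 10)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q.dtype, w.nbytes // q.nbytes)       # int8 4
print(np.max(np.abs(w - w_hat)) <= scale)  # True: error bounded by one step
```

The single global scale trades accuracy for simplicity, which is why (as shown later in this post) some layers may need to be skipped to preserve accuracy.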

AutoTVM automates deep learning inference latency optimization on Azure Sphere.

In this blog post, we deploy and run a keyword spotting model and evaluate the resulting memory usage and runtime. Finally, we show a real-time end-to-end demo of this model using an analog microphone connected to the Azure Sphere.

“Hello Azure Sphere” — A Keyword Spotting Model

Keyword Spotting (KWS) is a critical task for enabling speech detection on smart devices and requires real-time processing and high accuracy. The Hello Edge paper explores depthwise separable convolutional neural networks (DS-CNNs). On the speech commands dataset, DS-CNN achieves 95% accuracy, 10% higher than other DNN models with similar parameter sizes. Below, we show the full architecture of the KWS DS-CNN model.

Keyword Spotting DS-CNN model architecture. This model requires two steps for deployment: audio pre-processing and DS-CNN inference.

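The parameter savings that make DS-CNN attractive come from splitting each k×k convolution into a depthwise stage and a pointwise stage. A quick back-of-the-envelope comparison (the layer shape below is hypothetical, not one of the actual KWS layer sizes):

```python
def conv_params(k, c_in, c_out):
    # standard convolution: one k x k x c_in filter per output channel
    return k * k * c_in * c_out

def ds_conv_params(k, c_in, c_out):
    # depthwise (one k x k filter per input channel) + pointwise (1x1 mixing)
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 64, 64            # hypothetical layer shape
std = conv_params(k, c_in, c_out)     # 36864 parameters
ds = ds_conv_params(k, c_in, c_out)   # 576 + 4096 = 4672 parameters
print(std, ds, round(std / ds, 1))    # 36864 4672 7.9
```

Roughly an 8x reduction per layer at this shape, which is what lets DS-CNN match larger models at a fraction of the parameter budget.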
We focus on the TensorFlow DS-CNN model developed by ARM, which has 12 output labels: 10 words (“Yes”, “No”, “Stop”, “Go”, etc.) plus silence and unknown (other words). We deploy the model on the Azure Sphere’s Cortex-A7, with audio pre-processing performed on the on-board Cortex-M4 in near real-time with an analog microphone. See the figure below for an architecture diagram of our demo setup. In this architecture, we deploy the audio pre-processing, including MFCC feature extraction, on the Cortex-M4, which has floating-point unit support. We pass the MFCC features to the application core, since the Cortex-A7 has enough memory to execute the MISRA-C runtime for a model with a large parameter footprint. Finally, we change the color of four LEDs based on the output label.

KWS pipeline implemented on Azure Sphere M4 and A7 cores.

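Since the M4 firmware itself is linked rather than listed here, a NumPy sketch helps picture what the pre-processing core computes. The pipeline below (frame, window, power spectrum, mel filterbank, log, DCT-II) is the standard MFCC recipe; the frame sizes, filter counts, and FFT length are illustrative, not the demo's exact settings:

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=640, hop=320, n_fft=1024,
         n_mels=20, n_mfcc=10):
    """Minimal MFCC: frame -> Hann window -> power spectrum
    -> mel filterbank -> log -> DCT-II (keep first n_mfcc coeffs)."""
    # 1. slice into overlapping frames and apply a Hann window
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)

    # 2. power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2  # (n_frames, n_fft//2+1)

    # 3. triangular mel filterbank
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)          # (n_frames, n_mels)

    # 4. DCT-II to decorrelate the log-mel energies
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n[:, None] + 0.5) * np.arange(n_mfcc)[None, :])
    return log_mel @ dct                               # (n_frames, n_mfcc)

audio = np.random.randn(16000).astype(np.float32)      # 1 s of fake audio
features = mfcc(audio)
print(features.shape)                                   # (49, 10)
```

On the M4 this is done in fixed-size C buffers per audio window rather than over a whole clip at once, but the math is the same.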
We import the KWS model from TensorFlow into Relay to build the TVM module. The pre-trained model provided by ARM is in float32 and has a few pre-processing layers that are not yet supported in TVM. We wanted to offload those operations to the Cortex-M4, so we first remove those layers from the TensorFlow graph and then import it into Relay. The out-of-the-box model does not fit into Azure Sphere due to memory limitations, so we use TVM's quantization tool to reduce the parameter size. We then evaluate the accuracy before and after quantization using 1000 samples from the speech commands dataset. For this quantization, we set global scale = 4.0 (full script).

In the plot below, we first show the accuracy of the original model without quantization. When we quantize all layers of the model, accuracy drops significantly. However, by skipping just the first convolution layer during quantization, we achieve accuracy similar to the original model while still realizing large parameter size savings: quantization reduces the parameter size from 95KB to 41KB. This allows us to deploy the KWS model on Azure Sphere despite its stringent memory constraints. The plot also shows the breakdown of memory usage on the Azure Sphere Cortex-A7.
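The effect of skipping a layer on the parameter footprint can be sketched with hypothetical per-layer parameter counts (these are not the real DS-CNN layer sizes): quantized layers store 1 byte per parameter instead of 4, while skipped layers stay in float32.

```python
# hypothetical per-layer parameter counts, for illustration only
layers = {"conv0": 2000, "ds_conv1": 5000, "ds_conv2": 5000, "fc": 3000}

def footprint_bytes(layers, skip=()):
    # float32 weights cost 4 bytes each; int8-quantized weights cost 1 byte
    return sum(n * (4 if name in skip else 1) for name, n in layers.items())

full_fp32 = sum(n * 4 for n in layers.values())       # everything in float32
all_int8 = footprint_bytes(layers)                    # everything quantized
skip_first = footprint_bytes(layers, skip={"conv0"})  # keep conv0 in float32
print(full_fp32, all_int8, skip_first)                # 60000 15000 21000
```

Keeping one small early layer in float32 costs only a modest amount of the savings while, as the accuracy plot shows, preserving most of the original model's accuracy.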

Reducing the parameter size allows us to deploy the standalone runtime on Azure Sphere. We deploy the KWS model, measure its runtime, and then perform autotuning to improve performance. As the plot shows, tuning improves the runtime performance by 32%.
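AutoTVM's tuning can be pictured as measuring candidate schedule configurations on the device and keeping the fastest one. Here is a toy version of that loop; the knobs are typical of convolution schedules, and the cost function is a made-up stand-in for real on-device timing:

```python
import itertools

# hypothetical knobs a schedule might expose for one convolution layer
search_space = {
    "tile_x": [1, 2, 4, 8],
    "tile_y": [1, 2, 4, 8],
    "unroll": [0, 1],
}

def measured_latency_ms(cfg):
    # stand-in for compiling the kernel, running it on the board, and timing it;
    # this toy cost happens to prefer moderate tiling and unrolling
    return (abs(cfg["tile_x"] - 4) + abs(cfg["tile_y"] - 2)
            + (0 if cfg["unroll"] else 1) + 1.0)

best_cfg, best_ms = None, float("inf")
for values in itertools.product(*search_space.values()):
    cfg = dict(zip(search_space.keys(), values))
    ms = measured_latency_ms(cfg)
    if ms < best_ms:
        best_cfg, best_ms = cfg, ms
print(best_cfg, best_ms)  # {'tile_x': 4, 'tile_y': 2, 'unroll': 1} 1.0
```

Real AutoTVM replaces this exhaustive sweep with an ML-guided search over a much larger space, which is what makes tuning tractable on slow embedded targets.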

Real-time Demo

We built the KWS DS-CNN model to run on the Cortex-A7. To create an end-to-end demo on Azure Sphere, we implement MFCC feature extraction on the Cortex-M4. This pre-processing application is a partner application that runs on Azure Sphere alongside the TVM runtime. Here is the partner app and how to deploy it on the Cortex-M4. If you want to follow along, once you have deployed this app, follow the instructions here to deploy the KWS TVM runtime on your Cortex-A7.

Below is a short video of the system in action: https://youtu.be/-UQSqQyqcoo

Things to look forward to

As we mentioned in our µTVM post, there is still room to improve the developer experience in this area. Work is already underway in µTVM to bring improved out-of-the-box support for devices like Azure Sphere.

Azure Sphere has the potential to run machine learning models using TVM directly on the Cortex-M4. To support even larger models on the Cortex-M4, the µTVM roadmap explores placing model weights in flash as constant arrays. This will let us fit larger models on both the Cortex-A7 and Cortex-M4 cores.
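One way to picture the flash-placement idea: serialize the weights as a C `const` array at build time, so the toolchain places them in read-only flash (`.rodata`) instead of scarce RAM. A hypothetical generator sketch, not µTVM's actual mechanism:

```python
def weights_to_c_array(name, values):
    """Emit a C 'const' array definition; 'const' lets the linker
    place the data in read-only flash rather than RAM."""
    body = ", ".join(str(v) for v in values)
    return f"const signed char {name}[{len(values)}] = {{{body}}};\n"

# hypothetical int8 weights for one layer
header = weights_to_c_array("kws_conv0_weights", [-3, 0, 5, 127, -128])
print(header)  # const signed char kws_conv0_weights[5] = {-3, 0, 5, 127, -128};
```

The runtime then reads weights directly from flash at inference time, freeing RAM for activations.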

Auto-quantization enhancements are underway to quantize all network parameters to int8 without significantly affecting classification accuracy. This should enable further parameter size reduction, to roughly 3x overall.

Finally, Ansor (a.k.a. AutoTVM 2.0) is making its way into Apache TVM over the summer, delivering faster autotuning and much improved performance thanks to a more powerful automated optimization search.

Conclusion

We show that TVM delivers ease of deployment for application developers and ML practitioners who want to run ML on cloud-connected IoT devices like Azure Sphere. We used TVM's broad model import, quantization, and autotuning capabilities to take an off-the-shelf Keyword Spotting model from TensorFlow and deploy it on the Cortex-A7 core of an Azure Sphere.

If you are looking to learn more about deploying ML to the edge, or are interested in making edge deployment even easier with our upcoming automated platform, sign up here. Also feel free to reach out to us directly at info@octoml.ai if you have business interests or applications related to this work.

Edit log:
July 23rd 2020: we corrected the RAM budget of the Cortex-M4F cores to reflect that TCM adds 192KB of additional RAM per core.
