In an earlier era, deploying ML systems was not exactly easy, but at least the problem was contained. Engineers could focus on a few important models, all of which built on similar libraries and tools. And the host systems were typically commodity datacenter servers with one of a small set of GPUs.
That era of uniformity is gone. Everything in ML today is wildly heterogeneous, from the models themselves to the silicon they run on:
- Workloads are heterogeneous because of the exploding diversity in model architectures. It’s not all about CNNs anymore, and arguably never was: recommenders, language models, attention, and countless other mechanisms are all popular and all demand different system resources.
- Hardware is growing more heterogeneous because of challengers to mainstream CPU + GPU systems. Google’s TPU is the highest-profile example, but efficient special-purpose ML accelerators are already widespread in mobile devices, and FPGAs are becoming a viable option. While no alternative may ever completely replace CPUs or GPUs, the alternatives will only grow in diversity and efficiency. And every new hardware platform comes with its own software stack.
- Deployment scenarios are diversifying because of the sheer number of applications for modern ML beyond typical datacenter settings. Unconventional systems from edge infrastructure to tiny embedded devices all demand state-of-the-art ML capabilities.
Each dimension alone has made ML engineering harder, but their combination threatens to make it intractable. Multiply M models by H hardware accelerators and D deployment targets: building, testing, and shipping M×H×D individual configurations can quickly explode into overwhelming complexity.
Modularity is the key to managing this combinatorial complexity explosion. ML frameworks and tools need interfaces and abstraction layers that let users plug in their own capabilities without cross-cutting changes.
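To make the arithmetic concrete, here is a minimal Python sketch. The model, hardware, and deployment names are hypothetical placeholders, and the additive count is an idealization that assumes each dimension plugs in through a clean interface with no cross-cutting changes.

```python
from itertools import product

# Hypothetical inventories, chosen only for illustration.
models = ["cnn", "recommender", "language_model"]
hardware = ["cpu", "gpu", "tpu", "fpga"]
deployments = ["datacenter", "edge", "embedded"]

# Without modularity, every (model, hardware, deployment) triple is its own
# configuration to build, test, and ship: M x H x D of them.
configurations = list(product(models, hardware, deployments))

# With clean interfaces between the dimensions, the pieces to maintain
# ideally grow additively instead: M + H + D.
components = len(models) + len(hardware) + len(deployments)

print(len(configurations))  # 36
print(components)           # 10
```

Adding one more accelerator to this toy inventory adds nine configurations to the product but only one component to the sum.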
Operators are the most basic form of modularity in ML. Modern frameworks build models out of elementary tensor operators like generalized matrix multiplication or normalization. Each operator has handcrafted implementations tuned to run efficiently on specific hardware. In most frameworks, these basic operators are baked into the software stack—adding a new one requires forking and recompiling the framework itself.
Custom operators can let ML systems fluidly take advantage of new hardware that excels at a specific task. They can also let models experiment with new computational components that are not yet popular enough to appear in mainstream frameworks. TVM’s TIR, for example, can import external functions and incorporate them into complete model implementations. New operators that plug into TVM coexist with its built-in operators, compiler pipeline, and auto-tuning, so engineers can extend the framework’s functionality incrementally without sacrificing its ability to optimize.
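The idea of pluggable operators can be sketched in plain Python. This is a toy illustration, not TVM's API: `OPERATORS`, `register_op`, and `run_model` are hypothetical names invented for this example, and tensors are stood in for by lists of floats.

```python
from typing import Callable, Dict, List

Tensor = List[float]  # stand-in for a real tensor type

# Hypothetical operator registry: maps operator names to implementations.
OPERATORS: Dict[str, Callable[[Tensor], Tensor]] = {}


def register_op(name: str):
    """Decorator that adds an operator without editing the toy framework."""
    def decorator(fn: Callable[[Tensor], Tensor]):
        OPERATORS[name] = fn
        return fn
    return decorator


# A "built-in" operator shipped with the toy framework.
@register_op("relu")
def relu(xs: Tensor) -> Tensor:
    return [max(0.0, x) for x in xs]


# A custom operator plugged in by a user, with no fork or rebuild.
@register_op("leaky_relu")
def leaky_relu(xs: Tensor, slope: float = 0.1) -> Tensor:
    return [x if x > 0 else slope * x for x in xs]


def run_model(op_names: List[str], xs: Tensor) -> Tensor:
    """Apply a sequence of registered operators to the input."""
    for name in op_names:
        xs = OPERATORS[name](xs)
    return xs


builtin_result = run_model(["relu"], [-2.0, 3.0])
custom_result = run_model(["leaky_relu"], [-2.0, 3.0])
```

The point of the design is that `run_model` never changes: built-in and user-supplied operators go through the same lookup.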
Plugging in individual operators works to incorporate special-purpose hardware with narrow functionality. But some hardware needs a more holistic view of entire ML model components. Think of reconfigurable hardware or general-purpose AI accelerators in mobile devices, for example: to reach peak performance, this kind of hardware needs to transform and tune ML computations to match its unique capabilities.
The key to exploiting new hardware platforms is modularity in the compiler backend. A good ML compiler toolchain offers generic capabilities to split and recombine operators, parameterize their implementations, specialize them for specific use cases, and tune them according to a target’s cost model. A modular ML compiler avoids tying these capabilities to any single target machine.
This decoupling lets new hardware platforms focus on their own unique performance concerns instead of recapitulating the generic optimizations common to all tensor computation engines. In TVM, platforms use the Bring Your Own Codegen (BYOC) interface to plug in new backends that automatically inherit all the optimization and tuning capabilities that exist higher in the software stack.
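A rough sketch of this decoupling, in plain Python rather than TVM's actual BYOC interface: a target-independent optimization runs once, and each backend only supplies code generation. The toy IR (a list of operator names), `register_backend`, `compile_for`, and the `my_accel` target are all hypothetical.

```python
from typing import Callable, Dict, List

Graph = List[str]  # toy IR: an ordered list of operator names


def fuse_matmul_add(graph: Graph) -> Graph:
    """Generic, target-independent optimization: fuse matmul followed by add."""
    out: Graph = []
    i = 0
    while i < len(graph):
        if graph[i] == "matmul" and i + 1 < len(graph) and graph[i + 1] == "add":
            out.append("matmul_add")
            i += 2
        else:
            out.append(graph[i])
            i += 1
    return out


# Hypothetical backend registry; each backend only knows how to emit code.
BACKENDS: Dict[str, Callable[[Graph], str]] = {}


def register_backend(name: str, codegen: Callable[[Graph], str]) -> None:
    BACKENDS[name] = codegen


def compile_for(target: str, graph: Graph) -> str:
    optimized = fuse_matmul_add(graph)  # shared by every backend
    return BACKENDS[target](optimized)  # target-specific code generation


register_backend("cpu", lambda g: "; ".join(f"cpu_{op}()" for op in g))
register_backend("my_accel", lambda g: "; ".join(f"accel.{op}" for op in g))

program = compile_for("my_accel", ["matmul", "add", "relu"])
print(program)  # accel.matmul_add; accel.relu
```

The new `my_accel` backend gets the fusion for free; its author wrote only the one-line code generator.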
In the world of traditional compilers, like LLVM and GCC, there is a sharp distinction between the people who use compilers and the people who write them. These compilers provide a standard set of optimizations that work well on average for most code, and typical developers never need to think about them.
In ML engineering, the distinction is blurrier. One-size-fits-all optimizations work well most of the time, but they cannot always anticipate how quickly ML models change, and standard optimizations can take time to catch up. The result is that, even on conventional hardware, ML engineers need to customize the way code gets generated for a given model.
In-house compiler optimizations are a critical tool for limiting the combinatorial ML complexity explosion. Engineers can use targeted optimizations to enhance the performance of an entire class of models without slowing down scientists’ development. TVM’s Relay IR makes it straightforward to plug custom compiler passes into its standard suite of general-purpose optimizations.
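As an illustration of the pattern only (this is not Relay's pass API), the sketch below appends an in-house pass to a standard pipeline over a toy IR. The pass names, the list-of-operator-names IR, and `run_pipeline` are assumptions made up for this example.

```python
from typing import Callable, List

Graph = List[str]  # toy IR: an ordered list of operator names
Pass = Callable[[Graph], Graph]


def remove_identity(graph: Graph) -> Graph:
    """Standard pass: drop no-op identity operators."""
    return [op for op in graph if op != "identity"]


# The general-purpose passes that ship with the toy compiler.
STANDARD_PASSES: List[Pass] = [remove_identity]


def collapse_repeated_relu(graph: Graph) -> Graph:
    """In-house pass: relu(relu(x)) == relu(x), so keep only one in a row."""
    out: Graph = []
    for op in graph:
        if op == "relu" and out and out[-1] == "relu":
            continue
        out.append(op)
    return out


def run_pipeline(graph: Graph, passes: List[Pass]) -> Graph:
    """Apply each pass in order, feeding the output of one into the next."""
    for p in passes:
        graph = p(graph)
    return graph


# The custom pass slots in after the standard ones without modifying them.
pipeline = STANDARD_PASSES + [collapse_repeated_relu]
optimized = run_pipeline(["matmul", "identity", "relu", "relu"], pipeline)
print(optimized)  # ['matmul', 'relu']
```

Because the custom pass runs after `remove_identity`, it also catches repeated operators that only become adjacent once the standard passes have cleaned up the graph.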
Modular Deployment Pipelines
Beyond the code itself, deploying an optimized ML model brings another layer of engineering challenges. Any nontrivial model entails much more than a set of parameters and some Python code: there are complex dependencies on specific library versions, custom data ingestion pipelines, and subtle interactions with the platform the model runs on. The result is an endless stream of surprises when transitioning models from development to each deployment scenario: each cloud, mobile device, and embedded system brings its own unique set of quirks.
Containerization revolutionized classic software deployment with sealed environments that transition smoothly from development to deployment without unpleasant surprises. Docker by itself, however, is not enough to solve the problems particular to ML deployment: it does not help with scaling model architectures up and down according to performance and accuracy demands or with managing each model’s associated data pipeline. Today’s containers also do not help with hardware portability: they encapsulate code for x86 or Arm, for example, but not both. Even within one ISA, containers cannot be optimized for specific microarchitectures the way that custom-compiled code can. Hardware-specific containers may not be so bad for covering the handful of mainstream CPU architectures, but AI deployment today entails portability to dozens of specialized hardware accelerators.
The era of heterogeneous ML needs an analog of OCI containers. ML model containers should provide a uniform interface to models that stays consistent across developers’ laptops, full-scale datacenter deployments, and low-power edge devices. A stable container interface would let ML engineers build generic deployment pipelines that are fully abstracted from the models they orchestrate. If ML containers bundle optimized code for many architectures together into a single unit, deployment infrastructure could make optimal choices about how to map the model onto an ensemble of hardware targets. A portable container abstraction, modernized for the ML era, could be the key to containing the combinatorial complexity explosion.
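One way to picture such a container is the speculative Python sketch below. `ModelContainer`, its methods, and the target names are hypothetical and do not correspond to any existing format; the per-target "builds" are ordinary Python functions standing in for separately compiled artifacts.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Tensor = List[float]  # stand-in for a real tensor type


@dataclass
class ModelContainer:
    """Hypothetical bundle: one model, many target-specific builds."""
    name: str
    builds: Dict[str, Callable[[Tensor], Tensor]] = field(default_factory=dict)

    def add_build(self, target: str, fn: Callable[[Tensor], Tensor]) -> None:
        self.builds[target] = fn

    def predict(self, inputs: Tensor, available: List[str]) -> Tuple[str, Tensor]:
        """Run on the first available target (in preference order) with a build."""
        for target in available:
            if target in self.builds:
                return target, self.builds[target](inputs)
        raise RuntimeError(f"{self.name}: no build for any of {available}")


def doubler(xs: Tensor) -> Tensor:
    """Placeholder model; real builds would differ in code, not in results."""
    return [2.0 * x for x in xs]


container = ModelContainer("toy_model")
container.add_build("x86_64", doubler)
container.add_build("arm64", doubler)

# The caller uses the same interface on every host; only `available` differs.
chosen, output = container.predict([1.0, 2.0], available=["npu", "arm64"])
```

Here the deployment pipeline never inspects the model itself. It reports which targets a host offers, and the container picks the build.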