Last week, The Information grabbed our attention with the following headline:
“AI Compute Shortage” (The Information)
Now that we've used it to grab your attention, let me tell you a secret: there is plenty of AI compute available 🤫
It is an objective fact that there are not enough top-of-the-line GPUs for everyone who wants them. But is that all the compute there is? Of course not. The shortage we’re witnessing is, in fact, a symptom of highly inefficient use of GPU compute, which cuts against the grain of how clouds operate and are consumed.
The fact is that AI/ML workloads are substantially different from the rest of the application software that runs in the cloud. But the true compute requirements of running models in production (or ‘inference’, as it is known in techie speak) aren’t well understood outside of a small group of ML experts. Some models and model configurations are compute hogs – others are not. In the absence of clear guidance, everyone developing AI applications is clamoring for the latest and greatest hardware (mostly NVIDIA A100s) when in many cases an older generation of highly available, cost-effective GPUs will work.
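To see why older GPUs often suffice, consider a rough sizing sketch. A common rule of thumb is that a dense transformer's forward pass costs about 2 × parameter-count FLOPs per token. The peak-FLOPS figures below are published FP16 specs for the T4 and A100; the 10% utilization factor is an assumption for illustration, and real throughput also depends heavily on memory bandwidth and batch size:

```python
# Back-of-envelope sizing: does a model need an A100, or will an older GPU do?
# Rule of thumb: a dense transformer forward pass costs ~2 * parameter_count
# FLOPs per generated token. The 10% utilization factor is an illustrative
# assumption, not a measured number.

def tokens_per_second(params: float, peak_tflops: float,
                      utilization: float = 0.10) -> float:
    """Estimate generation throughput for a dense transformer."""
    flops_per_token = 2 * params                    # ~2N FLOPs per token
    usable_flops = peak_tflops * 1e12 * utilization # achievable FLOP/s
    return usable_flops / flops_per_token

GPT_J_PARAMS = 6e9  # GPT-J has ~6 billion parameters

t4   = tokens_per_second(GPT_J_PARAMS, peak_tflops=65)   # NVIDIA T4, FP16 spec
a100 = tokens_per_second(GPT_J_PARAMS, peak_tflops=312)  # NVIDIA A100, FP16 spec

print(f"T4:   ~{t4:,.0f} tokens/s")
print(f"A100: ~{a100:,.0f} tokens/s")
```

Even at a conservative utilization, the older T4 delivers hundreds of tokens per second on a GPT-J-class model, which is ample for many applications; the A100's extra horsepower only pays off when you actually need it.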
To make this kind of insight available to everyone, we recently launched the OctoML Profiler, which helps people quickly evaluate the actual production compute requirements of ML models before building them into an AI app or service. In a few lines of code, developers can compare different models and model configurations for both cost and speed. The OctoML Profiler also provides built-in optimization and acceleration techniques to ensure the model runs as fast as possible, making it more likely to be compatible with an older generation of GPUs.
Efficient AI Compute for All
The inefficient use of AI-targeted compute is what has motivated OctoML to expand the vision of our platform. For a couple of years now, we have been working with global organizations like Microsoft and Toyota to deliver the right balance of cost-efficiency and user experience for their AI-powered apps and services. However, the explosive market growth driven by generative AI has opened up the world of AI development to a much broader audience. It is clear that there is a much broader opportunity to help customers cost-efficiently use GPUs, CPUs and AI accelerators, and to intelligently select the hardware they run on and migrate between options as their needs change.
Delivering a compute layer of this nature requires the ML systems, ML compilation/optimization, systems engineering and silicon-intrinsics expertise that is concentrated in a few large players like OpenAI, Meta, Amazon and Google. In contrast, we want to make sophisticated AI infrastructure available to everyone as a layer on top of their cloud service.
The first iteration of our work will include various flavors of NVIDIA compute. However, we are also gauging interest in other AI accelerators, like Inferentia from AWS and TPUs from Google Cloud, to determine the breadth of appetite for those. The challenge in adopting these compute options is that AI/ML models are often developed against a specific hardware choice, and AI/ML software is notoriously non-portable. Hardware portability is exactly what OctoML has spent years handling for our users, and we know that these solutions from the cloud providers themselves can be an important part of addressing the “server availability” issue.
Help us build the world's easiest, most flexible, most efficient AI compute layer! Sign up for early access and become part of our OctoPod: a group of insiders who get first looks at our new products and features and exclusive access to fun swag. Sign up here.
DIY Gen AI: Evaluating GPT-J Deployment Options
Model optimizations can save you millions of dollars over the life of your application. Additional savings can be realized by finding the lowest cost hardware to run AI/ML workloads. Companies building AI-powered apps will need to do both if they want a fighting chance at building a sustainable business.
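The hardware-selection savings come down to simple arithmetic: cost per request is latency times instance price. The prices and latencies below are made-up placeholders purely for illustration, not measured numbers; substitute your own profiler results and cloud pricing:

```python
# Hypothetical hardware cost comparison. The latencies and hourly prices
# below are placeholder assumptions for illustration only; plug in your
# own measured latencies and your cloud provider's pricing.

def cost_per_million(latency_s: float, hourly_price: float) -> float:
    """Cost of serving one million sequential requests on one instance."""
    hours = latency_s * 1_000_000 / 3600
    return hours * hourly_price

# Assumed scenario: the newer GPU is 2x faster but ~6x pricier per hour.
newer = cost_per_million(latency_s=0.05, hourly_price=3.00)  # newer-GPU class
older = cost_per_million(latency_s=0.10, hourly_price=0.50)  # older-GPU class

print(f"newer GPU: ${newer:,.2f} per 1M requests")
print(f"older GPU: ${older:,.2f} per 1M requests")
```

Under these assumed numbers, the older GPU serves the same million requests for roughly a third of the cost, despite doubling per-request latency; whether that trade-off is acceptable depends on your application's latency budget.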
OctoML Profiler Provides Deep Intelligence for ML Models in PyTorch 2.0
With only three lines of code, see whether your application can be accelerated on various targets. Examine the balance between accuracy, performance, and cost. And all from your local developer machine.