Explore the latest OctoML product updates, TVM updates, and overall machine learning news.
OctoML drives down production AI inference costs at Microsoft through new integration with ONNX Runtime ecosystem
Today, we’re excited to announce the results from the second phase of our partnership with Microsoft, which continues to drive down AI inference compute costs by reducing inference latencies. Over the past year, OctoML engineers worked closely with Watch For to design and implement the TVM Execution Provider (EP) for ONNX Runtime, bringing the model optimization potential of Apache TVM to all ONNX Runtime users. This builds upon the collaboration we began in 2021 to bring the benefits of TVM’s code generation and flexible quantization support to production scale at Microsoft.
DIY Gen AI: Evaluating GPT-J Deployment Options
Model optimizations can save you millions of dollars over the life of your application. Additional savings can be realized by finding the lowest-cost hardware to run AI/ML workloads. Companies building AI-powered apps will need to do both if they want a fighting chance at building a sustainable business.