Here at OctoML, we are laser-focused on making the best, most performant generative AI models available to developers. Part of how we do this is by staying engaged with leading-edge research in the field and keeping a pulse on the latest model innovations in the open source software (OSS) community. The team continuously reviews and evaluates models for production use cases, and incorporates the best ones into the OctoAI library. One of the models we have been focused on is Whisper for audio transcription. Today, we’re excited to share that we have added WhisperX to OctoAI. It brings multiple functionality and performance enhancements over Whisper large and Whisper large v2, and runs at 6x lower cost (compared to Whisper large v2 on OpenAI).
From Audrey to Whisper - 70 years of speech recognition evolution
Audio transcription and Automatic Speech Recognition (ASR) have been topics of interest for applied automation from the early days of computers. ASR has come a long way from its start as the 6-foot “Audrey” Automatic Digit Recognizer from Bell Labs in 1952. We’ve seen IBM’s Shoebox, the growth in popularity of Hidden Markov models (HMMs) for word extraction, Dragon Systems’ “NaturallySpeaking”, and of course, the launch of Google Voice Search, Apple Siri and Amazon Alexa. Most recently, OpenAI released Whisper in 2022. Speech recognition continues to be one of the most tangible and immediately applicable areas where AI can reduce manual effort, increase quality and coverage, and free up resources for other activities. Use cases where speech recognition is actively applied today include customer service, call center productivity, video captioning, medical transcription and many others. Builders in these areas are actively exploring ways to improve the quality and performance of their transcription.
Whisper and the resulting OSS innovation and expansion
Curating the best option for production use cases - WhisperX on OctoAI
A challenge we hear from developers building on these innovations is that having more options doesn't immediately result in increased productivity or value. The teams and developers building audio transcription applications are focused on solving customer problems. They typically lack the time and resources to thoroughly evaluate the quality and speed of every OSS model that is released. This is where OctoAI’s active research engagement in the space and evaluation of the latest OSS models is valuable.
OctoAI’s Whisper research team shortlisted several options, including OpenAI’s Whisper large-v2 and WhisperX, based on internally assessed scores for quality, infrastructure needs, API surface and capabilities, and developer adoption. WhisperX was published by a group of researchers from the University of Oxford in March 2023. It attempts to address a number of the limitations in Whisper, including better support for long audio transcription and the addition of word-level timestamps. In our internal evaluations, WhisperX on OctoAI performed better than the alternatives along multiple dimensions, including ease of direct application for a broad range of use cases, accuracy of transcription, and speed of transcription. It also delivers a 5x cost savings over the equivalent from Deepgram, and over 6x cost savings over OpenAI.
Run WhisperX on OctoAI today
Get started with WhisperX today with a free trial on OctoAI. You’re also welcome to join us on our Discord server to engage with the team and our community.
We look forward to hearing from you!