What is WhisperX?

from octoai.client import Client
import base64

whisper_url = "https://your-whisper-demo.octoai.run/predict"
whisper_health_check = "https://your-whisper-demo.octoai.run/healthcheck"

# Read the audio file and convert it to a base64 string.
file_path = "your-sample-audio.wav"
with open(file_path, "rb") as f:
    encoded_audio = base64.b64encode(f.read())
    base64_string = encoded_audio.decode("utf-8")

# Inputs sent to the endpoint, including the base64-encoded audio.
inputs = {
    "language": "en",
    "task": "transcribe",
    "audio": base64_string,
}

OCTOAI_TOKEN = "API Token goes here from guide on creating OctoAI API token"
client = Client(token=OCTOAI_TOKEN)

# Run inference only if the endpoint reports healthy.
if client.health_check(whisper_health_check) == 200:
    outputs = client.infer(endpoint_url=whisper_url, inputs=inputs)
    transcription = outputs["transcription"]
    assert "She sells seashells by the seashore" in transcription
    assert (
        "She sells seashells by the seashore"
        in outputs["response"]["segments"][0]["text"]
    )
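
The assertions in the example read only the first segment. As a minimal sketch, assuming the response keeps the `response` / `segments` / `text` shape used there, the text of every segment could be collected as follows. The helper name `join_segment_text` and the sample payload are hypothetical and not part of the OctoAI SDK.

```python
def join_segment_text(outputs: dict) -> str:
    """Concatenate the text of every segment in a Whisper-style response."""
    segments = outputs.get("response", {}).get("segments", [])
    # Strip each piece so segment boundaries do not produce double spaces.
    return " ".join(seg["text"].strip() for seg in segments if seg.get("text"))


# Hypothetical payload mirroring the keys read by the assertions.
sample_outputs = {
    "transcription": "She sells seashells by the seashore.",
    "response": {
        "segments": [
            {"text": " She sells seashells"},
            {"text": " by the seashore."},
        ]
    },
}

print(join_segment_text(sample_outputs))
```

Using `.get` with defaults means a response without segments yields an empty string rather than raising a `KeyError`.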
Fine tuning supported
Customize Whisper with your own audio data to create specialized speech models for use cases such as medical or legal conversation and dictation. We are currently working with design partners on this feature. Contact us to get early access.
- Recognize new terms
- Learn new dialects and accents
- Make legal or healthcare specific models
- Human to AI conversation

Speech recognition at your need for speed
Recognize speech at a fast human cadence on faster GPU hardware, or run slower for historical batch processing at 6x lower cost, with WhisperX on OctoAI.
- Times are cumulative, so diarization includes transcription and alignment
- Audio used for timing
- Benchmark run on A10G
WhisperX transcription on A10G results

transcribe: 1 seconds in 0.37253355979919434
transcribe: 20 seconds in 1.2478430271148682
transcribe: 60 seconds in 1.8664119243621826
transcribe: 300 seconds in 4.814593553543091
transcribe: 1200 seconds in 34.68618607521057
transcribe: 3600 seconds in 42.72543740272522
transcribe: 7200 seconds in 87.29967927932739
transcribe: 14400 seconds in 178.04800415039062
transcribe: 28800 seconds in 344.45165491104126
transcribe: 57600 seconds in 484.6209282875061
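
To read these results as a speedup over real time, divide the audio duration by the processing time. The sketch below parses the benchmark lines and prints that ratio. The function name `speedups` and the assumption that the second number is wall-clock seconds are illustrative, not part of the benchmark tooling.

```python
import re

# Benchmark lines copied from the results listed in this section.
LOG = """\
transcribe: 1 seconds in 0.37253355979919434
transcribe: 20 seconds in 1.2478430271148682
transcribe: 60 seconds in 1.8664119243621826
transcribe: 300 seconds in 4.814593553543091
transcribe: 1200 seconds in 34.68618607521057
transcribe: 3600 seconds in 42.72543740272522
transcribe: 7200 seconds in 87.29967927932739
transcribe: 14400 seconds in 178.04800415039062
transcribe: 28800 seconds in 344.45165491104126
transcribe: 57600 seconds in 484.6209282875061
"""

PATTERN = re.compile(r"transcribe: (\d+) seconds in ([\d.]+)")


def speedups(log: str) -> list[tuple[int, float, float]]:
    """Return (audio seconds, processing seconds, speedup) per benchmark line."""
    rows = []
    for audio_s, wall_s in PATTERN.findall(log):
        audio, wall = int(audio_s), float(wall_s)
        rows.append((audio, wall, audio / wall))
    return rows


for audio, wall, factor in speedups(LOG):
    print(f"{audio:>6} s audio in {wall:7.2f} s -> {factor:6.1f}x real time")
```

For example, the one-hour file (3600 seconds) works out to roughly 84x real time under this reading.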
Stage | Time (seconds) | Acceleration |
---|---|---|
Audio File | 5843 | |
Transcription (1) | 96 | 61x faster |
Alignment (2) | 191 | 36x faster |
Diarization (3) | 308 | 18x faster |
Transcription converts speech to text.
Alignment generates the start and stop time for each word in the transcript. This time includes both transcription and alignment.
Diarization identifies the speaker. This time includes transcription, alignment, and diarization.
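
Because the reported times are cumulative, the time attributable to each stage alone is the difference between consecutive rows. A small sketch of that arithmetic, using the figures from the table (the names `cumulative` and `incremental` are illustrative):

```python
# Cumulative processing times in seconds, taken from the table.
cumulative = {
    "Transcription": 96,
    "Alignment": 191,
    "Diarization": 308,
}


def incremental(times: dict[str, int]) -> dict[str, int]:
    """Convert cumulative stage times into per-stage increments."""
    result = {}
    previous = 0
    for stage, total in times.items():
        result[stage] = total - previous
        previous = total
    return result


per_stage = incremental(cumulative)
print(per_stage)  # {'Transcription': 96, 'Alignment': 95, 'Diarization': 117}
```

Read this way, alignment adds about 95 seconds on top of transcription, and diarization adds about 117 seconds on top of alignment.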
Deploy with ease for web or mobile apps
We built a demo app with React to show how easy it is to use Whisper on OctoAI in your web app. Check out the app. Note that you might experience a cold start, in which case transcription might take a bit longer.
Whisper on OctoAI features
Features | OctoAI |
---|---|
High quality speech detection with reduced hallucinations | Yes, via WhisperX |
Word level timestamp accuracy for utterance-level detection | Yes, via WhisperX |
Trade speed for cost with fast detection for real-time applications or slow detection for batch processing | Yes, via WhisperX acceleration services |