Audio to Text

Whisper X (accelerated)

A general-purpose speech recognition model that turns spoken audio into text. It is trained on a large, diverse audio dataset and can perform multilingual transcription, speech translation, and language identification.

Advice from ML Experts

The hosted Whisper X model is a multi-task version: it can transcribe audio in its original language and, at the same time, translate it into a target language (here, English). The timestamps it generates can be used to build closed captions for your audio. In general, it is more efficient and more accurate to send a long audio clip in a single request than to split it into pieces and process them separately. For real-time transcription use cases, however, sending audio in 30-second clips gives more immediate feedback.
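For the real-time case, the 30-second clips can be cut locally before they are sent. The following is a minimal sketch using only Python's standard `wave` module; the function name `split_wav` and the output naming scheme are illustrative assumptions, not part of any OctoAI or Whisper API. It cuts at fixed frame boundaries, so a word may be split across two clips, which is one reason a single long request tends to transcribe more accurately.

```python
import wave
from pathlib import Path


def split_wav(path, out_dir, chunk_seconds=30):
    """Split a WAV file into consecutive clips of at most chunk_seconds each.

    Hypothetical helper for illustration: returns the list of clip paths
    in playback order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []

    with wave.open(str(path), "rb") as src:
        params = src.getparams()
        # Number of audio frames that make up one clip.
        frames_per_chunk = int(params.framerate * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # end of the source audio
            chunk_path = out_dir / f"{Path(path).stem}_{index:04d}.wav"
            with wave.open(str(chunk_path), "wb") as dst:
                # Copy the source format so each clip is a valid WAV file.
                dst.setnchannels(params.nchannels)
                dst.setsampwidth(params.sampwidth)
                dst.setframerate(params.framerate)
                dst.writeframes(frames)
            chunk_paths.append(chunk_path)
            index += 1

    return chunk_paths
```

Each returned clip can then be submitted in order, and the transcripts concatenated on the client side.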


Sample transcription of the default audio sample (uploads accept WAV files up to 10 MB):

The birch canoe slid on the smooth planks. Glued the sheet to the dark blue background. It is easy to tell the depth of a well. These days a chicken leg is a rare dish. Rice is often served in round bowls. The juice of lemons makes fine punch. The box was thrown beside the park truck. The hogs were fed chopped corn and garbage.
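Since the model's timestamps are meant to help with closed captions, here is a minimal sketch of turning timestamped segments into SubRip (SRT) caption text. The `(start, end, text)` tuple shape is an assumption about how you have arranged the response, not the documented output format, and the timestamps below are invented for illustration; only the sentences come from the sample transcript.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for number, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{number}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # A blank line separates consecutive caption blocks.
    return "\n".join(blocks)


# Hypothetical timestamps; the text is taken from the sample transcript.
segments = [
    (0.0, 2.8, "The birch canoe slid on the smooth planks."),
    (2.8, 5.6, "Glued the sheet to the dark blue background."),
]
print(segments_to_srt(segments))
```

The resulting text can be saved with a `.srt` extension and loaded by most video players.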

Run Time & Costs

Based on internal benchmarks using OctoAI’s A10g tier, running Whisper X on OctoAI for audio transcription costs one-sixth as much as running Whisper X on OpenAI. More detailed benchmarking of Whisper X on OctoAI is in progress, and performance data will be added here once available.

Model Information

  • Developed by: OpenAI

  • Model type: Automatic speech recognition (ASR) system

  • Language(s): Multilingual audio input; translation output in English

  • License: The code and the model weights of Whisper are released under the MIT License

  • Model Description: Model Card and Paper

  • Resources for more information: GitHub Repository and Blog

  • Cite as:

  @misc{radford2022robust,
    doi = {10.48550/ARXIV.2212.04356},
    author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
    keywords = {Audio and Speech Processing (eess.AS), Computation and Language (cs.CL), Machine Learning (cs.LG), Sound (cs.SD), FOS: Electrical engineering, electronic engineering, information engineering, FOS: Computer and information sciences},
    title = {Robust Speech Recognition via Large-Scale Weak Supervision},
    publisher = {arXiv},
    year = {2022},
    copyright = {arXiv.org perpetual, non-exclusive license}
  }