Whisper X (accelerated)
A general-purpose speech transcription model that turns spoken audio into text. It is trained on a large, diverse audio dataset and can perform multilingual speech transcription, speech translation, and language identification.
Advice from ML Experts
The hosted Whisper X model is a multi-task version: it can transcribe audio in its original language and also translate it into a target language (in this case, English) at the same time. The timestamps it generates can be used to help produce closed captions for your audio. In general, it is more efficient and accurate to send a long audio clip for processing in one request than to chop it into pieces and process them separately. However, for real-time transcription use cases, you will receive more immediate feedback by sending audio in 30-second clips.
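As a rough illustration of the closed-caption use mentioned above, the sketch below turns timestamped segments into SubRip (SRT) caption text. The segment shape, a `(start_seconds, end_seconds, text)` tuple, and the function names are assumptions made for this example; the hosted model's actual response format is not described on this page and may differ.

```python
from typing import Iterable, Tuple

# Hypothetical segment shape: (start_seconds, end_seconds, text).
# The real response format of the hosted model may differ.
Segment = Tuple[float, float, str]


def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[Segment]) -> str:
    """Build SRT caption text: numbered cues separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{format_srt_time(start)} --> {format_srt_time(end)}"
        blocks.append(f"{index}\n{timing}\n{text.strip()}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [(0.0, 2.5, " Hello "), (2.5, 4.0, "world")]
    print(segments_to_srt(demo))
```

Keeping the timestamp formatting in its own function makes it easy to swap in another caption format, such as WebVTT, which uses a period rather than a comma before the milliseconds.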
Input
Try our default audio sample or upload your own WAV file (max. 10MB).
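Before uploading, it can help to confirm locally that a file is a readable WAV and fits within the stated size limit. The following is a minimal sketch using only the Python standard library. The helper name `check_wav_upload` is hypothetical, and reading "10MB" as 10 × 1024 × 1024 bytes is an assumption, since the page does not define the unit precisely.

```python
import os
import wave

# Assumption: "10MB" is read as 10 * 1024 * 1024 bytes; the page does not say.
MAX_UPLOAD_BYTES = 10 * 1024 * 1024


def check_wav_upload(path: str) -> float:
    """Return the clip duration in seconds, or raise ValueError if unsuitable."""
    size = os.path.getsize(path)
    if size > MAX_UPLOAD_BYTES:
        raise ValueError(f"file is {size} bytes, over the {MAX_UPLOAD_BYTES}-byte limit")
    try:
        # wave.open parses the RIFF/WAVE header and fails on other formats.
        with wave.open(path, "rb") as wav:
            frames = wav.getnframes()
            rate = wav.getframerate()
    except (wave.Error, EOFError) as exc:
        raise ValueError(f"not a readable WAV file: {exc}") from exc
    if rate <= 0:
        raise ValueError("WAV header reports a non-positive sample rate")
    return frames / rate
```

Note that the standard `wave` module reads uncompressed PCM WAV files; other WAV encodings may be rejected by this check even if the service would accept them.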
Run Time & Costs
Running Whisper X on OctoAI for audio transcription costs 6x less than Whisper X on OpenAI, based on internal benchmarks using OctoAI’s A10g tier. More detailed benchmarking of Whisper X on OctoAI is in progress, and performance data will be updated here once available.
Model Information
Developed by: OpenAI
Model type: Automatic Speech Recognition System (ASR)
Language(s): English
License: The code and the model weights of Whisper are released under the MIT License
Model Description: Model Card and Paper
Resources for more information: GitHub Repository and Blog
Cite as:
@misc{https://doi.org/10.48550/arxiv.2212.04356,
doi = {10.48550/ARXIV.2212.04356},
url = {https://arxiv.org/abs/2212.04356},
author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
keywords = {Audio and Speech Processing (eess.AS), Computation and Language (cs.CL), Machine Learning (cs.LG), Sound (cs.SD), FOS: Electrical engineering, electronic engineering, information engineering, FOS: Computer and information sciences},
title = {Robust Speech Recognition via Large-Scale Weak Supervision},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}