Relevant if you build with AI tools, APIs, or coding agents.
Introducing Whisper
Quick editorial signal
- Track this as an OpenAI update, not just a standalone headline.
- Useful for builders who need to understand API, coding, or workflow changes.
- Likely worth revisiting after people have used the release in practice.
Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. We show that the use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English. We are open-sourcing models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing.
The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
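As a rough illustration of the chunking step described above, the sketch below splits a waveform into fixed 30-second windows. The 16 kHz sample rate and the zero-padding of the final chunk are assumptions made for this example, since the post does not state them, and `split_into_chunks` is a hypothetical helper rather than part of the released code. The log-Mel conversion is left out.

```python
from typing import List

SAMPLE_RATE = 16_000   # assumption: 16 kHz mono audio (not stated in the post)
CHUNK_SECONDS = 30     # from the post: input audio is split into 30-second chunks
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS


def split_into_chunks(samples: List[float]) -> List[List[float]]:
    """Split a waveform into 30-second chunks, zero-padding the last one.

    Hypothetical helper for illustration only.
    """
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = samples[start:start + CHUNK_SAMPLES]
        # Assumption: pad the final chunk with silence so all chunks share one length.
        chunk = chunk + [0.0] * (CHUNK_SAMPLES - len(chunk))
        chunks.append(chunk)
    return chunks


if __name__ == "__main__":
    audio = [0.1] * (SAMPLE_RATE * 70)       # 70 seconds of dummy audio
    chunks = split_into_chunks(audio)
    print(len(chunks), len(chunks[-1]))      # prints "3 480000"
```

Each fixed-length chunk would then be converted to a log-Mel spectrogram before reaching the encoder.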
Other existing approaches frequently use smaller, more closely paired audio-text training datasets [1, 2, 3] or use broad but unsupervised audio pretraining [4, 5, 6]. Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper’s zero-shot performance across many diverse datasets we find it is much more robust and makes 50% fewer errors than those models.
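To make the "50% fewer errors" comparison concrete, here is a minimal sketch of how such a figure could be computed. It assumes the error metric is word error rate (WER), which the post does not name, and the example numbers are made up for illustration rather than taken from any evaluation.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference length."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def relative_error_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(f"{wer('the cat sat on the mat', 'the cat sat on mat'):.3f}")  # prints 0.167
    # Hypothetical WERs: 20% for a baseline, 10% for a more robust model.
    print(f"{relative_error_reduction(0.20, 0.10):.0%}")                 # prints 50%
```

Under this reading, "50% fewer errors" means the error rate is halved relative to the baseline, not that it drops by 50 percentage points.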
About a third of Whisper’s audio dataset is non-English, and it is alternately given the task of transcribing in the original language or translating to English. We find this approach is particularly effective at learning speech-to-text translation and outperforms the supervised SOTA on CoVoST2 to-English translation in a zero-shot setting.
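The switch between transcribing and translating is driven by special tokens at the start of the decoder's output sequence. The sketch below shows one way such a task prefix could be assembled. The token spellings follow the format used in the Whisper paper and open-source tokenizer as best understood here, and `build_task_prefix` is a hypothetical helper, so treat the details as illustrative rather than as the released API.

```python
from typing import List

VALID_TASKS = ("transcribe", "translate")


def build_task_prefix(language: str, task: str, timestamps: bool = False) -> List[str]:
    """Assemble the special-token prefix that tells the decoder what to do.

    Hypothetical helper; the token spellings are an assumption based on the
    paper's description, not taken from this post.
    """
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}, got {task!r}")
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Ask the decoder to skip timestamp prediction.
        prefix.append("<|notimestamps|>")
    return prefix


if __name__ == "__main__":
    # French audio, transcribed in French.
    print(build_task_prefix("fr", "transcribe"))
    # The same French audio, translated to English text.
    print(build_task_prefix("fr", "translate"))
```

The same audio and the same model weights serve both calls. Only the task token changes, which is what allows a single model to be trained on both objectives.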
We hope Whisper’s high accuracy and ease of use will allow developers to add voice interfaces to a much wider set of applications. Check out the paper, model card, and code to learn more details and to try out Whisper.
References
1. Chan, W., Park, D., Lee, C., Zhang, Y., Le, Q., and Norouzi, M. SpeechStew: Simply mix all available speech recognition data to train one large neural network. arXiv preprint arXiv:2104.02133, 2021.
2. Galvez, D., Diamos, G., Torres, J. M. C., Achorn, K., Gopi, A., Kanter, D., Lam, M., Mazumder, M., and Reddi, V. J. The People’s Speech: A large-scale diverse English speech recognition dataset for commercial usage. arXiv preprint arXiv:2111.09344, 2021.
3. Chen, G., Chai, S., Wang, G., Du, J., Zhang, W.-Q., Weng, C., Su, D., Povey, D., Trmal, J., Zhang, J., et al. GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. arXiv preprint arXiv:2106.06909, 2021.
4. Baevski, A., Zhou, H., Mohamed, A., and Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477, 2020.
5. Baevski, A., Hsu, W.-N., Conneau, A., and Auli, M. Unsupervised speech recognition. Advances in Neural Information Processing Systems, 34:27826–27839, 2021.
6. Zhang, Y., Park, D. S., Han, W., Qin, J., Gulati, A., Shor, J., Jansen, A., Xu, Y., Huang, Y., Wang, S., et al. BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition. arXiv preprint arXiv:2109.13226, 2021.