OpenAI Article, 21 September 2022

Introducing Whisper

Read paper | View code | View model card


Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. We show that the use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English. We are open-sourcing models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing.

The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
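The front end described above (fixed 30-second chunks turned into a log-Mel spectrogram) can be sketched in plain NumPy. This is an illustrative reimplementation, not the released code; the parameter values (16 kHz audio, 80 mel channels, a 400-sample FFT window with a 160-sample hop) match those described for Whisper, but treat them as assumptions of this sketch.

```python
import numpy as np

SAMPLE_RATE = 16_000                # Whisper operates on 16 kHz audio
CHUNK_SECONDS = 30                  # fixed 30-second input windows
N_FFT, HOP, N_MELS = 400, 160, 80   # 25 ms window, 10 ms hop, 80 mel bands

def pad_or_trim(audio: np.ndarray,
                length: int = SAMPLE_RATE * CHUNK_SECONDS) -> np.ndarray:
    """Force a waveform to exactly one 30-second chunk."""
    if len(audio) >= length:
        return audio[:length]
    return np.pad(audio, (0, length - len(audio)))

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_mels=N_MELS, n_fft=N_FFT, sr=SAMPLE_RATE):
    """Triangular mel filters mapping FFT bins to mel bands."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):     # rising slope of the triangle
            if center > left:
                fb[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):    # falling slope of the triangle
            if right > center:
                fb[m - 1, k] = (right - k) / (right - center)
    return fb

def log_mel_spectrogram(audio: np.ndarray) -> np.ndarray:
    """Waveform -> (n_mels, n_frames) log-Mel features, the encoder's input."""
    audio = pad_or_trim(audio)
    window = np.hanning(N_FFT)
    frames = [audio[i:i + N_FFT] * window
              for i in range(0, len(audio) - N_FFT, HOP)]
    power = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2
    mel = power @ mel_filterbank().T
    return np.log10(np.maximum(mel, 1e-10)).T
```

Every input, however long or short, becomes an 80-channel spectrogram covering exactly 30 seconds, which is what lets the encoder operate on a fixed-size input.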

Other existing approaches frequently use smaller, more closely paired audio-text training datasets [1, 2, 3], or use broad but unsupervised audio pretraining [4, 5, 6]. Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper’s zero-shot performance across many diverse datasets we find it is much more robust and makes 50% fewer errors than those models.
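The "50% fewer errors" comparison is a relative reduction in word error rate (WER) averaged across datasets. A minimal sketch of how such a comparison is computed (this is standard methodology, not code from the Whisper release):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # classic dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1] / len(ref)

def relative_error_reduction(baseline_wers, model_wers):
    """Average per-dataset relative reduction in WER versus a baseline."""
    reductions = [(b - m) / b for b, m in zip(baseline_wers, model_wers)]
    return sum(reductions) / len(reductions)
```

For example, if a specialized model scores 10% and 20% WER on two held-out datasets while another scores 5% and 10%, the second makes 50% fewer errors on average even though both numbers vary by dataset.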

About a third of Whisper’s audio dataset is non-English, and it is alternately given the task of transcribing in the original language or translating to English. We find this approach is particularly effective at learning speech-to-text translation, and it outperforms the supervised SOTA on CoVoST2 to-English translation zero-shot.
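The choice between transcribing and translating is made by the special tokens fed to the decoder, so one model serves both tasks. A toy sketch of that conditioning prefix; the token strings follow the format described in the paper, but treat the exact names as an assumption of this illustration:

```python
def decoder_prompt(language: str, task: str,
                   timestamps: bool = True) -> list[str]:
    """Build the special-token prefix that tells the single model what to do.

    language: ISO code of the spoken language, e.g. "en" or "nl"
    task:     "transcribe" (same-language text) or "translate" (to English)
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens
```

Swapping `<|transcribe|>` for `<|translate|>` is the entire difference between producing Dutch text from Dutch audio and producing English text from the same audio.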

We hope Whisper’s high accuracy and ease of use will allow developers to add voice interfaces to a much wider set of applications. Check out the paper, model card, and code to learn more details and to try out Whisper.



References

1. Chan, W., Park, D., Lee, C., Zhang, Y., Le, Q., and Norouzi, M. SpeechStew: Simply mix all available speech recognition data to train one large neural network. arXiv preprint arXiv:2104.02133, 2021.

2. Galvez, D., Diamos, G., Torres, J. M. C., Achorn, K., Gopi, A., Kanter, D., Lam, M., Mazumder, M., and Reddi, V. J. The People’s Speech: A large-scale diverse English speech recognition dataset for commercial usage. arXiv preprint arXiv:2111.09344, 2021.

3. Chen, G., Chai, S., Wang, G., Du, J., Zhang, W.-Q., Weng, C., Su, D., Povey, D., Trmal, J., Zhang, J., et al. GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. arXiv preprint arXiv:2106.06909, 2021.

4. Baevski, A., Zhou, H., Mohamed, A., and Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477, 2020.

5. Baevski, A., Hsu, W.-N., Conneau, A., and Auli, M. Unsupervised speech recognition. Advances in Neural Information Processing Systems, 34:27826–27839, 2021.

6. Zhang, Y., Park, D. S., Han, W., Qin, J., Gulati, A., Shor, J., Jansen, A., Xu, Y., Huang, Y., Wang, S., et al. BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition. arXiv preprint arXiv:2109.13226, 2021.

