TL;DR
The tech industry is pouring investment into voice as the primary AI interface, with Meta, Google and Apple all making acquisitions or hires in the space. But persistent speech recognition problems, including OpenAI’s Whisper model randomly switching English to Welsh, show that building reliable voice AI is harder than it looks. Error rates are falling fast, but the stakes rise as voice moves into healthcare and autonomous vehicles.
The Welsh problem
OpenAI’s Whisper speech-to-text model has spent much of the past year occasionally translating English input into Welsh — not mishearing similar-sounding words, but actually translating them. The cause turned out to be mislabelled training data, a mundane weakness in a sophisticated system that took months to untangle. OpenAI says the latest model update should fix it.
The glitch illustrates a broader truth about conversational voice AI: anything less than perfect conditions (background noise, accents, overlapping speech, unusual requests) raises the chance of error. High-quality recorded datasets are scarcer than text ones, audio takes longer to process, and humans are acutely sensitive to missteps in speech. Even a few extra milliseconds of silence between speakers feels wrong.
Big bets on voice
None of this is dampening industry enthusiasm. Meta bought Play AI, a conversational voice model start-up, last summer. Google hired the founder of Hume, a start-up known for analysing vocal emotions. Apple acquired Q.ai, an Israeli company that tracks facial muscles during speech, enabling comprehension even when the speaker cannot be heard.
Jony Ive, the former iPhone designer, is building a mystery device with OpenAI that is expected to be audio-directed rather than touch-based. Sam Altman has described the device’s feel as “sitting in the most beautiful cabin by a lake.”
Looking forward
Speech recognition accuracy is improving rapidly. On the open-source automatic speech recognition leaderboard, OpenAI’s Whisper has a word error rate of 7.44%, down from more than 8% only months ago. Nvidia’s Canary model leads at 5.63%. But as voice AI moves from living rooms into operating theatres and moving vehicles, the tolerance for errors drops sharply. A chatbot switching languages on a sofa is annoying; the same glitch in an autonomous car at 70 mph is a different matter entirely.
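For readers wondering what a word error rate measures: it is conventionally the number of word substitutions, deletions and insertions needed to turn a system's transcript into the correct one, divided by the number of words in the correct transcript. The sketch below is illustrative only. The function name and sample sentences are made up for this example, and real leaderboards typically apply more text normalisation than the simple lowercasing assumed here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration; normalisation is limited to
    lowercasing and whitespace splitting (an assumption).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


# One substituted word and one dropped word out of six: 2 / 6.
wer = word_error_rate(
    "turn on the kitchen lights please",
    "turn on the chicken lights",
)
print(f"{wer:.2%}")
```

By this measure, a score of 7.44% means roughly seven or eight word-level errors for every 100 words spoken.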