Voice AI Is the Next Frontier Despite Speech Glitches

TL;DR

The tech industry is pouring investment into voice as the primary AI interface, with Meta, Google and Apple all making acquisitions or hires in the space. But persistent speech recognition problems, including OpenAI’s Whisper model randomly switching English to Welsh, show that building reliable voice AI is harder than it looks. Error rates are falling fast, but the stakes rise as voice moves into healthcare and autonomous vehicles.

The Welsh problem

For much of the past year, OpenAI’s Whisper speech-to-text model has occasionally translated English input into Welsh: not mishearing similar-sounding words, but actually translating them. The cause turned out to be mislabelled training data, a mundane weakness in a sophisticated system that took months to untangle. OpenAI says the latest model update should fix it.

The glitch illustrates a broader truth about conversational voice AI: anything less than perfect conditions (background noise, accents, overlapping speech, unusual requests) raises the chance of error. High-quality recorded datasets are scarcer than text ones, processing times are longer, and humans are acutely sensitive to missteps in speech. Even a few extra milliseconds of silence between speakers feels wrong.

Big bets on voice

None of this is dampening industry enthusiasm. Meta bought Play AI, a conversational voice model start-up, last summer. Google hired the founder of Hume, known for analysing vocal emotions. Apple acquired Q.ai, an Israeli company that tracks facial muscles during speech, enabling comprehension even when the speaker cannot be heard.

Jony Ive, the former iPhone designer, is building a mystery device with OpenAI that is expected to be audio-directed rather than touch-based. Sam Altman has described the device’s feel as “sitting in the most beautiful cabin by a lake.”

Looking forward

Speech recognition accuracy is improving rapidly. On the open-source automatic speech recognition leaderboard, OpenAI’s Whisper has a word error rate of 7.44%, down from more than 8% just months ago. Nvidia’s Canary model leads at 5.63%. But as voice AI moves from living rooms into operating theatres and moving vehicles, the tolerance for errors drops sharply. A chatbot switching languages on a sofa is annoying; the same glitch in an autonomous car at 70 mph is a different matter entirely.
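
For readers unfamiliar with the metric, word error rate is conventionally computed as the number of word-level substitutions, deletions and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. A minimal Python sketch of that calculation follows. The function name and sample phrases are illustrative assumptions, and the leaderboard's own scoring (including any text normalisation it applies) may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative sketch only: splits on whitespace and applies no
    casing or punctuation normalisation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word in transcript
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example: two of five reference words are transcribed wrongly.
    print(word_error_rate("turn on the kitchen lights", "turn on a kitchen light"))
```

By this definition, a word error rate of 7.44% corresponds to roughly seven or eight word-level errors per 100 reference words.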
