StutterGPT: Evaluating AI Speech Models with Stuttering
“I didn’t quite catch that, can you say that again?”
With each advancement in AI, an often-overlooked group stands to benefit immensely: people who stutter.
Stuttering, a speech disorder with strong genetic roots that affects roughly 1% of the population, poses unique challenges for automatic speech recognition (ASR) systems.
I know this because I am a person who stutters, and have been all my life.
For 80 million people who stutter (PWS) like me, interactions with early speech assistants like Alexa or Siri resulted in frustrating experiences, punctuated by the all-too-familiar refrain: “I didn’t quite catch that, can you say that again?”
In college, I spent a couple of years building mobile apps for speech therapy, years before LLMs proliferated. And while issues affecting 1 in 100 users might feel like an edge case, they are a daily reality for PWS.
Today, a full decade after building those speech apps, I work at ElevenLabs, a leading AI audio research and deployment company.
And with all the intervening progress in AI speech tech, I decided to evaluate how today’s leading AI models understand and generate stuttering.
Defining a stuttering benchmark
Stuttering comes in three primary forms: repetitions (“my my my”), blocks (“m…..”), and prolongations (“mmmmm”). These disfluencies disrupt the normal flow of speech, and each presents unique challenges for speech recognition.
There’s no inherent reason why AI models can’t be trained to understand stuttered speech accurately. Put very simply, ASR models work by extracting relevant features from audio waveforms and converting that signal into corresponding text.
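As a rough illustration of that pipeline, here is a minimal sketch using the open-source openai-whisper package; the file name is a placeholder, and this is not the exact setup used in the evaluation below.

```python
# pip install openai-whisper   (ffmpeg must also be installed)
import whisper

# Load a pretrained checkpoint; "base" is small and fast, larger checkpoints are more accurate.
model = whisper.load_model("base")

# Internally, Whisper computes log-Mel spectrogram features from the waveform
# and decodes them into text with an encoder-decoder transformer.
result = model.transcribe("stutter_sample.wav")  # placeholder file name

print(result["text"])
```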
I created my own evaluation to benchmark ASR on stuttered speech. Taking inspiration from evaluation frameworks that measure models against human-level performance (MMLU, HumanEval, ARC), I scored the word accuracy rate (WAR) of each speech-to-text transcription of a stuttered audio sample against the speaker’s intended words.
For these tests, I used nine audio samples of stuttering, three of them recordings of my own speech, each ~90 seconds long and with varying levels of stuttering severity (mild, moderate, and severe).
I then ran each audio sample through three leading speech-to-text models: AssemblyAI, OpenAI Whisper, and Deepgram.
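Scoring is then a matter of comparing each transcript against the reference text. Here is a minimal sketch of that step using the jiwer package, with hypothetical strings standing in for a real transcript and reference; it is not necessarily the exact scoring code behind the numbers below.

```python
# pip install jiwer
import string

import jiwer


def normalize(text: str) -> str:
    """Lowercase and strip punctuation so formatting differences don't count as errors."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


# Hypothetical example: the speaker's intended words vs. one model's transcript.
reference = "Today I want to talk about my experience with stuttering."
hypothesis = "Today I want to to talk about my my experience with tutoring."

# Word accuracy is one minus the word error rate.
wer = jiwer.wer(normalize(reference), normalize(hypothesis))
print(f"Word accuracy: {1.0 - wer:.1%}")
```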
Here’s a sample you can listen to using AssemblyAI’s Playground: severe stuttering sample.
Evaluating the results
The findings were remarkable: each model achieved greater than 90% word accuracy when averaged across all samples.
Though presumably trained largely on fluent speech, these models transcribed stuttered speech at levels comparable to their published accuracy benchmarks for general speech.
- AssemblyAI’s “LeMUR Best” model led with 97.3% average accuracy
- Deepgram’s “Nova-2” achieved 94.3%
- OpenAI’s “Whisper Large v3” model reached 94.0%
These results are particularly impressive considering that stuttering occurred in nearly one in four words in the test samples (~75% fluency rate).
No two stutters are the same. The models had greater success transcribing stuttering with blocks (where no audio is produced), even when those instances of stuttering were objectively more severe and less fluent.
Each AI model demonstrated unique strengths and weaknesses:
- AssemblyAI excelled at deciphering repetitions (“my my my” → “my”), removing filler words (“uh my um my like”) common in stuttered speech, and determining appropriate punctuation.
- OpenAI’s Whisper model was adept at removing filler words but struggled with sentence-end punctuation.
- Deepgram showed higher sensitivity to repetitions (“my my my” → “my my my”) and made more basic transcription errors during stuttered speech (“in stuttering” → “in tutoring”); a naive repetition-collapsing sketch follows this list.
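For outputs that keep the repetitions verbatim, a naive post-processing pass can collapse immediate word repetitions. The sketch below is purely illustrative and is not how any of these providers handle disfluencies internally.

```python
def collapse_repetitions(text: str) -> str:
    """Collapse immediate word repetitions, e.g. 'my my my dog' -> 'my dog'.

    Deliberately naive: it also removes intentional repetitions
    ('very very good'), which a real system would need to preserve.
    """
    collapsed = []
    for word in text.split():
        if not collapsed or word.lower() != collapsed[-1].lower():
            collapsed.append(word)
    return " ".join(collapsed)


print(collapse_repetitions("my my my name is is John"))  # -> "my name is John"
```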
Generating stuttered speech
If AI models can interpret stuttered speech, how well can they generate stuttering voices?
For this question, I turned to audio generation models. ElevenLabs has proven highly capable at increasing accessibility for people who have lost their voices, including a US Congresswoman and a CEO.
To test its stuttering capabilities, I first uploaded one of my disfluent audio samples to create a new AI voice. ElevenLabs flawlessly generated a fluent AI voice clone from my disfluent input. You can listen to my AI voice here on my personal site.
ElevenLabs AI voice generation could filter out stuttering in new AI voices, but could it add it back in?
It can. I used ElevenLabs’ Voice Changer speech-to-speech (STS) feature to re-introduce stuttering: when I fed it a recording of my stuttered speech, it rendered a stuttered output while preserving the sound and style of the cloned voice.
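For reference, here is a rough sketch of how that conversion can be driven through ElevenLabs’ public speech-to-speech REST endpoint. The API key, voice ID, and file names are placeholders, and the current API documentation should be checked for exact parameters and model options.

```python
# pip install requests
import requests

API_KEY = "YOUR_ELEVENLABS_API_KEY"   # placeholder
VOICE_ID = "YOUR_CLONED_VOICE_ID"     # placeholder: the fluent clone created earlier

url = f"https://api.elevenlabs.io/v1/speech-to-speech/{VOICE_ID}"

# Send a stuttered recording; the converted audio keeps the input's cadence
# (including the stutters) but speaks in the cloned voice.
with open("stuttered_input.mp3", "rb") as audio_file:  # placeholder recording
    response = requests.post(
        url,
        headers={"xi-api-key": API_KEY},
        files={"audio": audio_file},
    )
response.raise_for_status()

with open("stuttered_output.mp3", "wb") as out:
    out.write(response.content)
```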
So what’s the big deal?
These AI advancements are incredibly promising. Imagine how fine-tuning each model with data from PWS could yield even more impressive results.
Work in this direction is already underway: Apple’s Machine Learning Research team recently published findings on stuttering with Siri. With explicit fine-tuning for PWS, Apple researchers decreased instances of cutting people who stutter off mid-sentence by up to 79% while reducing Siri’s word error rate from 25% to just 10%.
A key finding from Apple’s research was that increasing the endpointing threshold (how long the system waits after speech stops before deciding the speaker has finished) vastly improved model performance for stuttered speech, because it gives PWS more time to finish their thoughts.
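To make that concrete, here is a toy sketch of pause-based endpointing (not Apple’s implementation): the system decides the speaker has finished once silence exceeds a threshold, so a longer threshold lets someone who blocks mid-sentence resume without being cut off.

```python
from dataclasses import dataclass


@dataclass
class Endpointer:
    """Toy pause-based endpoint detector over fixed-size audio frames."""
    silence_threshold_s: float   # how long a pause must last to end the utterance
    frame_s: float = 0.02        # 20 ms frames
    _silence_run_s: float = 0.0

    def process_frame(self, is_speech: bool) -> bool:
        """Return True once the utterance is considered finished."""
        if is_speech:
            self._silence_run_s = 0.0
            return False
        self._silence_run_s += self.frame_s
        return self._silence_run_s >= self.silence_threshold_s


# Speech, then a 1.5 s block mid-sentence, then more speech.
frames = [True] * 50 + [False] * 75 + [True] * 50

for threshold in (1.0, 2.0):
    ep = Endpointer(silence_threshold_s=threshold)
    cut_off = any(ep.process_frame(f) for f in frames)
    print(f"threshold={threshold:.1f}s -> cut off mid-sentence: {cut_off}")
```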
As AI models progress, an important question emerges: if we can program AI to be a more patient listener for people who stutter, can we as humans learn to do the same?
Thanks for reading. I’d love to hear your thoughts and feedback on LinkedIn.