StutterGPT: Evaluating AI Speech Models with Stuttering
“I didn’t quite catch that, can you say that again?”
With each advancement in AI, an often-overlooked group stands to benefit immensely: people who stutter.
Stuttering, a genetic speech disorder that affects ~1% of the population, poses unique challenges for automatic speech recognition (ASR) systems.
I know this because I have stuttered my entire life.
For 80 million people who stutter (PWS) like me, interactions with early speech assistants like Alexa or Siri resulted in frustrating experiences, punctuated by the all-too-familiar refrain: “I didn’t quite catch that, can you say that again?”
In college, years before LLMs proliferated, I spent a couple of years building mobile apps for speech therapy. And while issues affecting 1 in 100 users might feel like an edge case, they are a daily reality for PWS.
Today, a full decade after building those speech apps, I work at ElevenLabs, a leading AI audio research and deployment company.
And with all the intervening progress in AI speech tech, I wanted to evaluate how today’s leading AI models understand stuttering.
Defining a stuttering benchmark
Stuttering comes in three main forms: repetitions (“my my my”), blocks (“m…..”), and prolongations (“mmmmm”). These varied disfluencies disrupt the normal flow of speech, and each presents unique challenges for systems trying to interpret it.
There’s no inherent reason why AI models couldn’t be trained to understand stuttered speech accurately. Put very simply, ASR models work by extracting relevant features from the audio waveform and converting that signal into corresponding text.
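To make “features” a little more concrete, here is a minimal sketch of the kind of time-frequency representation (a log-mel spectrogram) that most modern ASR models consume before decoding it into text. The file name is a placeholder, and the exact front end varies from model to model.

```python
# pip install librosa
import librosa

# Load an audio clip ("speech_sample.wav" is a placeholder) and compute
# log-mel features, a common input representation for ASR models.
waveform, sample_rate = librosa.load("speech_sample.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=waveform, sr=sample_rate, n_mels=80)
log_mel = librosa.power_to_db(mel)

print(log_mel.shape)  # (80 mel bands, number of time frames)
```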
So I first created my own evaluation to benchmark ASR on stuttered speech. Taking inspiration from evaluation frameworks that test models against human-level performance (MMLU, HumanEval, ARC), I scored the word accuracy rate (WAR) of each model’s speech-to-text transcription against the speaker’s intended words.
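If you want to compute a similar score yourself, here is a minimal sketch: word accuracy as one minus word error rate, computed with a standard edit-distance alignment over words. The tokenization and normalization choices below are illustrative assumptions, not the exact scoring script behind the numbers in this post.

```python
import re

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length."""
    def tokenize(text):
        # Lowercase and strip punctuation so "My, my" matches "my my"
        return re.findall(r"[a-z']+", text.lower())

    ref, hyp = tokenize(reference), tokenize(hypothesis)
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def word_accuracy_rate(reference: str, hypothesis: str) -> float:
    """WAR = 1 - WER, floored at zero."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))

print(word_accuracy_rate("it was totally overwhelming",
                         "it was t- totally overwhelming"))
```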
For these tests, I used 9 audio samples of stuttering, including three of my own speech, each ~90 seconds in length with varying levels of stuttering severity (mild, moderate, and severe).
I then ran each audio sample through leading speech-to-text models: ElevenLabs Scribe, AssemblyAI, OpenAI Whisper, and Deepgram.
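As one example of what a single run looks like, here is a minimal sketch using the open-source Whisper package; the other providers are called through their own SDKs or REST APIs in much the same way. The file name is a placeholder for one of the stuttering recordings.

```python
# pip install -U openai-whisper
import whisper

# Load the large-v3 checkpoint used in this comparison
model = whisper.load_model("large-v3")

# "severe_sample.wav" is a placeholder for one of the stuttering recordings
result = model.transcribe("severe_sample.wav")
print(result["text"])
```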
Here’s a sample you can listen to using AssemblyAI’s Playground: severe stuttering sample.
Evaluating the results
Results originally published in November 2024; updated for ElevenLabs Scribe in March 2025.
The findings were remarkable. Each model achieved accuracy rates above 94%, averaged across all stuttering samples.
Despite presumably being trained mostly on fluent speech, these models had little difficulty transcribing stuttering, performing at levels comparable to their published WAR benchmarks for speech in general.
- ElevenLabs Scribe v1 scored highest with 98.7% average accuracy
- AssemblyAI’s LeMUR Best model scored 97.3%
- Deepgram Nova-2 scored 94.3%
- OpenAI’s Whisper Large v3 model scored 94.0%
These results are particularly impressive considering that stuttering occurred in nearly one in four words in the test samples (~75% fluency).
Yet no two stutters are the same: the models had greater success transcribing stuttering dominated by blocks (where no audio is produced), even though those instances of stuttering were objectively more severe and less fluent.
Each AI model demonstrated unique strengths and weaknesses:
- ElevenLabs Scribe was the best overall: it identified stuttering (“-i -i is”), diarized multiple speakers (“Speaker 1”, “Speaker 2”), and even annotated inaudible moments of stuttering (“[Stuttering] It was totally overw- w- w- whelming”). While the most accurate, Scribe was also the most literal, transcribing every stuttered word rather than collapsing repetitions into a single word.
- AssemblyAI excelled at collapsing repetitions (“my my my” → “my”), removing the filler words (“uh my um my like”) common in stuttered speech, and choosing appropriate punctuation.
- OpenAI’s Whisper model was adept at removing filler words, but struggled with sentence-end punctuation, causing run-on sentences.
- Deepgram showed higher sensitivity to repetitions (“my my my” → “my my my”) and made more basic transcription errors during stuttered speech (“in stuttering” → “in tutoring”).
Generating stuttered speech
If AI models can interpret stuttered speech, how well can they generate stuttering voices?
For this question, I turned to audio generation models. ElevenLabs has proven highly capable in increasing accessibility for those who have lost their voices, including a US Congresswoman and a CEO.
To test its stuttering capabilities, I first uploaded one of my disfluent audio samples to create a new AI voice. ElevenLabs flawlessly generated a fluent AI voice clone from my disfluent input. You can listen to my AI voice here on my personal site.
ElevenLabs AI voice generation could filter out stuttering in new AI voices, but could it add it back in?
It can. I used ElevenLabs’ Voice Changer speech-to-speech (STS) feature to re-introduce stuttering: when I recorded a stuttered audio sample as the input, ElevenLabs rendered a stuttered output while maintaining the fidelity of the original voice’s sound and style.
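If you’d rather script this than use the web Voice Changer, the speech-to-speech endpoint looks roughly like the sketch below. The API key, voice ID, and file names are placeholders, and you should check the current ElevenLabs API docs for optional parameters such as model selection.

```python
# pip install requests
import requests

API_KEY = "YOUR_ELEVENLABS_API_KEY"      # placeholder
VOICE_ID = "YOUR_CLONED_VOICE_ID"        # the fluent clone created above

# POST a stuttered recording to the speech-to-speech (Voice Changer) endpoint;
# the output keeps the cloned voice's sound while following the input's delivery.
url = f"https://api.elevenlabs.io/v1/speech-to-speech/{VOICE_ID}"
with open("stuttered_prompt.wav", "rb") as audio_file:      # placeholder file
    response = requests.post(
        url,
        headers={"xi-api-key": API_KEY},
        files={"audio": audio_file},
    )
response.raise_for_status()

# Save the returned audio bytes
with open("stuttered_output.mp3", "wb") as out:
    out.write(response.content)
```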
So what’s the big deal?
These AI advancements are incredibly promising. Imagine how fine-tuning each model with data from PWS could yield even more impressive results.
Along these lines, Apple’s Machine Learning Research team has published recent findings on stuttering with Siri. With explicit fine-tuning for PWS, Apple researchers decreased instances of cutting off people who stutter mid-sentence by up to 79% while reducing Siri’s word error rate from 25% to just 10%.
A key finding from Apple’s research was that increasing the threshold time for the end of spoken words — giving PWS more time to finish their thoughts — vastly improved model performance for stuttering.
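To make that concrete, endpointing often boils down to a silence timeout. The toy sketch below is purely illustrative (not Apple’s implementation, and the numbers are made up): a longer threshold keeps the microphone open through blocks and pauses instead of treating them as the end of a sentence.

```python
from dataclasses import dataclass

@dataclass
class Endpointer:
    """Toy end-of-speech detector: declare the utterance over after
    `silence_threshold_s` of continuous silence."""
    silence_threshold_s: float
    _silence_s: float = 0.0

    def update(self, frame_is_silent: bool, frame_duration_s: float = 0.02) -> bool:
        """Feed one audio frame; return True when the utterance is considered over."""
        if frame_is_silent:
            self._silence_s += frame_duration_s
        else:
            self._silence_s = 0.0  # speech resumed, e.g. after a block
        return self._silence_s >= self.silence_threshold_s

# Illustrative settings: a default assistant might cut off after ~0.7 s of
# silence, while a PWS-friendly setting waits longer so a block isn't
# mistaken for the end of a thought.
default_listener = Endpointer(silence_threshold_s=0.7)
patient_listener = Endpointer(silence_threshold_s=2.0)
```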
As AI models progress, an important question emerges: If we can program AI to be a more patient listener for people who stutter, can we as humans learn to do the same?
Thanks for reading. I’d love to hear your thoughts and feedback on LinkedIn.