The way we speak—not just what we say—influences how artificial intelligence perceives and responds to us. A new study reveals that speech foundation models, AI systems that process raw audio, exhibit systematic biases based on subtle vocal qualities like breathiness and creakiness. These biases mirror human stereotypes, potentially amplifying discrimination in automated hiring, therapy, and other sensitive applications.
Researchers discovered that AI models consistently associate different voice qualities with specific personality traits and capabilities. Breathy voices, often linked with intimacy in human perception, prompted AI to rate speakers higher on empathy and emotional validation. In contrast, creaky voices—associated with authority or disengagement—led to higher ratings for leadership endorsement but lower ratings for emotional warmth. These patterns emerged across multiple evaluation dimensions including role status, salary offers, and heroic agency in generated stories.
To probe these effects, the team created VQ-Bench, a parallel dataset in which identical speech content is delivered with different voice qualities. Using state-of-the-art voice conversion technology, they synthesized modal (normal), breathy, creaky, and end-creak (creak confined to the end of the utterance) versions of the same prompts while keeping speaker identity and linguistic content constant. Because the versions differ only in voice quality, any change in the AI's behavior can be attributed to voice quality alone.
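The value of a parallel dataset is that each altered rendition can be compared against the modal rendition of the same prompt and speaker. Below is a minimal Python sketch of such a paired comparison. It is not the study's code: the record layout, the `paired_shifts` helper, and the scores are hypothetical placeholders chosen only to illustrate the idea.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy ratings (NOT results from the study). Each record is
# (prompt_id, speaker_id, voice_quality, model_score).
RATINGS = [
    ("p1", "s1", "modal", 5.0),
    ("p1", "s1", "breathy", 4.0),
    ("p1", "s1", "creaky", 5.5),
    ("p2", "s1", "modal", 6.0),
    ("p2", "s1", "breathy", 5.5),
    ("p2", "s1", "creaky", 6.0),
]


def paired_shifts(ratings, baseline="modal"):
    """Mean score difference of each voice quality versus the baseline,
    computed only within matched (prompt, speaker) pairs."""
    by_item = defaultdict(dict)
    for prompt_id, speaker_id, quality, score in ratings:
        by_item[(prompt_id, speaker_id)][quality] = score

    diffs = defaultdict(list)
    for scores in by_item.values():
        if baseline not in scores:
            continue  # no reference rendition, so no paired comparison
        for quality, score in scores.items():
            if quality != baseline:
                diffs[quality].append(score - scores[baseline])
    return {quality: mean(values) for quality, values in diffs.items()}


shifts = paired_shifts(RATINGS)
print(shifts)
```

Differencing within each pair cancels out anything specific to the prompt or the speaker, so what remains reflects the change in voice quality.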
The results showed striking consistency. In hiring scenarios, breathy voices received lower salary offers and leadership endorsements compared to modal voices, particularly for female speakers. In therapeutic contexts, breathy voices elicited more supportive responses, while creaky voices prompted more reserved reactions. Emotion recognition systems also showed systematic shifts—breathy voices increased classifications of 'calm' while decreasing 'fearful' and 'happy' predictions.
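A shift in emotion predictions of this kind can be expressed as the change in each label's share of predictions between the modal and the altered renditions. The Python sketch below shows one way to compute it. The helper names and the label lists are hypothetical toy inputs, not data or code from the study.

```python
from collections import Counter


def label_shares(labels):
    """Fraction of predictions assigned to each emotion label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def share_shift(baseline_labels, variant_labels):
    """Change in each label's share when moving from baseline to variant."""
    base = label_shares(baseline_labels)
    variant = label_shares(variant_labels)
    all_labels = sorted(set(base) | set(variant))
    return {
        label: variant.get(label, 0.0) - base.get(label, 0.0)
        for label in all_labels
    }


# Hypothetical classifier outputs for the same four prompts (toy values).
MODAL = ["calm", "happy", "fearful", "happy"]
BREATHY = ["calm", "calm", "calm", "happy"]

shift = share_shift(MODAL, BREATHY)
print(shift)
```

A positive value means the label became more frequent for the altered voice quality, and a negative value means it became less frequent.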
These findings matter because AI systems are increasingly deployed in high-stakes domains where vocal cues might influence outcomes. If foundation models encode and reproduce human biases based on voice quality, they could disproportionately disadvantage certain speakers in job interviews, therapy sessions, or automated screening systems. The study also notes that its voice-quality comparisons cover only binary gender categories, and it calls for more inclusive testing with gender-ambiguous and nonbinary voices.
While the research demonstrates clear patterns, several limitations remain. The study focused on American English speakers and specific vocal qualities, leaving open questions about how these effects might vary across languages and cultures. Additionally, the current analysis cannot determine whether AI models are simply reproducing documented human biases or creating new forms of discrimination. As speech AI becomes more integrated into daily life, understanding and addressing these subtle biases becomes crucial for ensuring fair and equitable automated systems.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.