Text-to-audio models, which generate sounds from text prompts, are increasingly used in games and media for creating sound effects and music. However, it has been unclear how varied and faithful these AI-generated sounds are. Researchers from the New Jersey Institute of Technology, University of Gothenburg, and American University conducted the first systematic analysis of these models' outputs, revealing significant inconsistencies in sound diversity and accuracy that could impact their reliability for professional applications.
The key finding is that text-to-audio models exhibit wide variation in how they interpret and generate sounds for the same prompt, with some models producing highly repetitive outputs or sounds that do not match the prompt description. For example, when prompted with 'thunder', Stable Audio Open generated distinct thunderclaps with varying timing and loudness, while MMAudio often included rain sounds not requested, resulting in muffled thunder. This indicates that models do not uniformly capture the intended audio characteristics, leading to unpredictable results.
The researchers adapted expressive range analysis (ERA), a technique from procedural content generation in games, to evaluate three open-source models: Stable Audio Open, MMAudio, and AudioLDM. They used fixed prompts derived from the Environmental Sound Classification (ESC-50) dataset, such as 'sound of helicopter' or 'crying baby', and generated 100 audio samples per prompt per model. They then analyzed the samples along acoustic dimensions such as pitch, loudness, and timbre, using spectrograms and root-mean-square (RMS) loudness measurements to quantify variation. For instance, they computed fundamental frequency for pitch and Mel-frequency cepstral coefficients (MFCCs) for timbre, then applied principal component analysis (PCA) to visualize the data in two-dimensional plots.
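To make the loudness part of this analysis concrete, here is a minimal sketch of the idea, not the authors' implementation. It computes a frame-wise RMS envelope for each clip, reduces each clip to two descriptors, and measures how much those descriptors spread across a batch generated from one prompt. The function names, the frame and hop sizes, and the synthetic sine-tone "clips" are all assumptions made for illustration; the paper's pitch, MFCC, and PCA steps are not covered.

```python
import math
import statistics


def rms_envelope(samples, frame_len=1024, hop=512):
    """Frame-wise root-mean-square loudness of a mono signal (list of floats)."""
    if not samples:
        raise ValueError("samples must not be empty")
    envelope = []
    last_start = max(len(samples) - frame_len, 0)
    for start in range(0, last_start + 1, hop):
        frame = samples[start:start + frame_len]
        envelope.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return envelope


def clip_descriptor(samples):
    """Summarize one clip as (mean RMS, peak RMS). Hypothetical descriptor choice."""
    envelope = rms_envelope(samples)
    return statistics.fmean(envelope), max(envelope)


def expressive_range(clips):
    """Spread of each descriptor across clips generated from the same prompt."""
    descriptors = [clip_descriptor(clip) for clip in clips]
    return {
        "mean_rms_std": statistics.pstdev(d[0] for d in descriptors),
        "peak_rms_std": statistics.pstdev(d[1] for d in descriptors),
    }


def sine(amplitude, freq=220.0, sample_rate=8000, seconds=1.0):
    """Synthetic stand-in for a generated clip (assumption, not real model output)."""
    n = int(sample_rate * seconds)
    return [amplitude * math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]


# A "varied" model produces clips at different loudness levels;
# a "repetitive" model produces the same clip every time.
varied = [sine(a) for a in (0.1, 0.4, 0.8)]
repetitive = [sine(0.4) for _ in range(3)]

print(expressive_range(varied))
print(expressive_range(repetitive))
```

A batch of identical clips yields a spread of zero, while a batch with different loudness levels yields a larger one. A real analysis would load the 100 generated files per prompt in place of the synthetic tones.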
The results show clear differences between models. In the thunder example, Stable Audio Open produced outputs with loudness peaks early in the clip (e.g., within the first two seconds), indicating distinct thunderclaps, while MMAudio's outputs had no consistent peak timing and lower variation. For the 'helicopter' prompt, Stable Audio Open consistently generated fly-by sounds with loudness peaks 3 to 5 times the average, unlike the other models. Overall, summary statistics across 50 labels revealed that AudioLDM had the highest output variance, even exceeding the ESC-50 reference dataset in pitch variation, but all models showed less diversity in loudness and timbre than the hand-curated sounds. Figures 2 and 3 of the paper illustrate these patterns, with spectrograms and loudness plots highlighting how models differ in sound structure over time.
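The two loudness measures mentioned here, when the peak occurs and how far it rises above the clip's average, can be read off a loudness envelope with a few lines of code. The sketch below is illustrative only: `peak_stats`, the half-second hop, and the two hand-written envelopes are assumptions, not data or code from the paper.

```python
import statistics


def peak_stats(envelope, hop_seconds):
    """Return (peak time in seconds, peak-to-mean loudness ratio) for one clip.

    `envelope` is a list of non-negative loudness values, one per frame,
    spaced `hop_seconds` apart.
    """
    if not envelope:
        raise ValueError("envelope must not be empty")
    # Index of the first frame that reaches the maximum loudness.
    peak_index = max(range(len(envelope)), key=envelope.__getitem__)
    mean_level = statistics.fmean(envelope)
    # A silent clip has no meaningful ratio; report 0.0 in that case.
    ratio = envelope[peak_index] / mean_level if mean_level > 0 else 0.0
    return peak_index * hop_seconds, ratio


# Hypothetical envelopes: one with a sharp early burst, one flat.
burst = [0.05, 0.9, 0.3, 0.1, 0.05, 0.05, 0.05, 0.05]
flat = [0.2] * 8

for name, env in (("burst", burst), ("flat", flat)):
    peak_time, ratio = peak_stats(env, hop_seconds=0.5)
    print(f"{name}: peak at {peak_time:.1f} s, peak/mean = {ratio:.2f}")
```

An early peak time combined with a high ratio corresponds to a distinct event such as a thunderclap, while a ratio near 1 indicates a steady sound with no pronounced peak.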
This research matters because text-to-audio models are used in interactive media, such as video games and virtual environments, where sound diversity and accuracy enhance user experience. If models produce repetitive or off-target sounds, it could limit their utility for creators seeking realistic and varied audio. The findings emphasize the need for better model evaluation to ensure they meet creative standards, potentially guiding improvements in AI training for more reliable sound generation.
Limitations of the study include its focus on fixed prompts, which does not capture the full generative space of text-to-audio models. The paper notes that models may generate sounds unrelated to the prompt, increasing apparent diversity in ways users do not desire, and that the analysis metrics, while general, may not distinguish between meaningful and irrelevant variations. Future work could explore prompt variations and broader sound categories to better understand model capabilities.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.