A new approach to training text-to-speech (TTS) models has shown that balancing how a model learns speech sounds against how it learns emotional tone is crucial for natural-sounding voices. Researchers from **MTUCI** in Moscow, in a paper published on arXiv, developed a method that combines two training stages: the first teaches the model language structure, and the second aligns text with audio to capture prosody, the rhythm, stress, and intonation of speech. Tested on Russian-language data, this two-stage curriculum produced the best results for intelligibility, speaker similarity, and overall quality, outperforming variants that emphasize prosody at the expense of sound discrimination.
The findings highlight a key insight: improving how AI retrieves prosodic information does not always lead to better speech generation, emphasizing the need for a **balanced training strategy**.
## How the Two-Stage Training Works
The core finding is that a two-stage training process yields the most effective AI for text-to-speech. In the first stage, the AI learns through **masked language modeling**, where parts of the text are hidden and the model must predict them, building a foundation in language structure.
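The masking step behind this first stage can be sketched in a few lines of Python. The mask id, the token ids, and the default masking rate below are illustrative assumptions, not the paper's exact setup:

```python
import random

MASK_ID = 0  # hypothetical id reserved for the [MASK] token

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Randomly hide a fraction of tokens, MLM-style.

    Returns (corrupted, targets): `corrupted` is the input the encoder
    sees; `targets` holds the original id at masked positions and -100
    (an "ignore" marker) elsewhere, so the loss is computed only on
    the hidden tokens.
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tid in token_ids:
        if rng.random() < mask_prob:
            corrupted.append(MASK_ID)
            targets.append(tid)      # model must predict this token
        else:
            corrupted.append(tid)
            targets.append(-100)     # position excluded from the loss
    return corrupted, targets
```

The same corruption recipe is applied independently to the phoneme and BPE token streams; only the positions replaced by `MASK_ID` contribute to the prediction loss.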
The second stage uses **cross-modal contrastive learning** with mixed-phoneme batches, where the AI aligns text embeddings with audio embeddings while distinguishing between different phonemes (the basic units of sound in speech). This combination achieved the highest scores in perceptual metrics, with a mean opinion score (MOS) of **1.980** and an intonation MOS of **2.185**. It also produced the lowest word error rate (WER) of **0.176** and the highest speaker similarity (SIM-o) of **0.862**, indicating clearer and more authentic-sounding speech.
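A SigLIP-style loss treats every text/audio pair in the batch as an independent binary decision rather than a softmax over the batch. The sketch below assumes L2-normalised embeddings and illustrative temperature and bias values; the paper's actual hyperparameters are not reproduced here:

```python
import numpy as np

def siglip_loss(text_emb, audio_emb, temperature=10.0, bias=-10.0):
    """SigLIP-style pairwise sigmoid loss for text/audio alignment.

    Row i of `text_emb` matches row i of `audio_emb`. Each (i, j) pair
    gets an independent binary label: +1 on the diagonal (matching
    pair), -1 elsewhere (mismatched pair), so in a mixed-phoneme batch
    the negatives force the encoder to tell phonemes apart.
    """
    # L2-normalise so dot products are cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = temperature * t @ a.T + bias      # (B, B) similarity matrix
    labels = 2.0 * np.eye(len(t)) - 1.0        # +1 diagonal, -1 off-diagonal
    # mean over all pairs of -log sigmoid(label * logit)
    return float(np.mean(np.log1p(np.exp(-labels * logits))))
```

Correctly aligned batches should score a lower loss than batches whose audio rows have been permuted, which is what the training signal rewards.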
## Architecture and Training Details
The methodology involves a **dual-stream encoder architecture** that processes both phoneme sequences and BPE (byte-pair encoding) text tokens, with speaker conditioning integrated via **AdaLN-Zero** to account for individual voice characteristics.
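AdaLN-Zero injects the speaker embedding by predicting per-channel shift, scale, and gate values from it, with the projection zero-initialised so conditioning switches on gradually during training. This is a minimal sketch under assumed shapes, not the paper's exact implementation:

```python
import numpy as np

class AdaLNZero:
    """Minimal AdaLN-Zero conditioning block (illustrative shapes).

    A speaker embedding is projected to (shift, scale, gate) triples.
    Because the projection starts at zero, the block initially reduces
    to plain LayerNorm with a zeroed residual branch, and the speaker
    conditioning is learned gradually.
    """
    def __init__(self, cond_dim, hidden_dim):
        # zero init: shift = scale = gate = 0 at step 0
        self.W = np.zeros((cond_dim, 3 * hidden_dim))
        self.b = np.zeros(3 * hidden_dim)

    def __call__(self, x, speaker_emb, sublayer):
        # parameter-free LayerNorm over the channel dimension
        mu = x.mean(-1, keepdims=True)
        sigma = x.std(-1, keepdims=True)
        h = (x - mu) / (sigma + 1e-6)
        shift, scale, gate = np.split(speaker_emb @ self.W + self.b, 3, axis=-1)
        h = h * (1.0 + scale) + shift      # speaker-conditioned modulation
        return x + gate * sublayer(h)      # gated residual, zero at init
```

At initialisation the gate is zero, so the block is an identity map over `x`; the speaker signal only shapes the output once the projection weights move away from zero.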
The training curriculum progresses through stages: Stage 1 applies masked language modeling independently to phoneme and BPE encoders for **75,000 steps**. Stage 2 trains the full encoder jointly with an ECAPA-TDNN acoustic branch using a SigLIP-style contrastive loss over **20,000 steps**, with batches containing diverse phoneme types to teach phoneme discrimination.
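The "diverse phoneme types per batch" requirement of Stage 2 can be met with a simple bucketed, round-robin sampler. The sketch below is a hypothetical sampler keyed on a single phoneme per utterance; the paper's actual batching logic may differ:

```python
import random
from collections import defaultdict

def mixed_phoneme_batches(utterances, batch_size, seed=0):
    """Group utterances so each batch covers diverse phonemes.

    `utterances` is a list of (utt_id, phoneme) pairs, a simplified
    stand-in for however examples are keyed in practice. Bucketing by
    phoneme and drawing round-robin makes the in-batch negatives differ
    in phoneme, which is what gives the contrastive loss its
    phoneme-discrimination signal.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for utt_id, phoneme in utterances:
        buckets[phoneme].append(utt_id)
    for ids in buckets.values():
        rng.shuffle(ids)
    batches, current = [], []
    while any(buckets.values()):
        for phoneme in sorted(buckets):          # round-robin over phonemes
            if buckets[phoneme]:
                current.append(buckets[phoneme].pop())
                if len(current) == batch_size:
                    batches.append(current)
                    current = []
    if current:
        batches.append(current)
    return batches
```

The Stage-3 variant studied in the paper does the opposite, filling batches from a single phoneme bucket, which is exactly the setting found to erode phoneme discrimination.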
An optional Stage 3, studied separately, refines prosody using same-phoneme batches but was found to degrade performance. The encoder was evaluated in two downstream systems: Grad-TTS for rapid ablations and DiTTo-TTS, a latent-diffusion model, for comparison with larger-scale systems.
## Key Results and the Prosody-Discrimination Trade-Off
Results from the paper demonstrate that the two-stage curriculum outperforms other variants across multiple metrics. In intrinsic retrieval tests, Stage 2 alone achieved the highest phoneme discrimination score (**R@1-diff of 0.933**), while the 1+2 combination balanced this with prosodic sensitivity (**R@1-sim of 0.746**).
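The R@1 numbers are top-1 retrieval accuracies: each text embedding should rank its true audio counterpart first. A generic sketch of the metric, which does not reproduce the paper's specific same-phoneme (R@1-sim) versus different-phoneme (R@1-diff) gallery construction:

```python
import numpy as np

def recall_at_1(query_emb, gallery_emb):
    """Top-1 retrieval accuracy (R@1) under cosine similarity.

    Row i of `query_emb` should retrieve row i of `gallery_emb` as its
    nearest neighbour; the score is the fraction of queries for which
    that holds.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    nearest = np.argmax(q @ g.T, axis=1)        # index of top match per query
    return float(np.mean(nearest == np.arange(len(q))))
```

A perfectly aligned embedding space scores 1.0; any systematic mismatch between modalities pulls the score toward chance level.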
Downstream synthesis with Grad-TTS confirmed these advantages, with the 1+2 curriculum leading in MOS, IntMOS, WER, and SIM-o. However, adding Stage 3, which improved prosodic retrieval to an R@1-sim of **0.770**, caused a drop in phoneme discrimination to **0.893** and worsened synthesis quality, with MOS falling to **1.540** and WER rising to **0.429**.
This indicates that gains in embedding-space retrieval metrics do not necessarily translate into better generative performance, a disconnect the researchers highlight as critical.
## Implications for Natural Speech Synthesis
The implications of this research are significant for developing more natural and controllable text-to-speech systems, especially for languages like Russian with complex prosodic features. By showing that phoneme discrimination and prosodic sensitivity must be trained jointly, the study offers a practical guideline for AI training curricula, potentially improving applications in **virtual assistants**, **audiobooks**, and **accessibility tools**.
The comparison with contemporary systems reveals that the DiTTo-TTS system using the best encoder achieved competitive spectral fidelity and intelligibility, though it lagged in subjective ratings compared to models like F5-TTS, suggesting room for further refinement in perceived naturalness.
## Limitations and Future Directions
The study's limitations include its focus on Russian-language data, which may not generalize to other languages without adaptation. The paper notes that same-phoneme refinement in Stage 3 led to **catastrophic forgetting** of phoneme-discriminative features, reducing synthesis quality despite better prosodic retrieval, and that the contrastive signal in this stage was too weak for stable refinement.
Additionally, the evaluation against other TTS systems involved different training data and model capacities, making direct comparisons difficult. Future work could explore scaling to more languages or integrating the encoder with more advanced generative models to further improve speech naturalness.
## Sources & References
- Combining Masked Language Modeling and Cross-Modal Contrastive Learning for Prosody-Aware TTS — arXiv
- Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech — arXiv
- DiTTo-TTS: Diffusion Transformers for Scalable Text-to-Speech — arXiv
- F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching — arXiv
- ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification — arXiv
- Sigmoid Loss for Language Image Pre-Training (SigLIP) — arXiv
## About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.