AI Finds the Best Way to Identify Speakers

TL;DR

A scoring method called normalized likelihood (NL) is shown to be mathematically optimal for both speaker identification and verification, and the analysis also helps explain why simple scores such as cosine distance work well in practice.

Speaker recognition technology, used in everything from unlocking smartphones to securing bank accounts, relies on scoring methods to decide if a voice matches a known speaker. Researchers have now established a theory showing that a specific scoring approach, called normalized likelihood (NL), is mathematically optimal for minimizing errors in both speaker identification and verification tasks. This finding provides a clear benchmark for improving real-world systems where accuracy is critical, such as in authentication and surveillance.

In speaker recognition, systems perform two main tasks: identification, which picks the correct speaker from a set of candidates, and verification, which checks whether a voice belongs to a claimed speaker. The key result is that normalized likelihood scores achieve minimum Bayes risk (MBR), meaning they minimize the expected cost of decision errors. For identification, this means selecting the candidate with the highest NL score; for verification, it means accepting the claim when the NL score exceeds a threshold. When the underlying data follows a Gaussian (normal) distribution, NL is equivalent to the widely used probabilistic linear discriminant analysis (PLDA) likelihood ratio, and common approximations like Euclidean and cosine distances can perform nearly as well under certain conditions.
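To make the two decision rules concrete, here is a minimal Python sketch. It assumes a simplified isotropic Gaussian model that this summary does not spell out: speaker means drawn from N(0, between_var * I) and utterance vectors drawn from N(mean, within_var * I). The function names, parameter names, and the default threshold of 0.0 are illustrative choices, not taken from the paper.

```python
import numpy as np

def log_nl_score(x, mu, within_var, between_var):
    """log NL = log p(x | speaker) - log p(x), under the assumed isotropic model."""
    d = x.shape[-1]
    total_var = within_var + between_var
    # log N(x; mu, within_var * I): likelihood of x given the speaker mean.
    log_p_x_given_spk = -0.5 * (d * np.log(2 * np.pi * within_var)
                                + np.sum((x - mu) ** 2, axis=-1) / within_var)
    # log N(x; 0, total_var * I): likelihood of x with the speaker marginalized out.
    log_p_x = -0.5 * (d * np.log(2 * np.pi * total_var)
                      + np.sum(x ** 2, axis=-1) / total_var)
    return log_p_x_given_spk - log_p_x

def identify(x, speaker_means, within_var, between_var):
    """Identification: return the index of the candidate with the highest NL score."""
    scores = log_nl_score(x, speaker_means, within_var, between_var)
    return int(np.argmax(scores))

def verify(x, claimed_mean, within_var, between_var, threshold=0.0):
    """Verification: accept the claim if the log NL score exceeds the threshold."""
    return bool(log_nl_score(x, claimed_mean, within_var, between_var) > threshold)
```

In this sketch `speaker_means` is an array of shape (number of speakers, dimension) and `x` is a single test vector. The threshold is a free parameter that a real system would tune to its own error costs.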

The researchers derived this result using decision theory, focusing on how to make the best choices with available data. They modeled speaker characteristics as vectors in a high-dimensional space, where each speaker is represented by a mean vector and variances account for differences between and within speakers. In simulations, they tested scenarios with known and unknown speaker means—reflecting real cases where enrollment data is limited. For example, with known means and 80 dimensions, all scores (NL, Euclidean, and cosine) showed similar identification rates, but for verification, Euclidean distance performed poorly when within-speaker variance was large, while cosine distance closely matched NL.
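A toy version of the known-means identification experiment might look like the sketch below. It is not the paper's simulation code: the isotropic Gaussian model, the number of speakers and trials, and the function name `simulate_identification` are all assumptions made for illustration. Under this assumed model the normalizing term p(x) does not depend on the candidate speaker, so NL and Euclidean distance rank candidates identically when the means are known, which is one way to see why their identification rates agree.

```python
import numpy as np

def simulate_identification(dim=80, n_speakers=50, n_trials=200,
                            between_var=1.0, within_var=1.0, seed=0):
    """Compare identification accuracy of NL, Euclidean, and cosine scores
    when every speaker mean is known exactly (toy setup, assumed model)."""
    rng = np.random.default_rng(seed)
    # One mean vector per speaker; spread controlled by the between-speaker variance.
    means = rng.normal(0.0, np.sqrt(between_var), size=(n_speakers, dim))
    labels = rng.integers(0, n_speakers, size=n_trials)
    # Each test vector is its speaker's mean plus within-speaker noise.
    x = means[labels] + rng.normal(0.0, np.sqrt(within_var), size=(n_trials, dim))

    # Squared distance from every test vector to every mean: (n_trials, n_speakers).
    sq_dist = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    x_sq = (x ** 2).sum(axis=-1, keepdims=True)
    x_norm = np.linalg.norm(x, axis=1, keepdims=True)
    mean_norm = np.linalg.norm(means, axis=1)

    scores = {
        # log NL up to an additive constant that is the same for all candidates.
        "nl": -sq_dist / (2 * within_var) + x_sq / (2 * (within_var + between_var)),
        "euclidean": -sq_dist,
        "cosine": (x @ means.T) / (x_norm * mean_norm),
    }
    # Accuracy: fraction of trials where the top-scoring candidate is the true speaker.
    return {name: float((s.argmax(axis=1) == labels).mean())
            for name, s in scores.items()}
```

Calling `simulate_identification(within_var=4.0)` or lowering `dim` is a quick way to explore how the three scores respond to larger within-speaker variance and fewer dimensions.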

Results from the simulation experiments highlight the importance of data dimensions and variances. Higher dimensions improved performance significantly, as they allow better separation of speakers. For instance, in verification tasks with unknown means, cosine scores approximated NL well, especially when within-speaker variance was high, keeping the equal error rate (EER) close to that of NL. The EER is the error rate at the operating point where the false acceptance rate equals the false rejection rate. This analysis helps explain why simple scores like cosine distance are effective in practice, guiding developers toward more reliable systems without overcomplicating designs.
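For readers unfamiliar with the EER, the sketch below shows one simple way to estimate it from two lists of verification scores. It uses a plain threshold sweep rather than the interpolated estimate found in evaluation toolkits, and the function name `equal_error_rate` is hypothetical.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    target_scores: scores of genuine trials (same speaker).
    nontarget_scores: scores of impostor trials (different speaker).
    Higher scores are assumed to mean "more likely the same speaker".
    """
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # False rejection: a genuine trial scoring below the threshold.
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    # False acceptance: an impostor trial scoring at or above the threshold.
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    # Pick the threshold where the two error rates are closest and average them.
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2.0)
```

Feeding this function the NL, Euclidean, and cosine scores from a set of genuine and impostor trials gives one number per scoring method, which is how comparisons like the ones described here are usually summarized.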

The implications extend to everyday applications, such as voice assistants and security systems, where accurate speaker recognition prevents fraud and enhances user experience. By clarifying the optimal scoring method, this research encourages focusing on normalizing speaker vectors to match assumed distributions, rather than developing complex, ad-hoc algorithms. It also underscores that direction-based metrics (like cosine distance) often outperform magnitude-based ones in high-dimensional spaces, aligning with trends in other fields like information retrieval.

Limitations noted in the paper include the assumption of Gaussian distributions, which may not hold in all real-world scenarios, and the need for further research on non-linear models. The simulations assumed ideal conditions, and performance could vary with noisy or mismatched data, highlighting areas for future improvement in handling real-life imperfections.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
