
AI Models Struggle to Read Music Like Humans Do

A new benchmark reveals that large language and vision models fail to understand complete musical scores, with visual recognition performing especially poorly, but a text-based notation offers a path forward.

AI Research
March 27, 2026
4 min read

A new study exposes a significant gap in artificial intelligence's ability to comprehend music as humans do. Researchers have developed the first large-scale benchmark to evaluate how well AI models understand complete musical scores, from basic notation to complex harmonic structures. The results reveal that current state-of-the-art models perform poorly, particularly when trying to read music from visual sheet music images, highlighting a fundamental gap in multimodal AI reasoning. This work, detailed in a preprint paper, introduces the Musical Score Understanding Benchmark (MSU-Bench), which tests models on 1,800 human-curated question-answer pairs across 150 scores by composers such as Bach, Beethoven, and Chopin. The benchmark aims to push AI beyond simple pattern recognition toward genuine musicological analysis, an area where existing systems often hallucinate or mislocalize information, as shown in Figure 1 of the paper, where models fabricate articulations not present in the actual score.

The researchers found that AI models struggle to sustain correct answers across progressive levels of musical comprehension. In zero-shot evaluations, the best-performing model, Gemini 2.5 Pro, achieved only 49.44% overall accuracy when answering questions based on text-based ABC notation, a symbolic format that encodes music in human-readable characters. Performance dropped sharply for visual questions based on PDF scores, with the top model, Claude Opus 4, reaching just 24.22% accuracy. The study identified a clear modality gap: models are significantly better at processing ABC notation than raw score images, where errors in bar localization and recognition dominate. For example, in textual QA, Gemini 2.5 Pro scored 65.33% on Level 1 questions about basic metadata like composer and tempo, but this fell to 56.00% on Level 2 questions about local notational features, and further declined at higher levels. Visual QA performance was even more fragile, with most models failing to exceed 20% overall, and smaller variants like Qwen2.5-VL-7B-Instruct collapsing entirely to 0.00% accuracy.
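For readers unfamiliar with ABC notation, the snippet below is a minimal illustrative fragment (a plain C major scale, not an excerpt from the benchmark): header fields give the tune index (X), title (T), meter (M), default note length (L), and key (K), followed by the notes themselves.

```
X:1
T:Illustrative Scale
M:4/4
L:1/8
K:C
C D E F G A B c | c B A G F E D C |]
```

Because an entire score can be encoded as plain characters like these, a language model can process it as ordinary text, which helps explain its advantage over pixel-based PDF input.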

The methodology involved creating MSU-Bench with four hierarchical levels of questions, each designed to test increasingly complex aspects of musical understanding. Level 1 focuses on basic metadata such as composer, title, and tempo; Level 2 on notation and note-level features like dynamics and pitch; Level 3 on chord and harmony analysis; and Level 4 on texture and form, including motifs and thematic development. The benchmark uses both textual QA with ABC notation and visual QA with PDF scores, allowing for multimodal evaluation. The researchers tested over 15 state-of-the-art models in zero-shot settings and also fine-tuned smaller models using LoRA, a lightweight adaptation technique. They introduced a Level-wise Success Rate (LSR) metric to measure how well models maintain correctness across successive levels, revealing that most fail to do so, as illustrated in Figure 4, where LSR drops steeply from Level 1 to Level 4.
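The paper's exact formula for LSR is not reproduced here, but one natural reading, consistent with the later observation that few scores remain answerable past Level 2, is a cumulative survival rate: a score counts toward level k only if the model answered correctly at every level up through k. A minimal Python sketch under that assumption (function and variable names are hypothetical):

```python
from typing import Dict, List

def level_wise_success_rate(
    results: Dict[str, List[bool]],  # score_id -> per-level correctness, Levels 1..4
    num_levels: int = 4,
) -> List[float]:
    """For each level k, the fraction of scores answered correctly
    at every level from 1 through k (a cumulative 'survival' rate)."""
    lsr = []
    for k in range(num_levels):
        survivors = sum(1 for levels in results.values() if all(levels[: k + 1]))
        lsr.append(survivors / len(results))
    return lsr

# Toy usage: two scores, one failing from Level 3 onward.
toy = {
    "bach_invention_1": [True, True, True, True],
    "chopin_nocturne_9_2": [True, True, False, False],
}
print(level_wise_success_rate(toy))  # [1.0, 1.0, 0.5, 0.5]
```

Under this reading LSR can only decrease with level, matching the steep Level 1 to Level 4 decline the paper reports in Figure 4.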

Analysis of the results shows that fine-tuning with LoRA substantially improves performance, closing the gap between textual and visual QA. For instance, Qwen3-4B increased its overall accuracy from 23.05% to 46.94% in textual QA after adaptation, while Qwen2.5-VL-3B-Instruct improved from 24.72% to 50.00% in visual QA with PDF input. The study also found that asking questions one by one yields better performance than presenting all 12 questions for a score at once, suggesting that current models do not effectively leverage hierarchical scaffolding. Additionally, the researchers evaluated forgetting on the Massive Multitask Language Understanding (MMLU) benchmark and found that LoRA adaptation preserved general knowledge, with minimal performance drops in subjects like STEM and humanities, as shown in Table 4.
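The article does not detail the authors' fine-tuning recipe, but a representative LoRA setup using Hugging Face's peft library would look roughly like the sketch below; the rank, scaling factor, and target modules are illustrative assumptions, not the paper's settings.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base model matches one the paper adapts; every hyperparameter
# below is an illustrative assumption, not the authors' choice.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")

config = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,                         # adapter scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the small adapter weights train
```

LoRA freezes the base weights and trains only small low-rank adapter matrices, which fits the paper's finding that general knowledge on MMLU is largely preserved after adaptation.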

The implications of this research are significant for the future of AI in musicology and education. By highlighting the limitations of current models, MSU-Bench provides a rigorous foundation for developing AI systems that can genuinely understand and analyze music, potentially aiding in tasks like score transcription, music theory tutoring, and automated composition. The use of ABC notation as a more reliable representation than visual scores points to a practical approach for improving AI's musical reasoning, though the study notes that real-world applications often require handling visual sheet music directly. The benchmark's focus on complete scores, rather than isolated fragments, addresses a gap in existing research, as previous benchmarks like MusicTheoryBench and MusiXQA typically deal with monophonic music or multiple-choice tasks, lacking the complexity needed for holistic understanding.

However, the study acknowledges several limitations. The benchmark is based on 150 scores, primarily from Western classical music, which may not fully represent global musical diversity. While fine-tuning improves performance, it requires computational resources and may not scale easily to all models. The visual QA results indicate that bar localization remains a persistent challenge, leading to hallucinations where models invent content not grounded in the score, as depicted in Figure 1a. Furthermore, the LSR analysis shows that even the best models struggle to sustain multi-level comprehension, with few scores remaining answerable past Level 2 in visual QA. The researchers suggest that future work should explore better multimodal alignment techniques and more robust grounding mechanisms to address these issues, aiming for AI that can read music with the depth and accuracy of human experts.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn