AIResearch
Science

How Language Models Decode Idioms: A Circuit Analysis

Researchers uncover specialized attention heads and augmented reception mechanisms in transformers

AI Research
March 26, 2026
4 min read

In the quest to understand how large language models process human language, researchers have turned their attention to one of the most challenging linguistic phenomena: idiomatic expressions. These non-compositional phrases—where the meaning can't be deduced from individual words—represent a crucial test case for understanding how transformers balance literal interpretation with contextual understanding. A new study from EPFL researchers provides unprecedented insight into how models like Gemma 2-2B process idioms through specialized computational circuits, revealing mechanisms that could inform everything from model interpretability to more robust language understanding systems.

The research, detailed in a recent preprint, demonstrates that idiom processing in transformers follows a distinct two-phase pattern occurring primarily in early network layers. Through systematic analysis of eight common English idioms including "a piece of cake" and "kicked the bucket," the researchers found that cross-token attention happens in layers 0-2, while semantic integration of figurative meaning occurs in layers 3-5. Most strikingly, nearly all idiomatic computation occurs on the final token of each expression, suggesting that transformers defer processing until all necessary context is available. This deferral represents an efficient computational strategy that avoids wasting resources on anticipatory calculations that might prove unnecessary.
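The two-phase picture can be illustrated with a toy layer-wise probe. The sketch below (a minimal illustration with synthetic data, not the paper's code) tracks the cosine similarity between the final-token residual stream at each layer and a hypothetical "figurative meaning" direction; in the synthetic setup, similarity jumps at the layers standing in for the semantic-integration phase.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantic_integration_profile(hidden_states, figurative_ref):
    """For each layer, measure how close the final-token residual
    stream is to a reference 'figurative meaning' direction."""
    return [cosine(h, figurative_ref) for h in hidden_states]

# Toy demo: 6 layers, 8-dim residual stream. We synthesize states that
# drift toward the figurative direction from layer 3 onward, mimicking
# the reported cross-token phase (layers 0-2) followed by semantic
# integration (layers 3-5).
rng = np.random.default_rng(0)
dim = 8
fig = rng.normal(size=dim)
states = []
for layer in range(6):
    mix = 0.1 if layer < 3 else 0.9   # pre-integration vs. integration phase
    states.append(mix * fig + (1 - mix) * rng.normal(size=dim))

profile = semantic_integration_profile(states, fig)
# Similarity to the figurative direction rises sharply at layer 3.
assert max(profile[:3]) < min(profile[3:])
```

In a real analysis the hidden states would come from the model's residual stream at the idiom's final token, and the reference direction would have to be estimated, e.g. from paraphrases of the figurative meaning.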

The methodology employed represents a significant innovation in mechanistic interpretability. Researchers developed a modified path patching algorithm based on the Automatic Circuit Discovery (ACDC) framework but with crucial adaptations for idiom analysis. Instead of measuring changes in final output logits, they evaluated component importance based on cosine similarity changes in intermediate layers—specifically targeting the point where idioms achieve their figurative meaning. The approach involved generating carefully controlled corrupted versions of idioms where key tokens were replaced with semantically similar alternatives (like replacing "piece" with "chunk" or "slice" in "a piece of cake") while preserving grammatical structure and literal meaning but not figurative meaning. This allowed researchers to isolate which components were specifically responsible for idiomatic interpretation.
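The core scoring idea can be sketched in a few lines. The toy below (an assumption-laden simplification, not the paper's implementation) patches one component's activation at a time with its value from a corrupted run and scores importance as the resulting drop in cosine similarity to a reference direction rather than a change in output logits; the component names and the sum-based "readout" are hypothetical.

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def component_importance(clean_acts, corrupt_acts, readout, reference):
    """Patch one component at a time: replace its clean activation with
    the corrupted-run activation, recompute the intermediate
    representation, and score importance as the drop in cosine
    similarity to a reference direction (not a logit difference)."""
    baseline = cos(readout(clean_acts), reference)
    scores = {}
    for name in clean_acts:
        patched = dict(clean_acts)
        patched[name] = corrupt_acts[name]
        scores[name] = baseline - cos(readout(patched), reference)
    return scores

# Toy setup: the 'model' just sums component outputs into the residual
# stream, and one component carries the figurative direction.
rng = np.random.default_rng(1)
dim = 16
reference = rng.normal(size=dim)
clean = {
    "head_2_0": 2.0 * reference,        # hypothetical idiom-carrying head
    "head_0_1": rng.normal(size=dim),
    "mlp_1":    rng.normal(size=dim),
}
corrupt = {k: rng.normal(size=dim) for k in clean}  # e.g. the "chunk of cake" run
readout = lambda acts: sum(acts.values())

scores = component_importance(clean, corrupt, readout, reference)
# Patching the idiom-carrying component causes the largest drop.
assert max(scores, key=scores.get) == "head_2_0"
```

In the actual study, the clean and corrupted activations come from forward passes on the original idiom and its minimally edited variant, and patching respects the model's computational graph rather than a flat sum.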

Three major findings emerged from the circuit analysis. First, the researchers identified specialized "Idiom Heads"—specific attention heads that consistently activate across multiple idioms. Head (2,0) in particular showed significant performance drops when patched across four different idioms, suggesting functional specialization for non-compositional language processing. Second, analysis of Query-Key space revealed that each idiom employs distinct directions rather than a universal "idiom direction," with diagonal entries in dot product tables showing values like 72 for "kicked the bucket" compared to non-diagonal entries as low as 8. Third, the study discovered "augmented reception"—a mechanism where early idiom processing creates enhanced receptivity between idiom tokens in later attention layers, allowing the model to more efficiently allocate computational resources.
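The Query-Key finding amounts to a diagonally dominant dot-product table. The sketch below (a toy with random, hence nearly orthogonal, per-idiom directions; the specific idioms and dimensions are illustrative assumptions) shows how such a table is built and why per-idiom directions, rather than one shared "idiom direction," produce large diagonal and small off-diagonal entries.

```python
import numpy as np

def qk_dot_table(query_dirs, key_dirs):
    """Dot products between each idiom's query direction (rows) and
    each idiom's key direction (columns)."""
    Q = np.stack(list(query_dirs.values()))
    K = np.stack(list(key_dirs.values()))
    return Q @ K.T

# Toy illustration: each idiom gets its own direction, so the table is
# diagonally dominant -- the signature behind the reported contrast
# between on-diagonal values (e.g. 72) and off-diagonal ones (e.g. 8).
rng = np.random.default_rng(2)
idioms = ["a piece of cake", "kicked the bucket", "under the weather"]
dirs = {i: rng.normal(size=64) for i in idioms}
queries = {i: 3.0 * d for i, d in dirs.items()}  # scaled copies, hypothetical
keys = dirs

table = qk_dot_table(queries, keys)
diag = np.diag(table)
off = table[~np.eye(len(idioms), dtype=bool)]
# Each idiom's query matches its own key far better than any other's.
assert diag.min() > abs(off).max()
```

A universal idiom direction would instead make every entry large; the diagonal structure is what argues for idiom-specific encoding.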

The implications of these findings extend well beyond idiom processing alone. The identification of specialized attention heads parallels previous discoveries like Induction Heads and Duplicate Token Heads, suggesting that transformers develop functional specialization for various linguistic phenomena. The augmented reception mechanism provides insight into how models can efficiently process information over long contexts by modulating attention based on earlier processing. These discoveries could inform the development of more interpretable models, better evaluation methods for language understanding, and potentially even improvements in how models handle other non-compositional phenomena like grammatical constructions or metaphorical language.

Several limitations warrant consideration in interpreting these findings. The study focused exclusively on English idioms in a single model architecture (Gemma 2-2B), leaving open questions about whether similar mechanisms operate across languages and different transformer architectures. The threshold determination process for circuit identification introduces some subjectivity, though the researchers attempted to mitigate this through systematic sweeps. Additionally, while the study examined eight common idioms, broader investigation across more expressions and linguistic phenomena would help establish the generality of these results. Future work could explore whether similar patterns occur for other non-compositional language features or investigate the relationship between these idiom-processing circuits and other known transformer mechanisms.

Original Source

Read the complete research paper on arXiv.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
