Idioms and figurative expressions like 'spill the beans' or 'in the long run' are everywhere in casual conversation and online writing, but they continue to pose a significant challenge for artificial intelligence. Despite the vast amounts of data used to train modern language models, these non-literal phrases often elude accurate interpretation because their meaning cannot be deduced from the individual words alone. This gap becomes more pressing as informal language, rich with such expressions, increasingly permeates digital platforms and AI-driven chat, making it essential for machines to grasp the nuances of human communication.
The researchers behind this study created new datasets to better evaluate and improve how AI handles idiomatic language. They compiled a list of idioms from existing resources like MAGPIE, FLUTE, EPIE, and LIdioms, then matched these phrases against large text corpora derived from Common Crawl, specifically the OSCAR and C4 corpora. This process yielded one large-scale dataset called PIFL-OSCAR, containing 5.8 million examples and 3,207 unique idioms, along with two smaller, human-annotated datasets named IFL-OSCAR-A and IFL-C4-A. These datasets include not only the sentences containing idioms but also additional linguistic features such as part-of-speech tags, BIO labels for sequence tagging, and cosine similarity metrics computed from BERT embeddings, all intended to aid model training and evaluation.
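To make the BIO-labeling step concrete, here is a minimal sketch of how a lexically matched idiom could be converted into BIO tags. This is an illustration under assumed conventions (the `B-IDIOM`/`I-IDIOM` tag names and exact-match logic are mine, not the authors' pipeline, which would also need to handle inflected forms like "spilled the beans" and compute the POS and cosine-similarity features):

```python
# Hypothetical sketch: derive BIO labels from a lexical idiom match.
def bio_labels(sentence_tokens, idiom_tokens):
    """Tag the first occurrence of idiom_tokens with B-IDIOM / I-IDIOM."""
    labels = ["O"] * len(sentence_tokens)
    n = len(idiom_tokens)
    for i in range(len(sentence_tokens) - n + 1):
        if [t.lower() for t in sentence_tokens[i:i + n]] == idiom_tokens:
            labels[i] = "B-IDIOM"                 # idiom start
            for j in range(i + 1, i + n):
                labels[j] = "I-IDIOM"             # idiom continuation
            break
    return labels

sent = "Do not worry , it will pay off in the long run .".split()
print(list(zip(sent, bio_labels(sent, "in the long run".split()))))
# ... ('in', 'B-IDIOM'), ('the', 'I-IDIOM'), ('long', 'I-IDIOM'), ('run', 'I-IDIOM'), ...
```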
To assess current AI capabilities, the team tested both specialized models and general-purpose large language models (LLMs) on idiom recognition tasks. They used a BERT-based sequence labeling architecture, where the model identifies and tags idiomatic expressions within sentences, and compared it against several open-weight chat LLMs including Gemma-3, Llama-3.1, GPT-OSS, Mistral, Phi-4, and Qwen-2.5. The evaluation focused on metrics like sequence accuracy, precision, recall, and F1 score, with models trained and tested across the new and existing datasets to measure performance and generalizability.
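For readers unfamiliar with this architecture, the following sketch shows what BERT-based sequence labeling looks like in practice using Hugging Face Transformers. The checkpoint, the three-label BIO scheme, and the example sentence are assumptions for illustration, not the paper's exact configuration; the classification head here is untrained, so fine-tuning on PIFL-OSCAR-style data would be the next step:

```python
# Illustrative sketch of BERT token classification for idiom tagging
# (assumed checkpoint and label set, not the authors' exact setup).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-IDIOM", "I-IDIOM"]  # assumed BIO tag set
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)
)  # randomly initialized head; would be fine-tuned before real use

sentence = "She finally spilled the beans about the surprise party."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits           # shape: (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0].tolist()

# Map predictions back onto the word pieces (special tokens included).
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for tok, tag_id in zip(tokens, pred_ids):
    print(f"{tok:>12} -> {labels[tag_id]}")
```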
The results highlight a stark contrast between fine-tuned models and general-purpose LLMs. The BERT-based models achieved strong performance, particularly on the MAGPIE dataset, where they reached a sequence accuracy of 92.97% and set a new state-of-the-art benchmark. In contrast, the chat LLMs performed poorly across the board, with average sequence accuracies as low as 0% on some datasets and frequent labeling errors: models often over-extended tags past the idiom, missed the idiom entirely, or tagged non-idiomatic words. Cross-evaluation showed that models trained on the large PIFL-OSCAR dataset generalized best to other datasets, while those trained on human-annotated data like IFL-OSCAR-A did not transfer as effectively, underscoring the value of diverse, large-scale training data.
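A toy example helps explain why sequence accuracy is so unforgiving for the chat LLMs. Under exact-match scoring, a single over-extended tag fails the entire sentence, even when most tokens are labeled correctly (tag names are the same assumed BIO scheme as above):

```python
# Toy illustration: one over-extended tag zeroes out sequence accuracy
# for that sentence, despite most tokens being correct.
def sequence_accuracy(gold_seqs, pred_seqs):
    """Fraction of sentences whose full BIO sequence matches exactly."""
    exact = sum(g == p for g, p in zip(gold_seqs, pred_seqs))
    return exact / len(gold_seqs)

gold          = [["O", "B-IDIOM", "I-IDIOM", "I-IDIOM", "O"]]
over_extended = [["O", "B-IDIOM", "I-IDIOM", "I-IDIOM", "I-IDIOM"]]  # tag runs past the idiom

print(sequence_accuracy(gold, over_extended))  # 0.0, even with 4/5 tokens correct
```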
This work underscores the ongoing difficulty AI faces with figurative language, an ability that is crucial for applications like machine translation, content moderation, and virtual assistants that need to grasp colloquial speech. The datasets introduced provide a valuable resource for future research, enabling more robust testing and development of models that can better handle idiomatic expressions. Limitations remain, however: the reliance on lexical matching may admit false positives, and the source corpora may contain machine-generated text. The researchers suggest that future improvements could include confidence metrics for idiomaticity or multilingual versions of the datasets to enhance model adaptability across languages.
Original Source
Read the complete research paper
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
Connect on LinkedIn