
AI Struggles to Grasp Humor in News Headlines

A new study shows that AI systems can rate and compare the humor of edited news headlines, but they often miss sarcasm and cultural references, limiting their real-world use.

AI Research
November 14, 2025
4 min read

Humor is a key part of human communication, and as artificial intelligence aims to emulate human intelligence, systems that can recognize and generate humor are an important goal. A recent study introduces the Humor Edited News Headlines task, which challenges AI to assess and predict humor in short, edited news headlines. This research addresses a fundamental gap in AI's ability to understand nuanced human traits, with implications for applications like content generation, chatbots, and social media analysis. For non-technical readers, it shows that while AI is advancing, it still falls short of grasping the subtleties that make us laugh, underscoring the complexity of human-like intelligence.

The key finding from this study is that AI systems can rate the funniness of edited news headlines on a scale from 0 to 3 and predict which of two edited versions is funnier, but their performance is inconsistent. While some systems achieved high accuracy, they often struggled with humor that relies on sarcasm, cultural context, or world knowledge. In one case, an AI system failed to recognize that substituting 'billions' with 'pennies' in a headline about tax savings is humorous because of the ironic contrast, underscoring the limitations of current models.
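The two task formats are closely related: a system that predicts a 0-3 funniness score for each edited headline can answer the pairwise question by simply comparing its two scores. A minimal sketch of this reduction (the scores below are illustrative, not taken from the paper):

```python
def funnier(score_a: float, score_b: float, tie_margin: float = 0.0) -> int:
    """Compare two predicted funniness scores on the 0-3 scale.

    Returns 1 if headline A is predicted funnier, 2 if headline B is,
    and 0 for a tie (difference within tie_margin).
    """
    diff = score_a - score_b
    if abs(diff) <= tie_margin:
        return 0
    return 1 if diff > 0 else 2

# Illustrative predicted scores for two competing edits of one headline
print(funnier(2.1, 0.8))  # edit A predicted funnier
print(funnier(1.0, 1.0))  # tie
```

This also hints at why systems did better when the funniness gap was large: small score differences are within the model's own error, so the comparison becomes a coin flip.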

The methodology involved creating a dataset called Humicroedit, which contains over 15,000 news headlines collected from Reddit subreddits like r/worldnews and r/politics. Human editors on Amazon Mechanical Turk made micro-edits to these headlines by replacing a single word—such as a noun, verb, or entity—to make them funny, and other judges rated the funniness on a 0-3 scale. This approach simplified the analysis by focusing on small, atomic changes, allowing researchers to study humor at a granular level. The task was split into two subtasks: one for regression (predicting the exact funniness score) and another for classification (determining which of two edited headlines is funnier).
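Because each edit replaces exactly one marked word, a dataset record reduces to an original headline plus a substitute word. Assuming an angle-bracket convention for marking the word to replace (e.g. `<billions/>`, a plausible encoding rather than a confirmed detail of the released files), applying an edit is a one-line substitution:

```python
import re

def apply_edit(original: str, edit_word: str) -> str:
    """Replace the single word marked as <word/> with the edit word.

    Assumes the marked-word convention described in the lead-in;
    the real dataset files may encode edits differently.
    """
    return re.sub(r"<[^/>]+/>", edit_word, original)

headline = "Republican tax bill will save Americans <billions/>"
print(apply_edit(headline, "pennies"))
# -> Republican tax bill will save Americans pennies
```

Restricting edits to one atomic substitution is what makes the granular analysis possible: the original and edited headlines differ by exactly one token, so any change in funniness is attributable to that swap.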

Results from the study show that the best-performing system for the regression subtask, developed by Hitachi, achieved a root mean squared error (RMSE) of 0.497, a 13.5% improvement over the baseline that always predicted the average rating. For the classification subtask, the top system reached an accuracy of 67.43%, compared to a baseline accuracy of 49.5%. Analysis of the data, as referenced in Figure 4, indicates that AI systems performed better when the funniness gap between two headlines was larger, but they had higher errors at the extremes of the funniness scale (very funny or not funny at all). For instance, in examples R1 and R2 from the paper, systems underestimated humor in headlines involving cultural references or tension relief, such as replacing 'wrestle' with 'kangaroo' in a news item about arrests.
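The RMSE metric and the average-rating baseline it is measured against are easy to reproduce. A small sketch with made-up funniness scores (not data from the paper):

```python
import math

def rmse(predictions, targets):
    """Root mean squared error between predicted and true scores."""
    n = len(targets)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n)

# Illustrative ground-truth funniness ratings on the 0-3 scale
true_scores = [0.2, 1.4, 2.8, 0.6, 1.0]

# Baseline: always predict the mean rating of the data
mean_rating = sum(true_scores) / len(true_scores)
baseline_preds = [mean_rating] * len(true_scores)

# A hypothetical model's predictions
model_preds = [0.5, 1.2, 2.5, 0.8, 1.1]

print(f"baseline RMSE: {rmse(baseline_preds, true_scores):.3f}")
print(f"model RMSE:    {rmse(model_preds, true_scores):.3f}")
```

Since RMSE squares each error, a model that is badly wrong on a few extreme headlines (the very funny or entirely unfunny ones, where the paper reports higher errors) is penalized disproportionately.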

In a broader context, this research matters because it demonstrates the potential for AI to assist in creative tasks, such as generating humorous content for advertising or entertainment, but also reveals current limitations. For everyday readers, this means that while AI can help filter or enhance content, it is not yet reliable for tasks requiring deep cultural understanding or emotional nuance. The study's large participation—48 teams for one subtask and 31 for the other—highlights the growing interest in computational humor and its applications in improving human-AI interactions.

However, the study acknowledges limitations, including the dataset's quirks, such as low agreement among human judges in some cases and biases from frequent mentions of specific topics like politics. The paper notes that humor is subjective and non-binary, making it difficult to collect reliable data without high costs. Additionally, AI systems struggled with recognizing sarcasm and irony, as seen in examples where substitutions that should have been funny were misclassified. Future work could focus on specific forms of humor, like incongruity or puns, and improve world knowledge in AI models to address these gaps.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn