AI Agents Fail to Mimic Human Memory Limits

A new study shows that even advanced AI models can't replicate the subtle errors humans make when remembering lists, offering a fresh way to detect bots in online research.

April 02, 2026
4 min read

As artificial intelligence becomes more capable, distinguishing human participants from AI bots in online studies is increasingly challenging. Traditional safeguards like logic puzzles or attention checks are now easily solved by large language models (LLMs), threatening the validity of behavioral research. Researchers from Princeton University have turned this problem on its head by probing for a uniquely human trait: the limitations of working memory. Their study demonstrates that even when AI models are explicitly instructed to mimic human cognitive constraints, they still produce detectable anomalies in how they recall information, providing a new tool to ensure online experiments involve real people.

The key finding is that LLMs, despite their advanced capabilities, cannot authentically replicate the error patterns humans exhibit in memory tasks. In a classic serial recall experiment where participants had to remember letters from lists of varying lengths, human participants showed typical memory constraints: they remembered items at the beginning and end of lists better than those in the middle, and their accuracy dropped as list length increased. In contrast, LLMs given a simple instruction to behave like humans achieved near-perfect performance, with a median accuracy of 1.00 across trials, as shown in Figure 2. This stark difference makes it easy to detect such models using a simple accuracy threshold, though the researchers note that more sophisticated AI attempts require deeper analysis.
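To make the threshold idea concrete, here is a minimal sketch in Python of such an accuracy check. It is not the authors' code: the accuracy distributions, sample sizes, and the 0.95 cutoff are illustrative assumptions, not values reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-participant accuracies: humans err noticeably,
# while naively prompted LLMs cluster near 1.00 (median accuracy 1.00).
human_acc = rng.beta(8, 3, size=100)   # 100 human participants
llm_acc = rng.beta(200, 1, size=55)    # 55 LLM participants

THRESHOLD = 0.95  # assumed cutoff; the paper does not report an exact value

print("humans flagged as AI:", (human_acc >= THRESHOLD).mean())
print("LLMs flagged as AI:  ", (llm_acc >= THRESHOLD).mean())
```

A hard cutoff like this only catches naively prompted models; the more human-like simulations described next require the model-based scoring covered later in the article.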

The methodology involved comparing 100 online human participants recruited through Prolific with 55 LLM participants from models like GPT-5, Gemini, and Claude. The task was a probed recall working memory paradigm, where participants saw sequences of letters and had to recall specific ones based on position or successor probes, as illustrated in Figure 1. To test the robustness of detection, the researchers ran two simulations: one with LLMs given a basic prompt to act human, and another with LLMs given detailed instructions to mimic working memory limitations or finetuned on human psychology data, such as the Centaur model. They also implemented measures like AI self-reports and language-based questions to ensure human participants weren't using AI, though none were detected.
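For readers unfamiliar with the paradigm, a hypothetical generator for this kind of probed recall trial might look like the sketch below. The list lengths, probe wording, and function names are assumptions for illustration, not the study's actual stimuli or timing.

```python
import random
import string

def make_trial(length: int, probe_type: str = "position"):
    """Build one probed-recall trial: a letter list plus a single probe."""
    letters = random.sample(string.ascii_uppercase, length)
    if probe_type == "position":
        idx = random.randrange(length)
        probe = f"What was the letter in position {idx + 1}?"
        answer = letters[idx]
    else:  # successor probe: which letter followed a given one?
        idx = random.randrange(length - 1)
        probe = f"Which letter came immediately after {letters[idx]}?"
        answer = letters[idx + 1]
    return letters, probe, answer

stimulus, probe, answer = make_trial(length=7, probe_type="successor")
print(stimulus, "|", probe, "->", answer)
```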

Results from the study, detailed in Figures 2 and 3, reveal that LLMs instructed to mimic working memory constraints (LLM-WM) or finetuned on human data (Centaur) could qualitatively reproduce human-like patterns, such as primacy and recency effects, but their quantitative profiles differed. Using a hierarchical Bayesian logistic regression model, the researchers analyzed working memory profiles and found that human participants had lower baseline memory capacity and stronger negative load effects compared to LLMs. For anomaly detection, they trained the model on a subset of human data and scored new participants based on log predictive densities. As shown in the ROC curve in Figure 3, this approach could detect LLM-WM and Centaur participants with an AUROC varying by model, as listed in Table 1, though it required a trade-off between false positives and false negatives.
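As a rough illustration of this detection pipeline, the sketch below swaps the paper's hierarchical Bayesian logistic regression for a plain scikit-learn logistic regression of recall on list load, fits it on a subset of human data, scores each participant by a pointwise log predictive density, and computes an AUROC. All data, effect sizes, and trial counts here are synthetic assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def simulate(n_participants, load_slope, n_trials=40):
    """Synthetic per-trial outcomes: P(correct) falls as list load grows."""
    load = rng.integers(3, 10, size=(n_participants, n_trials))
    p_correct = 1 / (1 + np.exp(-(4.0 + load_slope * load)))
    return load, rng.random((n_participants, n_trials)) < p_correct

# Assumed effect sizes: humans show a strong negative load effect,
# WM-mimicking LLMs a weaker one (qualitatively echoing Figures 2-3).
h_load, h_correct = simulate(100, load_slope=-0.55)
l_load, l_correct = simulate(55, load_slope=-0.15)

# Train the reference model on a subset of human data only.
model = LogisticRegression().fit(h_load[:50].reshape(-1, 1),
                                 h_correct[:50].ravel())

def anomaly_score(load, correct):
    """Negative pointwise log predictive density; higher = less human-like."""
    p = model.predict_proba(load.reshape(-1, 1))[:, 1]
    p = np.clip(p, 1e-9, 1 - 1e-9)
    ll = np.where(correct.ravel(), np.log(p), np.log(1 - p))
    return -ll.reshape(load.shape).sum(axis=1)

scores = np.concatenate([anomaly_score(h_load[50:], h_correct[50:]),  # held-out humans
                         anomaly_score(l_load, l_correct)])           # LLM participants
labels = np.concatenate([np.zeros(50), np.ones(55)])                  # 1 = LLM
print("AUROC:", roc_auc_score(labels, scores))
```

Choosing where to threshold these scores is exactly the false-positive versus false-negative trade-off the ROC curve in Figure 3 describes.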

The implications of this research extend beyond academic studies to the broader societal challenge of verifying humanness online. As LLMs become more integrated into digital interactions, institutions that rely on behavioral cues to confirm human presence may need to adopt detection methods grounded in cognitive science. The study suggests that probing established human constraints, like working memory limits, offers a viable path forward, especially when combined with multiple measures to avoid bias. However, the authors caution that as AI models are trained on more human psychology data, detection may become harder, potentially incentivizing the development of more human-like AI, which could itself be a scientific achievement.

Limitations of the approach include its reliance on specific task instantiations that could be circumvented if AI models are trained on similar data. The researchers note that varying stimulus timing or probe types could help maintain detection effectiveness, but this requires collecting new human data and keeping it private to prevent AI adaptation. Additionally, the study did not use reaction times for detection, owing to the variability of AI response times and their susceptibility to artificial manipulation. Future work could explore other cognitive domains, though the independence assumption in the Bayesian model may limit anomaly detection power; notably, the joint predictive density was less effective than pointwise measures in this case.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn