AI Models Struggle with Middle Memories

Large language models like ChatGPT have a surprising memory quirk: they recall information from the beginning and end of an input far more reliably than information from the middle. This phenomenon, reported by researchers at Indiana University, mirrors the 'primacy and recency' effects seen in human memory, where we recall the first and last items of a list best and forget much of what came in between.

The research team found that both traditional transformer models and newer state-space models consistently show stronger recall for information presented at the start and end of input sequences. When prompted with sequences containing repeated tokens, the models showed a clear preference for predicting the tokens that immediately followed the first and last occurrences, while struggling with occurrences in the middle. This pattern held across the seven AI models tested, which included Llama, Mistral, and Mamba architectures.

To isolate this temporal bias from semantic content, the researchers used carefully constructed prompts with repeated tokens interspersed with random sequences. By systematically varying the number of repetitions and spacing between tokens while analyzing next-token prediction probabilities, they could measure exactly how position affects recall. The methodology ensured that any observed effects came purely from temporal positioning rather than meaning or context.
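As a rough illustration of this kind of prompt design, the Python sketch below builds a sequence in which one repeated token appears several times, each time followed by a distinct "follower" token and a run of random filler. The function name, the use of integer token ids, the fixed gap length, and the choice to end the prompt with the repeated token as a probe are all assumptions made for the example, not details taken from the study.

```python
import random


def build_repeated_token_prompt(vocab_size, n_repeats, gap_len, seed=0):
    """Return (prompt, followers, repeated) for a toy repeated-token prompt.

    Hypothetical layout (an assumption, not the study's exact format):
    [R, f0, filler..., R, f1, filler..., ..., R]
    where R is the repeated token, f_i is the token that follows the i-th
    occurrence, and the final R is the probe whose next token is predicted.
    """
    rng = random.Random(seed)
    # Draw distinct ids so the repeated token, followers, and fillers never collide.
    n_needed = 1 + n_repeats + n_repeats * gap_len
    ids = rng.sample(range(vocab_size), n_needed)
    repeated = ids[0]
    followers = ids[1:1 + n_repeats]
    fillers = ids[1 + n_repeats:]

    prompt = []
    for i in range(n_repeats):
        prompt.append(repeated)
        prompt.append(followers[i])
        prompt.extend(fillers[i * gap_len:(i + 1) * gap_len])
    prompt.append(repeated)  # probe: which follower does the model predict next?
    return prompt, followers, repeated
```

Varying `n_repeats` and `gap_len` changes the number of repetitions and the spacing between them, the two factors the paragraph above describes the researchers as varying.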

The data revealed striking patterns. Figure 2 shows clear peaks in prediction probability for the tokens that immediately followed the first and last occurrences of a repeated token. The strength of this 'serial recall' (the tendency to predict the token that originally followed each occurrence) varied by position, with beginning and end positions consistently outperforming middle positions. Figure 3 further shows how this positional bias changes with the number of repetitions: some models, like Mistral, favor recent information, while others, like Falcon-Mamba, favor early information.
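To make the serial recall measure concrete, here is a minimal Python sketch that reads off the probability a model assigns to each occurrence's follower token and then compares the first, middle, and last positions. It assumes the next-token probabilities are already available as a mapping from token id to probability; both function names are hypothetical, and this is one plausible way to summarize the effect rather than the study's own analysis code.

```python
def serial_recall_by_position(next_token_probs, followers):
    """Probability assigned to each occurrence's follower, in order of occurrence.

    next_token_probs: mapping (dict or list) from token id to the model's
    next-token probability at the probe position.
    followers: follower token ids, earliest occurrence first.
    """
    return [next_token_probs[tok] for tok in followers]


def positional_summary(recall):
    """Collapse per-occurrence recall into first, mean-of-middle, and last."""
    if len(recall) < 3:
        raise ValueError("need at least three occurrences to have a middle")
    middle = recall[1:-1]
    return {
        "first": recall[0],
        "middle": sum(middle) / len(middle),
        "last": recall[-1],
    }
```

A primacy and recency pattern would appear as "first" and "last" values that both exceed "middle".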

In a second experiment testing episodic memory, the researchers examined whether models could distinguish between different 'episodes': sequences that shared some tokens but carried unique context markers. Most models identified the correct target when probed with a specific context token, but retrieval was strongest for episodes near the end of the prompt, a clear recency bias. Mamba and Falcon-Mamba showed less robust retrieval, particularly for earlier episodes.

The researchers also conducted ablation studies to understand the mechanisms behind these effects. By selectively disabling specific components called 'induction heads' in transformer models, they found that these components are crucial for the serial recall behavior. When the top induction heads were disabled, the probability peaks corresponding to correct token predictions dropped significantly, in some cases by more than 9 × 10^-5 relative to ablating randomly chosen heads. This indicates that induction heads, known for pattern matching in in-context learning, play a key role in temporal processing.
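For readers unfamiliar with induction heads, the Python sketch below shows one common way to score how "induction-like" an attention head is: for every position whose token has appeared earlier, it sums the attention placed on the token that came right after that earlier occurrence. The functions, the plain nested-list attention format, and the ranking helper are illustrative assumptions; the study's actual head-selection and ablation procedure may differ.

```python
def induction_score(tokens, attention):
    """Average attention from repeated tokens to the successors of their
    earlier occurrences.

    tokens: list of token ids.
    attention: square nested list where attention[q][k] is the weight that
    query position q places on key position k (one head, causal).
    """
    scores = []
    for q, tok in enumerate(tokens):
        # Successor positions of earlier occurrences of the same token,
        # excluding the query position itself.
        targets = [p + 1 for p in range(q) if tokens[p] == tok and p + 1 < q]
        if targets:
            scores.append(sum(attention[q][t] for t in targets))
    return sum(scores) / len(scores) if scores else 0.0


def top_heads(scores_by_head, k):
    """Names of the k heads with the highest induction score."""
    return sorted(scores_by_head, key=scores_by_head.get, reverse=True)[:k]
```

In an ablation of the kind described above, the heads returned by `top_heads` would be the ones disabled, with an equal number of randomly chosen heads disabled as a control.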

These findings have important implications for how we use AI systems in practice. The 'lost middle' effect means that in long conversations or documents, critical information presented in the middle may be retrieved less reliably by AI assistants. This could affect applications ranging from legal document analysis to medical record processing, where complete context matters. The research also suggests that simply switching to newer architectures such as state-space models will not necessarily solve this limitation.

The study's main limitation is its use of simplified prompts built from random tokens rather than natural language. This was necessary to isolate temporal effects, but it does not capture the full complexity of real-world language processing. Future research will need to explore how these temporal biases interact with semantic content in more natural contexts.

About the Author

Guilherme A.