
AI Fixes a Key Flaw in Language Model Text Understanding

A new method called Hierarchical Token Prepending allows AI to better grasp long documents by creating multiple summary pathways, improving retrieval and analysis without extra training.

AI Research
March 26, 2026
4 min read

Large language models like GPT and Llama have transformed how we interact with text, powering everything from chatbots to search engines. However, a fundamental design quirk limits their ability to produce high-quality text embeddings—numerical representations of text used for tasks like document retrieval and clustering. These models are built for generating text one word at a time, which restricts the flow of information from later parts of a document back to earlier ones. This 'backward flow' problem means earlier tokens cannot integrate context from later positions, degrading the quality of the overall document representation. As a result, while these models excel at generation, their embeddings often underperform in applications requiring a holistic understanding of long texts, such as finding relevant research papers or summarizing complex reports.
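The root cause is the causal attention mask used by decoder-only models. A minimal sketch of that mask (not any specific model's code) makes the asymmetry concrete: token i may only attend to positions at or before i, so the first tokens can never see later context.

```python
import numpy as np

# Toy illustration of the causal attention mask in decoder-only LLMs.
# Entry [i, j] is True when token i is allowed to attend to token j.
seq_len = 5
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask.astype(int))
# Row 0 (the first token) sees only itself: it can never incorporate
# context from tokens 1..4. That one-way flow is the "backward flow"
# problem that degrades embeddings built from these models.
```

The last token is the only position that sees everything, which is why last-token pooling became the default readout for embeddings, and why it bottlenecks long documents.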

Researchers from Carnegie Mellon University, University of Oxford, and Snap Inc. have developed a solution called Hierarchical Token Prepending (HTP), which directly addresses this limitation. HTP enables better backward information flow without requiring any retraining of the model, preserving its zero-shot capabilities. The key insight is that existing methods, such as Token Prepending (TP), which prepend a single summary token to the input, over-compress information, especially in long documents. HTP replaces this single token with a hierarchy of block-level summary tokens, creating multiple pathways for information to travel backward through the text. This approach mitigates two critical bottlenecks: an attention-level bottleneck, where a single token must summarize the entire document, and a readout-level bottleneck, where the final embedding is taken solely from the last token, leading to an over-squashed representation.

The methodology behind HTP is elegantly simple and operates in three main stages, as illustrated in Figure 4 of the paper. First, the input text is partitioned into semantic blocks, typically sentences, and placeholder summary tokens are inserted. Second, a local prepending step dynamically copies the hidden state of each block's final token to a corresponding local summary token between Transformer layers, creating sentence-level summaries. Third, a global prepending step propagates these local summaries to a block of global summary tokens at the beginning of the sequence, making them accessible to all tokens. This hierarchical structure allows any token to attend to summaries of all subsequent sentences, enabling comprehensive document-level backward flow. Additionally, HTP uses mean-pooling—averaging all token representations—instead of last-token pooling for the final embedding, which theoretical analysis shows is more robust to over-squashing, especially in long contexts.
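The three stages can be sketched on dummy hidden states. This is an illustrative toy, not the paper's implementation: the fixed block size, the `htp_sketch` helper, and the use of raw numpy arrays in place of real Transformer activations are all assumptions for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def htp_sketch(hidden, block_size):
    """hidden: (seq_len, dim) token states from some Transformer layer."""
    seq_len, dim = hidden.shape
    # Stage 1: partition the sequence into blocks (fixed-size here;
    # the paper uses semantic units such as sentences).
    blocks = [hidden[i:i + block_size] for i in range(0, seq_len, block_size)]
    # Stage 2: local prepending -- take each block's summary from the
    # hidden state of its final token (copied into a placeholder slot
    # between layers in the real method).
    local_summaries = np.stack([b[-1] for b in blocks])
    # Stage 3: global prepending -- expose all block summaries at the
    # front of the sequence, where every token can attend to them.
    augmented = np.concatenate([local_summaries, hidden], axis=0)
    # Readout: mean-pool over all positions instead of last-token
    # pooling, which is less prone to over-squashing on long inputs.
    return augmented.mean(axis=0)

embedding = htp_sketch(rng.normal(size=(12, 8)), block_size=4)
print(embedding.shape)  # (8,)
```

The point of the hierarchy is visible even in this toy: instead of one summary token carrying the whole document, each block contributes its own summary, and the final embedding averages over everything rather than trusting a single position.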

Extensive experiments across 11 retrieval datasets and 30 general embedding benchmarks demonstrate HTP's effectiveness. In retrieval tasks, HTP consistently outperformed or matched other training-free methods such as Echo Embedding and Token Prepending. For example, on the BEIR benchmark with models like Mistral-7B and Gemma2-9B, HTP achieved top or runner-up NDCG@10 scores across datasets such as ArguAna, SciFact, and TREC-COVID, as shown in Table 1. In long-context settings from the LongEmbed benchmark, HTP showed strong performance at lengths up to 8192 tokens, outperforming other methods and scaling better with document length, as detailed in Table 2 and Figure 5. HTP also improved general embedding tasks like classification and clustering, though it lagged in sentence similarity tasks, where fine-grained comparisons benefit from single-token summaries. Ablation studies revealed that the optimal block size (K) depends on document length, with smaller blocks better for short texts and larger blocks for long documents, as seen in Figure 6.

The implications of HTP are significant for real-world applications where AI needs to process and understand lengthy documents efficiently. By enhancing backward information flow without additional training, HTP makes powerful generative models more effective as universal text encoders for tasks like legal document analysis, academic research retrieval, and content recommendation. It offers a scalable route to superior long-document embeddings, potentially reducing the need for specialized finetuned models in some scenarios. Moreover, HTP's ability to boost performance even in finetuned models like NV-Embed-v2, as shown in Table 4, suggests its utility is orthogonal to existing training approaches, providing a versatile tool for future AI systems. This advancement could lead to more accurate search engines, better automated summarization, and improved AI assistants that grasp nuanced context in extended texts.

Despite its strengths, HTP has limitations. Its performance may not surpass models extensively finetuned for specific retrieval tasks, and further investigation is needed into its interaction with diverse model architectures and training paradigms. The paper notes that while HTP enhances zero-shot embeddings, it is not expected to outperform dedicated retrieval models, and deeper exploration of backward dependency mechanisms is warranted. Additionally, HTP's effectiveness varies with document length and task granularity, requiring careful tuning of hyperparameters like block size. These limitations highlight areas for future work, but HTP represents a crucial step forward in making large language models more adept at understanding and representing complex textual information.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.


Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn