Higher-order Linear Attention

TL;DR

A new method lets AI models handle long sequences with constant memory while keeping complex data relationships intact, removing a key bottleneck.

The AI systems that power today's language models face a critical limitation: the cost of processing a sequence grows quadratically with its length. This computational bottleneck has kept AI applications from efficiently analyzing lengthy documents, conducting extended conversations, or processing continuous data streams. A new approach called Higher-order Linear Attention (HLA) addresses this challenge by reimagining how AI systems attend to information.

Researchers have developed a mathematical framework that enables AI models to process sequences with constant memory requirements regardless of length, while maintaining the complex data relationships that make modern attention mechanisms powerful. The key innovation lies in using compact statistics (specifically second- and third-order moments) to capture rich interactions between data elements without constructing expensive intermediate matrices.

In traditional transformer attention, the computational cost scales with the square of the sequence length n (O(n²)), making long sequences prohibitively expensive. HLA achieves linear-time complexity (O(n)) by maintaining running summaries of data relationships. For second-order attention, the system tracks three core statistics: a key-key moment matrix (S^K), a query-value accumulator (C^QV), and a query-mass vector (m^Q). These statistics are updated incrementally as each new token arrives, requiring only O(d² + d·dᵥ) operations per token, where d is the query/key dimension and dᵥ is the value dimension.
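As a rough illustration of the running summaries described above, the following NumPy sketch keeps the three statistics for a single attention head and updates them one token at a time. It covers only the unmasked, unnormalized form. The class name `SecondOrderState`, the helper `stream`, and the choice of the query mass as a d-dimensional sum of queries are assumptions made for this example, not details taken from the researchers' code.

```python
import numpy as np


class SecondOrderState:
    """Running summaries for one attention head (unmasked, unnormalized sketch)."""

    def __init__(self, d, dv):
        self.S = np.zeros((d, d))    # key-key moment: sum of k k^T
        self.C = np.zeros((d, dv))   # query-value accumulator: sum of q v^T
        self.m = np.zeros(d)         # query mass: sum of q (assumed to be a vector)

    def update(self, q, k, v):
        self.S += np.outer(k, k)     # O(d^2)
        self.C += np.outer(q, v)     # O(d * dv)
        self.m += q                  # O(d)

    def output(self, q):
        # (q^T S) costs O(d^2); multiplying the result by C costs O(d * dv).
        return (q @ self.S) @ self.C


def stream(Q, K, V):
    """Process tokens one by one; the state size never depends on the length n."""
    state = SecondOrderState(Q.shape[1], V.shape[1])
    outs = []
    for q, k, v in zip(Q, K, V):
        state.update(q, k, v)
        outs.append(state.output(q))
    return np.stack(outs), state
```

The state holds d² + d·dᵥ + d numbers however many tokens have been seen, which is where the constant-memory claim comes from. The query mass `m` is tracked here but not used by `output`; it would only matter for an optional normalization step.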

The method ensures causal processing, meaning each output depends only on the current and earlier inputs, through carefully designed masking operations. The researchers proved mathematical identities that guarantee the masked versions exactly match the desired causal behavior. For training efficiency, they developed a chunk-parallel scheme based on associative scans that processes multiple tokens simultaneously while producing results identical to sequential processing.
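The sketch below illustrates, under stated assumptions, how a masked second-order computation can be matched exactly by a streaming one, and how two chunk summaries can be combined associatively. It assumes one natural reading of the causal mask (a lower-triangular mask applied to the first-order scores and again to their product), and it introduces a correction term `G` alongside the two matrix summaries. The function names and the exact form of `G` are illustrative choices for this example, not the paper's notation.

```python
import numpy as np


def causal_reference(Q, K, V):
    """Quadratic-time reference that builds n x n matrices explicitly."""
    n = Q.shape[0]
    L = np.tril(np.ones((n, n)))      # causal mask, diagonal included
    A = L * (Q @ K.T)                 # masked first-order scores
    return (L * (A @ A.T)) @ V        # masked second-order scores applied to V


def step(state, q, k, v):
    """Advance the summaries (S, C, G) by one token."""
    S, C, G = state
    G = G + np.outer(k, k @ C)        # k k^T C_{t-1}: uses C before this token
    S = S + np.outer(k, k)
    C = C + np.outer(q, v)
    return S, C, G


def causal_stream(Q, K, V):
    """Linear-time version: no n x n matrix is ever formed."""
    d, dv = Q.shape[1], V.shape[1]
    state = (np.zeros((d, d)), np.zeros((d, dv)), np.zeros((d, dv)))
    outs = []
    for q, k, v in zip(Q, K, V):
        state = step(state, q, k, v)
        S, C, G = state
        outs.append((q @ S) @ C - q @ G)   # subtract the non-causal cross terms
    return np.stack(outs), state


def merge(a, b):
    """Combine summaries of chunk a followed by chunk b (associative)."""
    Sa, Ca, Ga = a
    Sb, Cb, Gb = b
    # Keys in b also interact with every query-value pair accumulated in a.
    return Sa + Sb, Ca + Cb, Ga + Gb + Sb @ Ca
```

Because `merge` is associative, chunk summaries can in principle be combined in any grouping, for example with a parallel prefix scan, and still agree with the token-by-token loop. On small random inputs, `causal_stream` and `causal_reference` should agree up to floating-point error.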

By construction, second-order HLA maintains a constant-size state per attention head and computes outputs in time linear in the sequence length. The approach handles sequences of arbitrary length without materializing any n×n matrices, a critical advantage for long-context applications. The system also supports optional normalization and decay mechanisms that improve numerical stability and add a recency bias.
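To make the optional normalization and decay concrete, here is a minimal sketch for the unmasked case. It assumes decay is a single scalar factor `gamma` applied to every summary before each update, and that normalization divides by the same second-order weights summed over a running query mass. Both are plausible forms chosen for illustration; the parameter names `gamma` and `eps` and the function name are hypothetical.

```python
import numpy as np


def hla_decay_normalized(Q, K, V, gamma=0.95, eps=1e-6):
    """Unmasked second-order outputs with exponential decay and normalization."""
    d, dv = Q.shape[1], V.shape[1]
    S = np.zeros((d, d))     # decayed sum of k k^T
    C = np.zeros((d, dv))    # decayed sum of q v^T
    m = np.zeros(d)          # decayed sum of q
    outs = []
    for q, k, v in zip(Q, K, V):
        # Older tokens are down-weighted by gamma at every step (recency bias).
        S = gamma * S + np.outer(k, k)
        C = gamma * C + np.outer(q, v)
        m = gamma * m + q
        u = q @ S                  # shared by numerator and denominator
        num = u @ C                # weighted sum of values
        den = u @ m + eps          # sum of the same weights, plus a small constant
        outs.append(num / den)
    return np.stack(outs)
```

With this form the output is a weighted average of the values, so a constant value sequence is returned unchanged when `eps` is zero. The denominator is not guaranteed to be positive for arbitrary real-valued queries and keys, so a practical version would likely need non-negative feature maps or another safeguard.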

The implications extend beyond theoretical interest. This approach enables AI systems to process streaming data, lengthy documents, and extended conversations with fixed memory requirements, making deployment on resource-constrained devices more feasible. It provides a principled alternative to approximate methods that sacrifice expressivity for efficiency, offering exact computation of higher-order interactions.

The current implementation focuses on algorithmic structure rather than application-specific optimizations. While the mathematical foundations are established for second and third-order attention, practical deployment would require integration with existing AI infrastructure and validation across diverse tasks. The method represents a building block rather than a complete solution, leaving room for future work on hardware optimization and application-specific refinements.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
