
AI Adapts Its Own Thinking Speed for Better Answers

A new method lets AI models decide how many words to process at once, improving accuracy in reasoning and coding tasks without extra training—and it works by reading the model's own attention patterns.

AI Research
April 01, 2026
3 min read

Diffusion language models, a type of AI that generates text through iterative refinement, offer a promising alternative to traditional autoregressive models by allowing parallel updates of multiple tokens. However, a critical challenge has been determining how many tokens to update simultaneously, a decision that balances speed against accuracy: small updates slow down the process, while large ones risk producing inconsistent or incorrect text. Researchers from Xi'an Jiaotong University have developed GeoBlock, a framework that addresses this by letting the model dynamically adjust its update granularity based on the internal dependency structure of the text, yielding more reliable outputs without requiring additional training.

The key idea is that GeoBlock infers block boundaries (the size of the token groups updated in parallel) directly from the model's attention patterns during decoding. Instead of relying on fixed schedules or heuristic signals such as token confidence, GeoBlock analyzes cross-token dependency geometry to identify regions where tokens can be safely refined together. This ensures that updates respect the logical and semantic relationships within the text, such as causal ordering in reasoning tasks or cohesive spans in descriptions. The method consistently improves accuracy across benchmarks like GSM8K for math reasoning and HumanEval for code generation, with only a modest increase in computational cost, typically around 11% more function evaluations (NFE).
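To make the underlying quantities concrete, the "attention mass" between token sets can be sketched in a few lines of NumPy. This is an illustrative reading of the idea, not code from the paper; the function name, the toy attention matrix, and the index sets are all assumptions for demonstration.

```python
import numpy as np

def attention_mass(attn, src_idx, dst_idx):
    """Total attention weight flowing from tokens in src_idx to tokens in dst_idx.

    attn: (seq_len, seq_len) attention matrix for one head; rows are queries
    and are normalized to sum to 1.
    """
    return attn[np.ix_(src_idx, dst_idx)].sum()

# Toy attention matrix over 6 tokens, rows normalized like softmax output.
rng = np.random.default_rng(0)
attn = rng.random((6, 6))
attn /= attn.sum(axis=1, keepdims=True)

block = [2, 3]    # candidate region to refine in parallel
past = [0, 1]     # already-resolved tokens
future = [4, 5]   # still-unresolved tokens

internal = attention_mass(attn, block, block)    # internal coupling
anchoring = attention_mass(attn, block, past)    # past anchoring
leakage = attention_mass(attn, block, future)    # future leakage
```

Since each row of the matrix sums to 1, the three masses for a two-token block always sum to 2; what distinguishes a good block is how that mass is distributed, with high internal coupling and low future leakage signaling a region that can be refined together.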

The methodology is a training-free decoding framework that formulates block selection as a boundary-inference problem. At each refinement step, GeoBlock examines the self-attention matrix produced by the model, which serves as a proxy for dependency structure. It computes a closure score for each candidate region by balancing three quantities derived from the attention mass between token sets: internal coupling, past anchoring, and future leakage. For example, strong internal coupling and minimal dependence on unresolved future tokens indicate a region suitable for parallel updates. The framework aggregates observations from multiple attention layers and heads, using a weighted sum to form a unified dependency estimate, and selects the largest block whose score remains within a tolerance of the maximum, as detailed in Algorithm 1 of the paper.
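A minimal sketch of this selection step, assuming a single attention matrix already aggregated over layers and heads, and simplified definitions of the score terms. The exact formula, normalization, and aggregation weights live in the paper's Algorithm 1 and are not reproduced here; `alpha` and `delta` merely echo the roles of the paper's anchoring coefficient and tolerance.

```python
import numpy as np

def closure_score(attn, start, size, alpha=0.5):
    """Illustrative score for refining tokens [start, start+size) in parallel.

    Rewards internal coupling and past anchoring (weighted by alpha) and
    penalizes attention mass leaking to unresolved future tokens.
    """
    blk = slice(start, start + size)
    internal = attn[blk, blk].sum()           # coupling within the block
    anchoring = attn[blk, :start].sum()       # reliance on resolved context
    leakage = attn[blk, start + size:].sum()  # reliance on future tokens
    return (internal + alpha * anchoring - leakage) / size

def select_block(attn, start, max_size, alpha=0.5, delta=0.1):
    """Pick the largest block whose score is within delta of the best score."""
    sizes = range(1, max_size + 1)
    scores = {s: closure_score(attn, start, s, alpha) for s in sizes}
    best = max(scores.values())
    return max(s for s in sizes if scores[s] >= best - delta)
```

Choosing the largest block within a tolerance of the maximum, rather than the single top-scoring size, is what biases the decoder toward wider parallel updates whenever the dependency structure permits them.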

Extensive experiments demonstrate GeoBlock's effectiveness. On models like Dream-7B and LLaDA-8B, GeoBlock achieved the highest or comparable accuracy on benchmarks including GSM8K, MATH, IFEval, HumanEval, and MBPP, often outperforming baselines such as dynamic confidence-based decoding. For instance, on GSM8K with LLaDA-8B, GeoBlock reached 81.88% accuracy at 100.05 NFE, versus 82.11% at 88.59 NFE for dynamic decoding at matched settings, a comparable accuracy-efficiency trade-off. The average inferred block length ranged from 13 to 19 tokens, indicating adaptive granularity, and the additional computational overhead was modest, with extra NFE ratios mostly between 7% and 15%. Ablation studies confirmed that components such as the anchoring coefficient and the right-shift tolerance are crucial for optimal performance, with values of α = 0.5 and δ = 0.1 yielding the best accuracy-efficiency balance.

The implications of this research are significant for practical AI applications: it enables more efficient and reliable text generation in areas like automated reasoning, code synthesis, and long-form content creation. By adapting update granularity to dependency geometry, GeoBlock reduces the risk of errors in complex tasks without sacrificing speed, making diffusion models more viable for real-world use. Its training-free nature means it can be integrated seamlessly into existing systems, potentially enhancing tools for education, programming assistance, and content generation. However, the paper notes limitations, such as the reliance on attention patterns that may not fully capture all dependencies, and the need for further exploration in domains beyond the tested benchmarks. Future work could extend the approach to other model architectures or to more complex linguistic structures.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn