
AI Research
March 26, 2026
4 min read
AI Models Are Learning to Rewrite Their Own Thoughts for Better Reasoning

In the relentless pursuit of more capable artificial intelligence, researchers have long focused on scaling up models and their computational power. Large reasoning models (LRMs) that leverage reinforcement learning (RL) with rewards for correct final answers have shown impressive success on complex tasks. However, a critical flaw has emerged: this one-sided focus on outcome correctness provides no detailed supervision over the internal reasoning process itself. This deficiency leads to what researchers term "internal reasoning flaws"—problems like over-thinking trivial details, under-thinking complex aspects, redundant repetition of ideas, and disordered, incoherent thought sequences. These flaws compromise interpretability, waste computational resources, and can even degrade final performance. A new paper from researchers at Beijing Institute of Technology, Zhejiang University, and ByteDance introduces a novel solution: teaching AI models to rewrite and improve their own reasoning texts, a technique they call "self-rewriting."

The core methodology, detailed in the arXiv preprint "Incorporating Self-Rewriting into Language Model Reasoning Reinforcement," builds upon the established GRPO (Group Relative Policy Optimization) RL framework. Instead of merely rewarding a correct final answer, the self-rewriting framework instructs the model to act as its own editor. For a given query, the model first generates its standard reasoning passage. Then, in a second pass, it rewrites that passage with the goal of enhancing overall quality—making it more organized, coherent, and accurate while preserving its core ideas. The model is then prompted to generate a final answer from this refined reasoning. Crucially, the learning algorithm gives higher reward to these rewritten, correct responses, incentivizing the model to internalize the qualities of better reasoning. To maintain efficiency and not disrupt learning on harder problems, the system employs "selective rewriting," only applying this process to "simple" queries where the model's initial answers are already consistently correct.
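To make the control flow above concrete, here is a minimal Python sketch of selective self-rewriting as reward shaping. Everything here is an illustrative assumption rather than the paper's actual implementation: the `rewrite_fn` and `check_fn` callables stand in for model calls and answer verification, and the "simple query" threshold and reward bonus values are placeholders.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    """One sampled rollout: a reasoning passage plus its final answer."""
    reasoning: str
    answer: str

def self_rewriting_rewards(traces, rewrite_fn, check_fn,
                           simple_threshold=1.0, bonus=0.5):
    """Sketch of selective self-rewriting reward shaping for one query.

    traces:     group of sampled rollouts (as in GRPO group sampling)
    rewrite_fn: stand-in for the model rewriting a reasoning passage;
                returns (rewritten_reasoning, new_final_answer)
    check_fn:   stand-in for answer verification; True if correct
    The threshold and bonus values are illustrative assumptions.
    """
    # Base outcome reward: 1 for a correct final answer, else 0.
    rewards = [1.0 if check_fn(t.answer) else 0.0 for t in traces]

    # Selective rewriting: only triggered on "simple" queries, i.e.
    # those the model already answers consistently correctly.
    if sum(rewards) / len(rewards) >= simple_threshold:
        for i, t in enumerate(traces):
            _rewritten, new_answer = rewrite_fn(t.reasoning)
            if check_fn(new_answer):
                # Rewritten responses that stay correct earn a higher
                # reward, nudging the policy toward cleaner reasoning.
                rewards[i] += bonus
    return rewards
```

In this sketch, a hard query (where some group members are wrong) skips the rewriting pass entirely, so training on difficult problems proceeds as in plain outcome-reward GRPO.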

The results from extensive experiments are compelling. The team tested their approach on diverse reasoning tasks—mathematics (MATH-500), science (GPQA-Diamond), logic (ARC-Challenge), and knowledge (MMLU-Pro)—using Qwen3 models of varying sizes (1.7B, 4B, and 8B parameters). In the critical trade-off between accuracy and reasoning length, self-rewriting achieved superior performance. For instance, the Qwen3-8B model with self-rewriting maintained high accuracy (89.2% average) while slashing reasoning length by 46% compared to the original model, outperforming strong baselines like explicit length-penalty methods. More importantly, evaluation using powerful LLMs as judges to score internal reasoning quality showed self-rewriting achieved significantly higher scores (+7.2 points on average for Qwen3-8B), successfully mitigating the identified flaws of over-, under-, redundant-, and disordered-thinking.

The implications of this work are significant for the future of AI development. First, it demonstrates a path beyond simple outcome-based training toward models that learn to produce higher-quality, more human-like reasoning traces, which is vital for interpretability and trust. Second, by generating more concise reasoning without explicit instructions to do so, the method offers a path to greater computational efficiency during inference, a major concern for deploying powerful models. The framework's flexibility is also key; while this research used a general "improve quality" prompt, the approach could be adapted with targeted instructions to produce reasoning with specific styles or for particular applications, from educational tools to technical analysis.

Despite its promise, the approach has limitations. The current evaluation of internal reasoning quality, while innovative, relies on "LLM-as-a-judge" metrics, which, though powerful, are an imperfect proxy for human assessment. The method also introduces roughly a 10% computational overhead during training, though the authors argue this is acceptable for post-tuning. Furthermore, the research focuses on general rewriting; the exploration of how specialized, application-specific rewriting prompts affect performance is left for future work. Nonetheless, by extending the paradigm of self-rewarding AI, this research marks a substantive step toward models that can not only solve problems but also learn to refine and improve the very process by which they think.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn