
AI Reasoning Gets a 3D Boost

A new framework combines context, batch, and turn scaling to push AI reasoning beyond single-dimension limits, achieving up to 86.7% accuracy on Olympiad problems and enabling human-like robot control.

AI Research
March 27, 2026
4 min read

Artificial intelligence models that reason step-by-step, like OpenAI's o1 and DeepSeek-R1, have shown a curious property: they get better at solving problems when given more computational resources during testing, not just during training. This phenomenon, known as test-time scaling, suggests that AI's reasoning ability can be enhanced on the fly. However, this potential has been bottlenecked by the limited context windows of current models, which allow for far fewer tokens during inference than the trillions consumed in training. A new study from Tsinghua University and New York University tackles this limitation head-on by proposing a unified framework that expands test-time scaling into three dimensions, dramatically extending what AI can achieve with extra compute at the moment of problem-solving.

The researchers discovered that test-time scaling isn't a one-dimensional affair. By analyzing existing techniques, they identified three distinct axes where performance improves with increased token usage: context scaling (longer reasoning chains), batch scaling (generating multiple parallel solutions), and turn scaling (iterative self-refinement). Each dimension individually shows a scaling effect, where accuracy climbs as more computational budget is allocated, but each also hits a clear ceiling. For instance, on International Mathematical Olympiad (IMO) problems, extending context length or increasing batch size initially boosts accuracy, but gains plateau or even decline beyond a point, as shown in Figure 2 of the paper. This saturation indicates that relying on any single dimension alone is insufficient to unlock the full potential of test-time compute.

The methodology involves a comprehensive framework called 3D test-time scaling, which integrates all three dimensions. The researchers used Gemini 2.5 Pro as the base reasoning model and tested it on challenging benchmarks including IMO, Chinese Physics Olympiad (CPHO), and International Olympiad in Informatics (IOI) problems. In the 3D setup, the model operates over multiple turns (T), generating a batch of independent responses (B) per turn within a specified context length (C). At each turn, an aggregation function—either an LLM judge or a human judge—selects the best and worst responses to inform the next iteration. This process allows the model to refine its reasoning iteratively while exploring diverse solution paths, effectively combining the strengths of parallel sampling and sequential refinement.
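The turn–batch–context loop described above can be sketched in a few lines of Python. This is a minimal, hypothetical sketch: `generate` and `judge` are stand-ins for calls to the base model and to the aggregation function (an LLM or human judge), not the paper's actual implementation.

```python
def generate(prompt: str, context_limit: int, seed: int) -> str:
    """Stand-in for one sampled response from the base reasoning model
    (hypothetical interface; a real version would call an LLM API)."""
    return f"candidate-{seed}: {prompt[:context_limit]}"

def judge(responses: list[str]) -> tuple[str, str]:
    """Stand-in for the aggregation function (LLM or human judge):
    returns the (best, worst) responses from a batch."""
    ranked = sorted(responses)  # placeholder ranking criterion
    return ranked[-1], ranked[0]

def three_d_scaling(problem: str, turns: int, batch: int, context: int) -> str:
    """Run T turns; each turn samples B independent responses within
    context budget C, then conditions the next turn's prompt on the
    judged best and worst responses from this one."""
    prompt, best = problem, ""
    for t in range(turns):                                    # turn scaling (T)
        responses = [generate(prompt, context, seed=t * batch + b)
                     for b in range(batch)]                   # batch scaling (B)
        best, worst = judge(responses)                        # aggregation step
        prompt = (f"{problem}\nBest attempt so far: {best}\n"
                  f"Worst attempt so far: {worst}\nImprove on the best.")
    return best
```

The key design point is that batch scaling explores diverse solutions in parallel within each turn, while turn scaling refines sequentially across turns, with the context budget bounding each individual sample.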

The experiments demonstrate that 3D scaling substantially outperforms single-dimension approaches. On IMO 2025 problems, 3D scaling with an LLM judge achieved an average accuracy of 73.3%, surpassing the limits of the individual scaling dimensions, as detailed in Figure 4. When human feedback was incorporated as the judge, accuracy rose to 86.7%, highlighting the power of human-in-the-loop integration. Similarly, on CPHO 2022, 3D scaling with a human judge reached 70% accuracy, compared to 53.3% for basic context scaling. For IOI 2025 coding problems, the human-judged 3D configuration scored 221.53 points, approaching the bronze medal cutoff and representing a 19.9% improvement over batch scaling alone. The framework also extended to embodied learning tasks, where it enabled a humanoid robot to learn human-like jumping behaviors through iterative reward function refinement, with 17 out of 20 human volunteers preferring its outcomes over those of the baseline.

The implications of this work are profound for both AI research and practical applications. By framing test-time enhancement as a multi-dimensional scaling problem, the study provides a systematic way to amplify AI reasoning capabilities without retraining models. This could lead to more reliable AI assistants in education, science, and programming, where complex problem-solving is essential. The human-in-the-loop aspect further bridges AI with human expertise, allowing for collaborative refinement in domains like robotics and creative tasks. The extension to embodied learning, demonstrated through the HumanoidJump task, shows how this framework can tackle open-ended tasks where predefined rewards are inadequate, paving the way for more adaptive and intuitive AI systems.

Despite these advances, the study acknowledges several limitations. Each scaling dimension has bounded capacity, and the researchers note that performance can degrade if parameters are pushed too far—for example, batch scaling with majority voting can amplify model biases, leading to accuracy drops as batch size increases, as proven in Theorem 1. The framework also relies on the base model's ability to judge its own responses, which may falter in highly complex tasks. Additionally, while human feedback boosts performance, it introduces scalability and cost constraints. The paper concludes by posing open questions about whether other scaling dimensions exist beyond context, batch, and turn, suggesting that further exploration could unlock even greater reasoning potential in AI systems.
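The bias-amplification effect of majority voting is easy to see with basic probability (this is an illustrative binomial calculation, not the paper's Theorem 1 itself; the per-sample accuracy of 0.4 is a hypothetical value). If a model is biased toward a wrong answer on some problem, so that each independent sample is correct with probability p < 0.5, then a larger batch makes the majority vote *more* likely to converge on the wrong answer:

```python
from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that the majority of n independent samples is correct,
    given per-sample accuracy p. Odd n only, to avoid ties."""
    assert n % 2 == 1, "use odd batch sizes to avoid ties"
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# A model biased toward a wrong answer (p = 0.4): accuracy falls as n grows.
for n in (1, 5, 21):
    print(n, round(majority_vote_accuracy(0.4, n), 3))
```

The same formula shows the flip side: when p > 0.5, majority voting over larger batches boosts accuracy, which is why batch scaling helps until the model's biases dominate.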

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.


Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn