AI Learns to Think Without Needing Answers

Artificial intelligence systems often struggle with complex reasoning in areas where answers aren't clearly right or wrong, which limits their real-world applications. A new study demonstrates how AI can develop sophisticated thinking abilities across diverse domains, from mathematics to creative writing, without requiring a verifiable correct answer for every training task. This advancement could enable more capable AI assistants for education, research, and daily problem-solving.

The researchers developed General Zero-RL, a reinforcement learning method that improves large language models' reasoning capabilities across both verifiable domains (like mathematics with clear answers) and non-verifiable domains (like open-ended questions without definitive solutions). Their approach combines reward signals from different sources: for mathematical problems, they used accuracy-based rewards comparing responses to ground truth, while for general tasks, they employed a generative reward model that evaluates response quality on a scale from -5 to 5.
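The way the two reward sources fit together can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the researchers' code: the `Sample` type, the exact-match check in `accuracy_reward`, and the `judge` callable standing in for the generative reward model are all hypothetical names. The article also does not say how the two reward scales are balanced against each other, so the sketch simply returns each score on its own scale.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Sample:
    prompt: str
    response: str
    ground_truth: Optional[str] = None  # set only for verifiable tasks


def accuracy_reward(response: str, ground_truth: str) -> float:
    """Hypothetical verifier: 1.0 if the last line of the response
    matches the reference answer exactly, otherwise 0.0."""
    lines = response.strip().splitlines()
    final = lines[-1].strip() if lines else ""
    return 1.0 if final == ground_truth.strip() else 0.0


def combined_reward(sample: Sample, judge: Callable[[str, str], float]) -> float:
    """Route a sample to the reward source that fits its domain.

    `judge` is a stand-in for a generative reward model that scores
    response quality; its output is clamped to the [-5, 5] range
    described in the article.
    """
    if sample.ground_truth is not None:
        # Verifiable domain: compare against the known answer.
        return accuracy_reward(sample.response, sample.ground_truth)
    # Non-verifiable domain: rely on the model-based quality score.
    score = judge(sample.prompt, sample.response)
    return max(-5.0, min(5.0, score))
```

In this sketch the presence of a ground-truth answer decides which reward applies, so a single training batch can mix math problems with open-ended prompts.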

The methodology builds on existing reinforcement learning techniques but introduces several modifications. The team used Group Relative Policy Optimization (GRPO) with a token-level policy gradient loss in place of a sequence-level loss. They trained models on a blended dataset containing 178,535 math-related examples, 125,798 STEM samples from WebInstruct, and 36,125 general conversation samples from ShareGPT, about 340,000 examples in total. Crucially, they implemented a smooth length penalty that prevents models from producing excessively verbose responses without substantive reasoning. This addresses a common problem in reward-based training, where models learn to maximize rewards through verbosity rather than genuine reasoning.
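The three ingredients named above can be sketched in a few lines of plain Python. This is a hedged illustration, not the study's implementation: the article gives no formula for the length penalty, so the cosine ramp in `smooth_length_penalty` and its `soft_limit`, `hard_limit`, and `max_penalty` parameters are assumptions, and the per-token loss values are taken as given rather than computed from a real policy.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each reward is standardized against the
    other responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def smooth_length_penalty(length, soft_limit, hard_limit, max_penalty=1.0):
    """Hypothetical penalty shape: zero up to soft_limit, then a cosine
    ramp that reaches max_penalty at hard_limit (no sudden jump)."""
    if length <= soft_limit:
        return 0.0
    if length >= hard_limit:
        return max_penalty
    frac = (length - soft_limit) / (hard_limit - soft_limit)
    return max_penalty * 0.5 * (1.0 - math.cos(math.pi * frac))


def sequence_level_loss(per_token_losses):
    """Average within each response first, then across responses,
    so every response counts equally regardless of length."""
    return mean(mean(seq) for seq in per_token_losses)


def token_level_loss(per_token_losses):
    """Average over all tokens in the group at once, so every token
    counts equally and long responses carry more weight."""
    total = sum(sum(seq) for seq in per_token_losses)
    count = sum(len(seq) for seq in per_token_losses)
    return total / count
```

The two loss functions differ only in how they aggregate: with one short and one long response in a group, the sequence-level version gives each response half the weight, while the token-level version weights them by token count. Under the assumptions above, the penalty would be subtracted from a response's reward before the advantages are computed.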

The reported results cover three groups of benchmarks. On mathematical reasoning, the General Zero-RL model based on Qwen3-14B-Base achieved 92.4% accuracy on MATH-500, 59.7% on AIME24, and 38.2% on AIME25, outperforming comparable models including DeepSeek-R1-Zero-Qwen-32B. On general reasoning, the same model reached 56.1% on MMLU-Pro, 58.0% on GPQA-Diamond, and 45.3% on SuperGPQA. Most notably, on general tasks like creative writing and conversation (evaluated using Arena-Hard, WritingBench, WildBench v2, and AlpacaEval2.0), the models generated coherent, meaningful content and outperformed models trained exclusively on verifiable domains.

The practical implications could be substantial. The results suggest that reasoning abilities trained this way transfer across domains, so that skills learned in mathematics can improve performance in writing, conversation, and complex problem-solving. For everyday users, this could lead to more helpful AI assistants capable of nuanced understanding and reasoning across diverse topics, from helping students with homework to assisting professionals with complex analysis tasks.

The study acknowledges limitations. It did not explore programming domains, which require specialized verification methods such as code execution sandboxes. The models also weren't tested against Qwen3-Instruct's thinking mode, which uses supervised fine-tuning for chain-of-thought reasoning. The researchers note that integrating code-related data into their multi-task framework is an important direction for future work.

About the Author

Guilherme A.