A new study reveals that small language models, when working together in teams, can match or even surpass the performance of much larger models in evaluating the correctness of answers to complex reasoning questions. This finding challenges the assumption that bigger models are always better for judgment tasks, pointing toward more efficient and accessible AI evaluation systems. The research introduces JudgeBoard, a novel pipeline that directly assesses answer correctness without relying on comparison-based methods, and a Multi-Agent Judging (MAJ) framework that leverages collaborative deliberation among small models.
The researchers discovered a significant performance gap between isolated small language models (SLMs) and large language models (LLMs) in judgment tasks, but this gap narrows dramatically when SLMs work collaboratively. On the MATH dataset, which includes high-school-level competition questions, the MAJ framework using Qwen3-14B as a backbone outperformed the best-performing LLM, Qwen3-30B-A3B, by an average of 2% in judging accuracy across categories like Algebra, Number Theory, and Counting and Probability. This demonstrates that smaller models, through structured debate and interaction, can achieve evaluation capabilities comparable to their larger counterparts, potentially reducing computational costs and democratizing model assessment.
The methodology centered on JudgeBoard, a four-stage pipeline that collects candidate answers from student models, gathers independent judgments from judge models, conducts pairwise competitions, and constructs leaderboards using both accuracy-based rankings and an Elo-style rating system. For the MAJ framework, multiple SLM agents were configured with distinct reasoning profiles—such as Deductive Reasoner, Logical Reasoner, and Robust Reasoner—to promote diverse analytical behavior. These agents engaged in structured multi-turn debates, critiquing each other's reasoning and revising judgments before reaching a collective decision through majority voting. Experiments spanned mathematical reasoning datasets like MATH, GSM8K, and OmniMATH, as well as science reasoning datasets like ARC and GPQA, with models ranging from 0.6B to 72B parameters.
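To make the debate mechanics concrete, here is a minimal Python sketch of one MAJ judging round, assuming a generic `query_model` helper that wraps whichever SLM backend is used. The profile names follow the paper; the prompt wording, verdict format, and turn count are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a Multi-Agent Judging (MAJ) round. `query_model` is a
# placeholder for a call to the SLM backend (e.g., Qwen3-14B); prompts and
# the two-turn debate length are assumptions for illustration.
from collections import Counter

PROFILES = {
    "Deductive Reasoner": "Judge step by step from first principles.",
    "Logical Reasoner": "Check each inference for logical validity.",
    "Robust Reasoner": "Probe the answer for edge cases and hidden errors.",
}

def query_model(prompt: str) -> str:
    """Placeholder for the underlying SLM call."""
    raise NotImplementedError

def parse_verdict(text: str) -> str:
    """Extract the final verdict from a reply (assumes the prompted format)."""
    t = text.upper()
    # Check INCORRECT first, since the string "INCORRECT" contains "CORRECT".
    return "INCORRECT" if "INCORRECT" in t else ("CORRECT" if "CORRECT" in t else "INCORRECT")

def multi_agent_judge(question: str, student_answer: str, turns: int = 2) -> str:
    # Turn 0: each profiled agent judges the student answer independently.
    replies = {
        name: query_model(
            f"{persona}\nQuestion: {question}\n"
            f"Student answer: {student_answer}\n"
            "Is the answer correct? Explain, then finish with CORRECT or INCORRECT."
        )
        for name, persona in PROFILES.items()
    }
    # Later turns: agents see peers' reasoning, critique it, and may revise.
    for _ in range(turns):
        peer_view = "\n\n".join(f"[{n}] {r}" for n, r in replies.items())
        replies = {
            name: query_model(
                f"{persona}\nOther judges said:\n{peer_view}\n"
                "Critique their reasoning and restate your final verdict "
                "(CORRECT or INCORRECT)."
            )
            for name, persona in PROFILES.items()
        }
    # Collective decision by majority vote over the final verdicts.
    votes = Counter(parse_verdict(r) for r in replies.values())
    return votes.most_common(1)[0][0]
```

With three agents and a binary verdict, the majority vote can never tie, which keeps the collective decision rule trivial; the paper's actual debate protocol exchanges fuller critiques than this compressed loop.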
The JudgeBoard leaderboards, detailed in tables like Table 1, show that LLMs generally outperformed SLMs in isolated judging, but newer SLMs like Qwen3 and Gemma3 models often surpassed older LLMs. For instance, on the MATH dataset, Qwen3-14B achieved an overall accuracy of 0.6684 in Algebra, while older models like Mixtral-8x7B scored only 0.5316. A key finding was that poorly performing models tended to agree with student answers regardless of correctness, highlighting the importance of metrics like Student Wrong accuracy, which measures the ability to identify incorrect answers. The MAJ framework, as shown in Table 2, enabled smaller Qwen3 models (4B, 8B, and 14B versions) to perform comparably to larger Qwen3 models (30B and 32B versions), with Qwen3-14B achieving an overall accuracy of 0.7789 in Algebra through multi-agent debate.
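To pin down the metrics behind these leaderboards, here is a hedged Python sketch of the bookkeeping: overall judging accuracy, the Student Wrong accuracy that exposes sycophantic judges, and a standard Elo-style update for the pairwise competitions. The field names, K-factor, and scoring convention are illustrative assumptions, not the paper's exact schema.

```python
# Sketch of leaderboard metrics under assumed record fields: each record holds
# the judge's verdict and the ground-truth correctness for one student answer.
from dataclasses import dataclass

@dataclass
class JudgmentRecord:
    judge_says_correct: bool   # judge's verdict on the student answer
    answer_is_correct: bool    # ground-truth correctness of that answer

def accuracy_metrics(records: list[JudgmentRecord]) -> dict[str, float]:
    overall = sum(r.judge_says_correct == r.answer_is_correct for r in records) / len(records)
    wrong = [r for r in records if not r.answer_is_correct]
    # "Student Wrong" accuracy: how often the judge flags a genuinely wrong
    # answer. A judge that always agrees with the student scores near 0 here.
    student_wrong = (
        sum(not r.judge_says_correct for r in wrong) / len(wrong) if wrong else float("nan")
    )
    return {"overall": overall, "student_wrong": student_wrong}

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update after one pairwise competition; score_a is 1.0 for
    a win by judge A, 0.5 for a draw, 0.0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    return r_a + k * (score_a - expected_a), r_b + k * ((1 - score_a) - (1 - expected_a))
```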
The implications of this research are substantial for real-world applications, offering a path to more scalable and cost-effective AI evaluation. By enabling small models to collaborate effectively, the MAJ framework could reduce reliance on expensive large models for tasks like grading educational content, verifying scientific reasoning, or assessing AI-generated outputs. This approach also enhances interpretability, as the debate process provides insights into how judgments are formed. For everyday users, it means that AI systems could become more efficient and transparent, potentially lowering barriers to deploying advanced reasoning tools in resource-constrained environments like schools or small businesses.
However, the study acknowledges several limitations. The evaluation covered only a subset of data points from each dataset, introducing sampling variance and imbalanced splits, particularly in complex datasets like GPQA and OmniMATH where student models rarely answered correctly. The effectiveness of the MAJ framework varied across model families and was sensitive to prompt design and profile selection, with inconsistent results for models like Phi4 and Gemma3. Additionally, the benchmark focused solely on mathematical and science reasoning, leaving other domains like long-form coherence or safety evaluations unexplored. Future work could extend MAJ to broader reasoning areas, investigate adaptive agent profiling, or integrate it with fine-tuning approaches to create specialized judge models.
Original Source
Read the complete research paper
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
Connect on LinkedIn