Chess has long been a benchmark for artificial intelligence, but new research reveals that even the most advanced large language models (LLMs) struggle with fundamental aspects of the game. A comprehensive evaluation called ChessQA, developed by researchers at the University of Toronto, shows that these models often fail at tactical calculations and strategic judgment, despite excelling at basic rule recognition. This gap highlights critical weaknesses in AI reasoning that could impact real-world applications where complex decision-making is required.
The key finding is that LLM performance varies sharply across levels of chess understanding. Models reach up to 97% accuracy on structural tasks such as identifying legal moves or piece arrangements, but accuracy falls to as low as 17% on tactical problems that require short, calculated sequences. On position judgment, which means evaluating long-term advantages, even top models like GPT-5 barely surpass random guessing, indicating a fundamental lack of planning depth.
To build ChessQA, the researchers constructed a benchmark of 3,500 items across five categories: Structural (basic rules), Motifs (pattern recognition), Tactics (short calculations), Judgment (position evaluation), and Semantic (describing concepts). The categories test progressively more abstract skills, mirroring how human players develop expertise. The researchers evaluated 15 LLMs, including GPT-5, Claude Sonnet, and Gemini 2.5, using standardized chess notations like FEN and UCI to ensure fair, zero-shot testing without model fine-tuning.
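For readers unfamiliar with the two notations: FEN encodes a position as a single line of text, and UCI writes a move as an origin square followed by a destination square. The sketch below shows one way to read both in plain Python. It is an illustration only, not the benchmark's own tooling, and the helper names `parse_fen_board` and `parse_uci_move` are hypothetical.

```python
from __future__ import annotations

# Illustrative only: minimal FEN and UCI helpers, not ChessQA's actual tooling.


def parse_fen_board(fen: str) -> dict[str, str]:
    """Map occupied squares (e.g. 'e1') to piece letters from a FEN string."""
    placement = fen.split()[0]          # first FEN field: piece placement
    ranks = placement.split("/")
    if len(ranks) != 8:
        raise ValueError("FEN placement must have 8 ranks")
    board = {}
    # FEN lists rank 8 first and rank 1 last.
    for rank_number, row in zip(range(8, 0, -1), ranks):
        file_index = 0
        for ch in row:
            if ch.isdigit():
                file_index += int(ch)   # a digit means that many empty squares
            else:
                square = "abcdefgh"[file_index] + str(rank_number)
                board[square] = ch      # uppercase = White, lowercase = Black
                file_index += 1
        if file_index != 8:
            raise ValueError(f"rank {rank_number} does not cover 8 files")
    return board


def parse_uci_move(move: str) -> tuple[str, str, str | None]:
    """Split a UCI move like 'e2e4' or 'e7e8q' into (from, to, promotion)."""
    if len(move) not in (4, 5):
        raise ValueError("UCI moves have 4 or 5 characters")
    promotion = move[4] if len(move) == 5 else None
    return move[:2], move[2:4], promotion


# The standard starting position and the move "pawn from e2 to e4".
START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
board = parse_fen_board(START_FEN)
origin, target, promotion = parse_uci_move("e2e4")
print(board[origin], origin, "->", target)   # prints: P e2 -> e4
```

Because both notations are plain text, a position and a candidate move can be placed directly in a prompt and a model's reply can be checked mechanically, which is what makes zero-shot comparison across models practical.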
According to the paper, enabling chain-of-thought reasoning improves performance by an average of 14.7 percentage points, but at a high computational cost: models consumed up to 14,823 tokens per task. The error analysis identified common failure modes: hallucinating piece positions, making legality mistakes in tactics, and incorrectly concluding that "no answer" exists for solvable puzzles. These errors persist even in state-of-the-art models, underscoring that raw scale alone doesn't solve reasoning deficiencies.
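Some of these failure modes can be detected mechanically. The sketch below is a rough illustration of that idea, not the paper's error-analysis code: it checks whether a move given in UCI notation starts from a square that actually holds one of the mover's pieces in the FEN position. The function names and the category labels are hypothetical, and the check is deliberately partial, since it does not verify full move legality.

```python
# Illustrative only: a partial sanity check, not a full legality test
# and not the paper's error-analysis code.


def piece_at(fen: str, square: str):
    """Return the piece letter on `square` in a FEN position, or None if empty."""
    ranks = fen.split()[0].split("/")          # rank 8 first, rank 1 last
    row = ranks[8 - int(square[1])]
    target_file = "abcdefgh".index(square[0])
    file_index = 0
    for ch in row:
        if ch.isdigit():
            file_index += int(ch)              # run of empty squares
            if file_index > target_file:
                return None
        else:
            if file_index == target_file:
                return ch
            file_index += 1
    return None


def classify_answer(fen: str, uci_move: str) -> str:
    """Label an answered move with a coarse, hypothetical error category."""
    side_to_move = fen.split()[1]              # 'w' or 'b'
    piece = piece_at(fen, uci_move[:2])
    if piece is None:
        return "no piece on origin square"     # likely a hallucinated piece
    if piece.isupper() != (side_to_move == "w"):
        return "moves the opponent's piece"
    captured = piece_at(fen, uci_move[2:4])
    if captured is not None and captured.isupper() == piece.isupper():
        return "captures own piece"
    return "passes basic checks"


# Toy position: white king e1, white pawn e2, black king e8, White to move.
FEN = "4k3/8/8/8/8/8/4P3/4K3 w - - 0 1"
for move in ("e2e4", "d2d4", "e8e7", "e1e2"):
    print(move, "->", classify_answer(FEN, move))
```

A move that passes these checks can still be illegal, for example if it leaves the king in check, so a real evaluation would rely on a complete move generator rather than this shortcut.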
These limitations matter because chess serves as a microcosm for broader AI challenges. If models can't reliably navigate a constrained environment with clear rules and objectives, their applicability to dynamic real-world scenarios such as medical diagnosis or financial planning remains questionable. The benchmark's design allows ongoing evaluation as models improve, providing a persistent measure of reasoning progress.
The study notes two limitations: the benchmark currently focuses on static positions rather than full games, and reasoning efficiency remains inherently hard to scale, since human chess masters don't need thousands of tokens to solve puzzles. Future work will expand the dataset and increase its difficulty so that it stays challenging as models evolve, keeping ChessQA a relevant diagnostic tool for AI development.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.