
AI's Vision Problem in Puzzle Solving

A new study reveals how the way AI systems 'see' data affects their reasoning, with text and image formats creating different blind spots that can be fixed by combining both.

AI Research
March 27, 2026
4 min read

When AI systems tackle complex puzzles, how they perceive the problem can be just as important as how they reason about it. A new study examining the ARC-AGI benchmarks, challenging tasks designed to test AI generalization through composition, reveals that the choice between text and image representations fundamentally shapes what transformers can 'see' and how accurately they solve problems. This finding has significant implications for designing more reliable AI systems, as perception errors cascade directly into reasoning failures, undermining performance on tasks that require spatial understanding.

The researchers discovered that different data formats create systematic perceptual bottlenecks that affect feature identification. Text encodings, particularly JSON and ASCII formats, excel at precise coordinate identification for sparse, well-defined features, achieving perfect accuracy on tasks like 13e47133, 135a2760, and 142ca369. However, these text formats exhibit a one-dimensional processing bias: row-only serializations struggle with vertical alignment (misidentifying Q11 instead of P11, an error of one column), while column-only formats miss horizontal features (identifying B1 instead of B2, an error of one row). This directional limitation means that serialization direction determines which spatial relationships the model can reliably reconstruct; text modalities achieved an average accuracy of 80.4% on sparse single pixels, compared with 76.4% for row-only formats.
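A minimal sketch (not from the paper) of why serialization direction matters: the same grid reads very differently as row-only versus column-only text, because a vertical line that is fragmented across lines in row-major order becomes contiguous in column-major order.

```python
# Illustrative sketch: serializing a 2D grid as row-only vs column-only
# text changes which cells are adjacent in the token stream.

def serialize_rows(grid):
    """Row-only: cells in the same row are adjacent in the text."""
    return "\n".join(" ".join(str(c) for c in row) for row in grid)

def serialize_cols(grid):
    """Column-only: cells in the same column are adjacent instead."""
    return "\n".join(" ".join(str(c) for c in col) for col in zip(*grid))

grid = [
    [0, 0, 3],
    [0, 0, 3],
    [5, 5, 0],
]

# The vertical 3-3 line is split across text lines in row-only form...
print(serialize_rows(grid))
# ...but contiguous in column-only form, so it is easier to 'see' there.
print(serialize_cols(grid))
```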

To isolate perception from reasoning, the study employed a weighted set-disagreement metric across nine text and image modalities. The methodology involved encoding ARC puzzle grids into formats including row-only, column-only, ASCII, JSON, and five image resolutions (14x14 to 768x768 pixels per cell), then prompting multimodal LLMs to produce detailed descriptions of grid features. These descriptions were compared against ground truth using a systematic workflow that extracted coordinates and colors, calculating accuracy scores based on feature-importance weights. For example, in task 13e47133, isolated colored dots were weighted at 15 points each, the complex divider line at 20 points, and the background color at 5 points, reflecting their criticality for puzzle solving.
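The weighted scoring could be sketched as below. Only the 15/20/5 weights come from the paper's 13e47133 example; the feature names, coordinate values, and the `weighted_accuracy` helper are illustrative assumptions, not the study's actual pipeline.

```python
# Sketch of a weighted feature-accuracy score: each feature contributes
# its importance weight only if the model described it correctly.

def weighted_accuracy(predicted, truth, weights):
    """Weighted fraction of features whose description matches ground truth."""
    total = sum(weights.values())
    earned = sum(
        w for feat, w in weights.items()
        if predicted.get(feat) == truth.get(feat)
    )
    return earned / total

# Assumed example features; weights mirror the paper's 13e47133 scheme.
weights   = {"dot_A": 15, "dot_B": 15, "divider": 20, "background": 5}
truth     = {"dot_A": (3, 7), "dot_B": (9, 2), "divider": "col 5", "background": 0}
predicted = {"dot_A": (3, 7), "dot_B": (9, 3), "divider": "col 5", "background": 0}

# One 15-point dot misread: score = 40/55, roughly 0.73.
print(weighted_accuracy(predicted, truth, weights))
```

Weighting by criticality means a missed divider line (20 points) costs more than a missed background color (5 points), matching the features' relative importance for solving the puzzle.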

The results, detailed in Table 1, show that image modalities preserve two-dimensional structure through native representation but introduce patch-size aliasing that corrupts single-cell detail. At low resolutions (14x14–17x17 pixels per cell), multiple grid cells pack into a single vision patch, causing hallucinations in which single pixels are perceived as larger shapes. At high resolutions, a single cell spans multiple patches, leading to cross-boundary counting errors and coordinate misreadings when spreadsheet-style labels straddle patch boundaries. Only near-optimal resolutions such as the 24x24-1205 run (approximately 1.5 patches per cell for Gemini's 16x16 patch size) keep critical features from aligning with patch boundaries, achieving 85.00% accuracy on 13e47133 versus 45.73% for the 16x16 image format. Combining modalities enables cross-validation and improves execution accuracy: multi-modal inputs such as row+col+JSON+image achieved a median similarity score of 0.69 on 13e47133, nearly double that of single-modality encodings and an improvement of approximately 0.20 over text-only baselines.
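The aliasing regimes follow from simple arithmetic on pixels per cell versus patch size. A small sketch, assuming the 16x16-pixel patch size the study reports for Gemini's vision encoder:

```python
# Sketch: how many vision patches cover one grid cell, assuming a
# 16x16-pixel patch size (the value reported for Gemini in the study).

def patches_per_cell(cell_px, patch_px=16):
    """Ratio of cell edge length to patch edge length."""
    return cell_px / patch_px

for cell_px in (14, 17, 24, 768):
    ratio = patches_per_cell(cell_px)
    if ratio < 1:
        regime = "aliasing: multiple cells share one patch"
    elif ratio.is_integer():
        regime = "cell edges coincide with patch boundaries"
    else:
        regime = "cell edges offset from patch boundaries"
    print(f"{cell_px}px/cell -> {ratio:.2f} patches/cell ({regime})")
```

At 24 pixels per cell the ratio is 1.5, so cell boundaries fall mid-patch rather than coinciding with patch boundaries, which is the regime the study found least error-prone.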

These findings have practical implications for designing AI systems that handle spatial reasoning tasks. For tasks with sparse, clearly delimited features, structured text formats (JSON or ASCII) deliver the highest perception accuracy. For tasks dominated by horizontal or vertical patterns, directional serializations (row-only or column-only) can turn their one-dimensional biases into strengths. Image modalities should be paired with patch-size-aware resolution settings to minimize aliasing, though the optimal resolution varies across vision-language model architectures. Most importantly, combining complementary modalities allows models to cross-check coordinates and shapes, mitigating individual failure modes and improving both perception accuracy (by roughly 8 points) and execution reliability without altering the underlying model. This approach aligns with predictive-coding principles, in which multiple sensory streams help resolve ambiguities and correct errors.
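One way such cross-checking might look, sketched as a simple majority vote over per-modality coordinate readings. The voting scheme is an assumption for illustration, not the paper's method; the point is that a one-off serialization error in a single modality can be outvoted by the others.

```python
# Illustrative sketch: cross-validating one coordinate across modalities
# by majority vote, so an isolated misread does not win.

from collections import Counter

def cross_validate(readings):
    """readings: {modality_name: (row, col)} -> (consensus coord, agreement)."""
    counts = Counter(readings.values())
    coord, votes = counts.most_common(1)[0]
    return coord, votes / len(readings)

readings = {
    "row_only": (11, 16),   # off by one column, as with the Q11/P11 error
    "json":     (11, 15),
    "image":    (11, 15),
}

# Consensus is (11, 15) with 2/3 agreement; the row-only error is outvoted.
print(cross_validate(readings))
```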

However, the study has limitations that future work should address. The perception experiments tested individual modalities in isolation rather than in combination, leaving open how multi-modal inputs directly affect feature-identification accuracy. Execution experiments focused on a single task (13e47133); while the systematic patterns suggest generalizability, broader evaluation across diverse task types would strengthen the conclusions. The optimal image resolution is model-specific and depends on the vision encoder's patch size, requiring empirical calibration for each architecture. Additionally, variability in results across multiple runs (e.g., 24x24-1148 vs. 24x24-1205) indicates that future work should conduct multiple runs per configuration and test across different LLMs and VLMs to establish statistical significance. Despite these constraints, the research demonstrates that representation choice is a first-order design consideration, offering actionable guidance for enhancing AI robustness in spatial reasoning applications.

Original Source

Read the complete research paper on arXiv.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn