Artificial intelligence systems designed for web interaction face significant challenges when navigating dynamic, real-time environments, according to a new evaluation of OpenAI's ChatGPT Atlas. While these systems show promise for straightforward tasks like information retrieval, their performance in complex interactive scenarios reveals critical limitations that could impact real-world applications.
The study examined Atlas's capabilities across five web-based games representing different cognitive demands: Google's T-Rex Runner, Flappy Bird, 2048, Sudoku, and the MMORPG Stein.world. Researchers conducted systematic trials using the Agent Mode (Preview) feature, providing minimal instructions and observing how the system handled unpredictability and timing requirements.
Testing occurred under standard conditions using the October 2025 release of ChatGPT Atlas on macOS Sonoma 14.6.1. The evaluation employed zero-shot protocols where researchers navigated to target URLs, enabled Agent Mode, and provided basic instructions like "Try to play the game. Stop when you get stuck." Performance was measured against established human baselines for each game.
The results revealed a clear performance dichotomy. Atlas excelled at logical reasoning tasks, completing medium-difficulty Sudoku puzzles in an average of 2 minutes 28 seconds, roughly 4.5 times faster than human baselines of 10-12 minutes. The system employed systematic constraint identification and mathematical modeling to solve puzzles efficiently.
However, Atlas struggled dramatically with real-time coordination. In Google's T-Rex Runner, the system achieved only 11.7% of human baseline performance, averaging 45.5 points against a human baseline of 388.9 points. It failed to pass the first obstacle in 9 out of 10 trials because it consistently jumped too late. Similarly, in Flappy Bird, Atlas failed to pass any pipes across multiple attempts, tapping erratically and without the rhythmic precision the game requires.
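The two headline ratios reported above can be re-derived from the quoted figures. The following is a minimal Python sketch, not the study's own analysis code; the constant and function names are hypothetical and chosen only for illustration, and the inputs are the numbers quoted in the text.

```python
# Figures as quoted in the article; all names here are illustrative assumptions.
SUDOKU_ATLAS_SECONDS = 2 * 60 + 28           # 2 minutes 28 seconds
SUDOKU_HUMAN_SECONDS = (10 * 60, 12 * 60)    # human baseline of 10-12 minutes
TREX_ATLAS_SCORE = 45.5                      # Atlas average in T-Rex Runner
TREX_HUMAN_SCORE = 388.9                     # human baseline in T-Rex Runner


def speedup(baseline_seconds: float, agent_seconds: float) -> float:
    """How many times faster the agent is than the baseline."""
    return baseline_seconds / agent_seconds


def percent_of_baseline(agent_score: float, baseline_score: float) -> float:
    """Agent score expressed as a percentage of the baseline score."""
    return 100.0 * agent_score / baseline_score


# Speedup at both ends of the 10-12 minute human range.
low, high = (speedup(s, SUDOKU_ATLAS_SECONDS) for s in SUDOKU_HUMAN_SECONDS)
trex_pct = percent_of_baseline(TREX_ATLAS_SCORE, TREX_HUMAN_SCORE)

print(f"Sudoku speedup: {low:.2f}x to {high:.2f}x")
print(f"T-Rex Runner: {trex_pct:.1f}% of human baseline")
```

Run as written, this prints a Sudoku speedup of 4.05x to 4.86x, which brackets the "roughly 4.5 times" figure, and a T-Rex Runner score of 11.7% of the human baseline.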
In exploration-intensive games, Atlas demonstrated awareness of difficulties but limited adaptive capacity. During 2048 gameplay, the system discovered the game mechanics through exploration but then fell into repetitive "swirling" movement patterns, making approximately 40 directional moves without developing a strategic approach to merging tiles. It typically stalled at the 64 tile, far below expert human performance.
The MMORPG Stein.world revealed additional limitations in contextual understanding and goal pursuit. Despite extended attempts and detailed instructions, Atlas failed to complete basic narrative tasks. The system located the starting room but couldn't navigate to required NPCs, spending considerable time deliberating between control methods rather than efficiently executing commands.
These findings matter because web-based AI agents are increasingly deployed for real-world tasks requiring dynamic interaction. The performance gaps in timing precision, motor coordination, and strategic adaptation suggest current systems may struggle with applications requiring real-time decision-making or sustained goal-directed behavior beyond structured problem-solving.
The study acknowledges limitations, including small sample sizes and the preliminary nature of its observations. Researchers plan to expand the evaluation to include more complex applications and comparative analysis with other AI systems. Future work will focus on developing testing protocols to precisely identify failure points and investigating targeted training approaches to enhance performance in interactive environments.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.