AI Agents Need Better Testing, Not Bigger Models

TL;DR

New research shows rigorous validation and clear goals matter more than model size for effective AI agents, challenging industry assumptions.

As businesses rush to adopt 'agentic AI' systems that promise to automate complex tasks, a critical question emerges: how can we trust these systems to work reliably in high-stakes environments? New research examines what makes today's AI agents different from previous generations of artificial intelligence and identifies validation as the central challenge facing their deployment.

Researchers propose a 'realist' definition of agentic AI as a delivery mechanism similar to software-as-a-service that performs work autonomously within enterprise settings. Unlike theoretical constructs from AI research, these are practical systems operating within real business processes. The key distinction lies in their capacity to take multiple connected actions within complex sociotechnical environments rather than simply responding to individual prompts.

The study analyzes definitions across multiple perspectives. From classical AI research, intelligent agents act purposefully toward goals. Legal definitions emphasize acting on behalf of a principal with fiduciary responsibilities. Industry definitions, such as NVIDIA's framework, describe systems that perceive, reason, act, and learn. The realist definition bridges these views by focusing on how AI systems actually function within organizational contexts.

A central finding reveals an irony in current AI development: while excitement about agentic AI stems from powerful foundation models such as large language models (LLMs), effective systems may actually reduce their dependence on these models over time. When properly specified, agentic systems can achieve their objectives using smaller, more interpretable components that compete on marginal efficiency rather than raw intelligence. The research suggests that well-designed systems might employ specialized tools like Markov chains or linear programming instead of relying solely on LLMs.

The paper identifies three major validation challenges that current approaches struggle to address. First, even with high accuracy on individual tasks, errors compound as systems perform sequences of actions: assuming each step succeeds independently, a system with 90% accuracy per step drops to about 66% accuracy across just four steps (0.9^4 ≈ 0.656). Second, foundation models trained on general data may not capture the specific knowledge required for particular deployments. Third, performance can degrade over time due to model drift and changing data distributions.
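
The compounding-error figure can be reproduced with a few lines of Python. This is a minimal sketch, not code from the paper: the function name end_to_end_accuracy is hypothetical, and it assumes every step succeeds independently with the same probability, which real agent pipelines may not satisfy.

```python
def end_to_end_accuracy(per_step_accuracy: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps
    that each succeed with the same probability (an illustrative assumption)."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # Show how quickly reliability erodes as the action chain grows.
    for n in (1, 2, 4, 8, 16):
        print(f"{n:>2} steps at 90% per step: {end_to_end_accuracy(0.9, n):.1%}")
```

For four steps the function returns roughly 0.656, matching the figure cited above. Longer chains fall off faster still, which is why per-step accuracy alone says little about whether a multi-step agent can be trusted.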

For practical implementation, the research outlines a multi-stage design process that emphasizes validation from the start. Designers should model the complete sociotechnical environment, define clear objectives and constraints, check for potential feedback leaks and perverse behaviors, and implement ongoing monitoring. This approach treats system design as a form of mechanism design where the focus shifts from model capabilities to reliable outcomes.
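
The ongoing-monitoring step could look something like the sketch below. It is an illustration only, not the paper's method: the class RollingSuccessMonitor, the window size, and the threshold are all hypothetical, and it assumes each completed agent task can be labeled as a success or failure after the fact.

```python
from collections import deque
from typing import Optional


class RollingSuccessMonitor:
    """Tracks the success rate over the most recent outcomes (hypothetical helper)."""

    def __init__(self, window: int, min_success_rate: float) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        self.min_success_rate = min_success_rate
        # deque with maxlen discards the oldest outcome automatically.
        self.outcomes = deque(maxlen=window)

    def record(self, succeeded: bool) -> None:
        """Store the outcome of one completed task."""
        self.outcomes.append(bool(succeeded))

    def success_rate(self) -> Optional[float]:
        """Return the success rate in the window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def is_degraded(self) -> bool:
        """Flag degradation only once the window is full, to avoid noisy early alerts."""
        if len(self.outcomes) < self.window:
            return False
        return self.success_rate() < self.min_success_rate


if __name__ == "__main__":
    monitor = RollingSuccessMonitor(window=4, min_success_rate=0.75)
    for outcome in (True, True, True, True, False, False):
        monitor.record(outcome)
    print("success rate:", monitor.success_rate())
    print("degraded:", monitor.is_degraded())
```

A check like this only catches degradation that shows up in labeled outcomes. It does not replace modeling the environment or testing for perverse behaviors before deployment.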

The implications extend across industries where AI systems handle critical functions. In insurance claims processing, document management, or customer service, the validation gap represents a significant business risk. Systems that appear competent in testing may fail in production due to unanticipated environmental factors or compounding errors.

Current limitations include the vulnerability of LLM-based systems to jailbreaking, hallucinations, and security risks. Research shows these systems often lack confidentiality awareness and may provide only the illusion of reasoning due to limited context windows. The path forward likely involves combining foundation models with specialized, verifiable components rather than relying exclusively on large models.

The research concludes that the future of agentic AI depends less on model scale and more on rigorous validation processes that ensure systems reliably achieve their intended purposes in real-world settings.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
