As large language models (LLMs) become integral to applications across technology, healthcare, finance, and education, organizations face a growing challenge: managing multiple AI models and providers efficiently. Production workloads are diverse, tasks evolve rapidly, and failures often cluster in specific traffic subsets rather than being evenly distributed. No single model or provider is optimal for all cases, and costs can vary by orders of magnitude. Teams increasingly rely on multi-model compositions, routing cheaper models to simpler tasks and stronger ones to harder ones, but this introduces compounding complexity in evaluation and routing. Current approaches, such as LLM-as-judge evaluation, often produce unstructured outputs or single scores that lack fine-grained diagnostics, while routing systems operate as black boxes without interpretable explanations. This gap hinders quality-aware routing and operational transparency in live services.
The researchers developed SEAR (Schema-Based Evaluation and Routing), a system that addresses these limitations by creating a unified, SQL-queryable data layer for evaluating LLM responses and routing requests. SEAR defines an extensible relational schema with around one hundred typed columns across four semantic evaluation tables and one gateway metrics table. The semantic tables cover context signals (like user intent and task type), response characteristics (such as tool invocation or code generation), issue attribution (identifying causes of problems), and quality scores (including relevance and factual accuracy). These tables are linked by foreign keys, enabling cross-table consistency checks to detect errors. The gateway metrics table logs operational data—latency, cost, throughput, and error rates—for every request handled by the LLM gateway. By co-locating evaluation signals and operational metrics, SEAR allows teams to analyze response quality and performance through standard SQL queries, forming a data flywheel where accumulated signals drive routing updates.
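The co-located data layer described above can be sketched in a few tables. This is an illustrative assumption, not SEAR's actual schema: the table and column names below (`gateway_metrics`, `context_signals`, `quality_scores`, and their fields) are invented stand-ins for a design that in reality spans roughly one hundred typed columns across five tables.

```python
import sqlite3

# Minimal sketch of a SEAR-style data layer: evaluation signals and
# gateway metrics linked by foreign keys, queryable with plain SQL.
# All names here are illustrative assumptions, not the paper's schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gateway_metrics (
    request_id  TEXT PRIMARY KEY,
    model       TEXT NOT NULL,
    latency_ms  REAL,
    cost_usd    REAL
);
CREATE TABLE context_signals (
    request_id  TEXT PRIMARY KEY REFERENCES gateway_metrics(request_id),
    user_intent TEXT,
    task_type   TEXT
);
CREATE TABLE quality_scores (
    request_id  TEXT PRIMARY KEY REFERENCES gateway_metrics(request_id),
    relevance   INTEGER CHECK (relevance BETWEEN 1 AND 5),
    factual_acc INTEGER CHECK (factual_acc BETWEEN 1 AND 5)
);
""")
conn.execute("INSERT INTO gateway_metrics VALUES ('r1', 'model-a', 420.0, 0.0031)")
conn.execute("INSERT INTO context_signals VALUES ('r1', 'code_help', 'code_generation')")
conn.execute("INSERT INTO quality_scores  VALUES ('r1', 5, 4)")

# Quality and operational metrics analyzed together in one SQL query.
row = conn.execute("""
    SELECT g.model, g.cost_usd, q.relevance
    FROM gateway_metrics g
    JOIN context_signals c USING (request_id)
    JOIN quality_scores  q USING (request_id)
    WHERE c.task_type = 'code_generation'
""").fetchone()
print(row)  # → ('model-a', 0.0031, 5)
```

The foreign-key links are what make the cross-table consistency checks mentioned in the text possible: any record present in one semantic table but missing or contradicted in another can be surfaced with a join.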
To populate the evaluation signals reliably, SEAR employs a schema-driven judge that uses self-contained signal instructions, in-schema reasoning, and multi-stage generation. Each signal column has a detailed description specifying its definition, evidence scope, value-assignment rules, and edge cases, reducing confusion between adjacent signals. Instead of generating free-text reasoning in separate calls, the judge includes a temporary reasoning field within the JSON schema output, allowing it to think step-by-step before committing to structured values in a single call per table. This in-schema reasoning improves accuracy without doubling token costs. The generation process is divided into stages: context information first, then response characterization, issue attribution, and quality scoring, with each stage receiving the conversation context and upstream structured outputs. This multi-stage approach ensures stability: generating all of the roughly one hundred columns in one call often produces malformed JSON, whereas smaller schemas per call yield reliable outputs.
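The staged pipeline with a discardable in-schema reasoning field can be sketched as follows. The stage names, field names, and the stubbed judge call are all assumptions for illustration; in the real system each stage would be one structured-output LLM call constrained to that stage's JSON schema.

```python
# Sketch of multi-stage, in-schema-reasoning judging (names are assumed).
# Each stage emits a small JSON object whose first field is temporary
# free-text reasoning, dropped after parsing rather than paid for in a
# second call. Later stages see earlier stages' structured outputs.
STAGES = [
    ("context_signals",   {"reasoning": "string", "user_intent": "string",
                           "task_type": "string"}),
    ("response_traits",   {"reasoning": "string", "invokes_tool": "boolean",
                           "generates_code": "boolean"}),
    ("issue_attribution", {"reasoning": "string", "issue_cause": "string"}),
    ("quality_scores",    {"reasoning": "string", "relevance": "integer",
                           "factual_accuracy": "integer"}),
]

def build_schema(fields):
    # "reasoning" is listed first so the model thinks before committing
    # to the structured values in the same single call.
    return {
        "type": "object",
        "properties": {name: {"type": t} for name, t in fields.items()},
        "required": list(fields),
    }

def judge_stage(conversation, schema, upstream):
    # Stub standing in for one structured-output LLM call per table:
    # returns a placeholder value of the right type for every property.
    defaults = {"string": "ok", "boolean": False, "integer": 3}
    return {name: defaults[prop["type"]]
            for name, prop in schema["properties"].items()}

def run_judge(conversation):
    upstream = {}
    for table, fields in STAGES:
        out = judge_stage(conversation, build_schema(fields), dict(upstream))
        out.pop("reasoning", None)  # temporary field: used, then discarded
        upstream[table] = out       # fed to the next stage as context
    return upstream

tables = run_judge([{"role": "user", "content": "Translate this sentence."}])
```

Keeping each call's schema small is the stability lever the text describes: four calls with a handful of fields each, instead of one call asked to emit a hundred fields at once.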
Across 3,000 production sessions sampled from three organizations with distinct workloads—multilingual, roleplay, and translation-heavy—SEAR demonstrated strong performance. Using GPT-5-mini with high reasoning effort and in-schema reasoning, the judge achieved accuracy rates of 91.9% for boolean signals, 92.3% for categorical signals, and 86.0% for ordinal signals in the evaluation table. Cross-table consistency checks flagged only 0.7% of records as inconsistent, indicating high reliability. In a routing case study focusing on simple-complexity tasks, SEAR identified a substitute model, gemini-2.5-flash-lite, that offered 90% lower input cost and 92% lower output cost compared to the deployed model, claude-haiku-4-5, while maintaining comparable quality. Manual verification on 100 replayed sessions showed a 48% win rate for the routed model, effectively tying quality at significantly reduced expense. Additionally, a lightweight context classifier using GPT-5-nano achieved 82.6% boolean accuracy for real-time routing signals, though with higher error rates than the full judge, enabling low-latency decisions.
The implications of SEAR are substantial for organizations deploying LLMs at scale. By providing fine-grained, interpretable signals derived through LLM reasoning rather than shallow classifiers, SEAR enables human-understandable routing explanations and targeted root-cause diagnosis. Teams can now make data-driven routing decisions that balance quality, cost, and latency, reducing operational expenses without compromising performance. For example, queries can identify the cheapest model within 10% of the best quality or rank providers by median latency while ensuring task appropriateness. This transparency is critical in production settings, where routing changes affect live services and require clear justifications. Moreover, SEAR's extensible schema supports evolving needs, allowing new tables or columns to be added without disrupting existing workflows, fostering continuous improvement in AI governance and efficiency.
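The "cheapest model within 10% of the best quality" query mentioned above is a one-liner once per-model aggregates exist. The `model_stats` table, its columns, and the sample numbers below are all assumptions for illustration, not data from the paper.

```python
import sqlite3

# Hedged sketch: pick the cheapest model whose average quality is within
# 10% of the best observed quality. Table and values are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE model_stats (
    model TEXT PRIMARY KEY, avg_quality REAL, avg_cost_usd REAL)""")
conn.executemany("INSERT INTO model_stats VALUES (?, ?, ?)", [
    ("strong-model", 4.8, 0.0150),
    ("mid-model",    4.5, 0.0040),
    ("cheap-model",  3.9, 0.0009),
])

pick = conn.execute("""
    SELECT model FROM model_stats
    WHERE avg_quality >= 0.9 * (SELECT MAX(avg_quality) FROM model_stats)
    ORDER BY avg_cost_usd ASC
    LIMIT 1
""").fetchone()[0]
print(pick)  # → 'mid-model' (within 10% of 4.8, far cheaper than strong-model)
```

Because the routing decision is an ordinary SQL predicate, the justification for a routing change ("quality within 10% of best, cost an order of magnitude lower") reads directly off the query.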
Despite its strengths, SEAR has limitations. The evaluation relies on sampled traffic—only a fraction of requests are judged due to cost constraints—which may miss rare failure modes. The routing case study covered a limited set of models and task types, specifically simple-complexity slices, so its findings may not generalize to broader workloads. Meta-task confusion, where the judge conflates its own evaluation instructions with the user's task, occurred in 3.7% of sessions with GPT-5-mini at high effort, though this decreased to 0% with GPT-5.2 at high effort, indicating that increased inference-time compute mitigates the issue. Real-time context classification with lightweight models showed higher error rates, suggesting a trade-off between latency and accuracy. Future work aims to expand data collection, test with non-GPT models, and conduct end-to-end online routing experiments to validate SEAR's effectiveness across diverse scenarios.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.