
AI Routing Success Depends on Hidden System Choices

A new study reveals that how AI systems package their decisions matters more than which model you choose—and getting it wrong can break critical workflows.

AI Research
April 03, 2026
4 min read

When artificial intelligence systems route requests between different functions—like deciding whether a user needs customer support or technical documentation—the conventional wisdom has focused on choosing the right language model. But new research shows that a hidden layer of system design choices can make or break these routing decisions, with consequences for everything from customer service to specialized business applications. The study demonstrates that how AI systems package their structured outputs matters just as much as which model generates them, revealing that no single approach works universally across different AI backends.

Researchers from Universiti Sains Malaysia conducted a comprehensive benchmark covering 48 deployment configurations and 15,552 routed requests across three major AI backends: OpenAI, Gemini, and Llama. Their central finding challenges common assumptions: there is no universal best way to structure routing decisions. Instead, the interaction between backend systems and runtime packaging choices dominates performance. Modes that remain highly reliable on Gemini and OpenAI can suffer substantial correctness degradation on Llama, while efficiency gains from compressed output formats are strongly backend-dependent. This means that system designers cannot simply copy successful configurations from one platform to another without risking serious performance drops.

The study reframed structured routing as a "runtime burden-allocation problem," examining how structural work gets distributed across the AI generation stack. Researchers tested four different runtime modes that varied along three key dimensions: how much schema construction the model must perform, whether transport uses streaming, and where final structure realization happens. The MJ mode used minimal JSON output with the model emitting the final structured record directly. SJ retained direct JSON emission but relaxed output budget constraints. MJS preserved JSON targets while adding streaming transport. MCLR used compressed plain-text output with deterministic local reconstruction to create the final JSON structure. This systematic approach allowed researchers to isolate how packaging choices—not just model capabilities—affect routing performance.

The data reveals striking backend-specific patterns. On Gemini and OpenAI, direct JSON modes (MJ and SJ) preserved correctness most consistently, achieving 86.11% routing accuracy on Gemini and above 85% on OpenAI with perfect format compliance. However, when using compressed local reconstruction (MCLR), routing accuracy dropped to 62.96% on Gemini and 58.49% on OpenAI despite efficiency gains. Llama showed the sharpest incompatibility: while MJ and SJ maintained about 82.3-82.4% routing accuracy, MCLR collapsed to just 22.84% accuracy with format compliance falling to 53.40%. Statistical analysis confirmed these patterns, with backend × mode interaction showing extremely large effect sizes (partial eta squared of 0.960 for routing accuracy), indicating this interaction dominates performance more than individual backend or mode effects alone.
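Partial eta squared, the effect-size measure quoted above, is computed from ANOVA sums of squares as SS_effect / (SS_effect + SS_error). A minimal sketch with illustrative numbers (not the study's data):

```python
def partial_eta_squared(ss_effect: float, ss_error: float) -> float:
    """Partial eta^2 = SS_effect / (SS_effect + SS_error): the share of
    variance, with other effects partialled out, attributable to this effect."""
    return ss_effect / (ss_effect + ss_error)

# Illustrative values only. A value of 0.96, as reported for the
# backend x mode interaction, means that term absorbs ~96% of the
# variance remaining after the other effects are accounted for.
print(round(partial_eta_squared(ss_effect=24.0, ss_error=1.0), 3))
```

Values this close to 1.0 are why the authors conclude the interaction, not either factor alone, drives routing accuracy.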

These findings have immediate practical implications for any organization using AI for workflow routing. The study introduces a "workflow lower-bound completion" metric that estimates the minimum rate at which a router can hand downstream systems a usable, correctly dispatched control record. On Gemini, direct JSON packages sustained 61.11% workflow completion, while compressed modes fell to 31.71%. On OpenAI, the drop was even more dramatic—from 57.5% to 8.31%. On Llama, compressed modes collapsed to 0% workflow completion. This means that efficiency-focused packaging choices could render critical specialist routes—like developer support or document retrieval—completely unreliable while appearing efficient on surface metrics like token usage and latency.
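The article does not give the exact formula for workflow lower-bound completion, but one conservative reading of "usable, correctly dispatched" is the joint rate of format-valid and correctly routed responses. The sketch below implements that assumed definition on a tiny illustrative batch, not the study's data:

```python
def workflow_lower_bound(results: list[dict]) -> float:
    """Conservative completion bound: fraction of requests yielding a
    format-valid record AND the correct route. This is an assumed
    reading of the paper's metric, not its published definition."""
    usable = sum(1 for r in results if r["format_valid"] and r["route_correct"])
    return usable / len(results)

# Tiny illustrative batch: high format compliance does not guarantee
# a high lower bound if routes are wrong, and vice versa.
batch = [
    {"format_valid": True,  "route_correct": True},
    {"format_valid": True,  "route_correct": False},
    {"format_valid": False, "route_correct": False},
    {"format_valid": True,  "route_correct": True},
]
print(workflow_lower_bound(batch))
```

A metric like this explains how a mode can look efficient on tokens and latency yet bound workflow completion near zero, as MCLR did on Llama.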

The research acknowledges several limitations that define its scope. The study focused on a compact four-route ontology (chat, task, dev, doc) rather than large routing spaces, and the prompt pool, while diverse, was smaller than real production traffic. The methodology doesn't directly measure downstream task success, instead focusing on routing quality, structure validity, and state retention. Additionally, the compressed local reconstruction mode bundles multiple design choices together, making it difficult to isolate the marginal effect of each micro-design decision. These limitations mean the findings apply most directly to small-schema, enumerable-control routing tasks common in enterprise applications, but may not generalize to very large or open-ended routing ontologies.

Original Source

Read the complete research paper on arXiv.

About the Author

Guilherme A.


Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
