Organizations increasingly rely on multiple large language models (LLMs) to classify data, from diagnosing medical conditions to categorizing customer intent, but deciding how often to query each model has been a costly guessing game. A new study provides a rigorous framework for allocating queries across heterogeneous AI models to minimize expenses while guaranteeing accuracy for every possible outcome, addressing a critical inefficiency in real-world deployments. This approach could transform how companies leverage AI ensembles, moving from ad-hoc heuristics to data-driven planning that balances cost and reliability.
The researchers formulated this as an optimization problem: given multiple LLMs with different per-query costs and different abilities to distinguish between labels, find the cheapest set of query counts that ensures the misclassification probability stays below a target for each ground-truth label. They proved this problem is NP-hard, meaning it's computationally intractable to solve exactly for all instances, as shown by a reduction from the minimum-weight set cover problem. This hardness arises because ensuring correct classification for every label forces a combinatorial selection of models whose discriminative powers collectively cover all label pairs at minimum total cost, mirroring the problem of covering a universe with weighted sets.
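The set-cover analogy can be made concrete with a toy example. In the sketch below, each model "covers" the label pairs it can distinguish, and we look for the cheapest subset of models whose coverage spans every pair; the model names, costs, and coverage sets are invented for illustration, and the brute-force search is only meant to expose the combinatorial structure, not the paper's method.

```python
from itertools import combinations

# Hypothetical example: each model covers the label pairs it can distinguish.
labels = ["A", "B", "C"]
pairs = set(combinations(labels, 2))
models = {  # name: (per-query cost, label pairs it can distinguish)
    "m1": (1.0, {("A", "B")}),
    "m2": (2.0, {("A", "B"), ("A", "C")}),
    "m3": (2.5, {("B", "C"), ("A", "C")}),
}

def cheapest_cover(models, pairs):
    """Brute-force min-cost subset of models whose coverage includes all pairs."""
    best_cost, best_set = float("inf"), None
    names = list(models)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            covered = set().union(*(models[m][1] for m in subset))
            cost = sum(models[m][0] for m in subset)
            if covered >= pairs and cost < best_cost:
                best_cost, best_set = cost, set(subset)
    return best_cost, best_set

cost, chosen = cheapest_cover(models, pairs)
print(cost, sorted(chosen))  # 3.5 ['m1', 'm3']
```

Because weighted set cover is NP-hard, any exact solver for the query-allocation problem inherits that hardness, which is what motivates the surrogate described next.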
To overcome this intractability, the team developed a tractable surrogate by bounding the error probability. They used a union bound to decompose the multi-class error into pairwise comparisons between the true label and competitors, then applied Chernoff-type concentration bounds to each pairwise term. The resulting upper bound, denoted P_e(y; r) in the paper, factorizes multiplicatively across models and query counts, making constraint evaluation efficient. This surrogate is feasibility-preserving: any query plan satisfying the surrogate constraints automatically meets the original accuracy requirements, as established in Theorem 2.
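The multiplicative structure of the surrogate can be sketched as follows. Here `rho[m][(y, yp)]` stands in for a per-query Chernoff coefficient in (0, 1] for model m on the pair (y, yp); the function name, the dictionary layout, and the specific values are illustrative assumptions, not the paper's notation, but the shape — a union bound summing over competitors, with each pairwise term factorizing across models and query counts — matches the description above.

```python
def surrogate_error(y, r, rho, labels):
    """Union bound over competing labels; each pairwise term
    factorizes multiplicatively across models and query counts."""
    total = 0.0
    for yp in labels:
        if yp == y:
            continue
        term = 1.0
        for m, queries in r.items():
            # Each query to model m multiplies in one Chernoff coefficient.
            term *= rho[m][(y, yp)] ** queries
        total += term
    return total

labels = ["A", "B"]
rho = {"m1": {("A", "B"): 0.8, ("B", "A"): 0.8}}
# Doubling the query count squares the pairwise term: exponential decay.
print(surrogate_error("A", {"m1": 5}, rho, labels))   # 0.8**5  ≈ 0.328
print(surrogate_error("A", {"m1": 10}, rho, labels))  # 0.8**10 ≈ 0.107
```

Because the bound is a sum of products, checking whether a candidate query plan satisfies all accuracy constraints takes time linear in the number of models and labels, which is what makes the surrogate tractable to optimize.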
The analysis shows this surrogate is asymptotically tight at the optimization level. As error tolerances shrink, the ratio of the surrogate-optimal cost to the true optimal cost converges to one, with an explicit rate of O(log log(1/α_min)/log(1/α_min)), where α_min is the smallest error tolerance. This means that in high-reliability regimes, solving the tractable surrogate yields essentially the same minimum cost as solving the intractable exact problem, preserving the first-order cost structure. The gap is negligible because the Chernoff proxy captures the correct exponential decay rate of error probabilities, losing only a polynomial prefactor that requires minimal additional query effort to absorb.
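The stated rate is easy to check numerically: the quantity log log(1/α_min)/log(1/α_min) shrinks toward zero as the tolerance tightens, so the surrogate-to-optimal cost ratio approaches one. The short sketch below just evaluates that expression for a few tolerances (constant factors in the O(·) bound are omitted).

```python
import math

def rate(alpha_min):
    """Evaluate log log(1/alpha) / log(1/alpha), the stated gap rate."""
    x = math.log(1.0 / alpha_min)
    return math.log(x) / x

for a in (1e-2, 1e-4, 1e-8, 1e-16):
    print(f"alpha_min={a:.0e}  rate≈{rate(a):.4f}")
```

For tolerances of 1e-2 down to 1e-16 the rate falls from roughly 0.33 to under 0.10, illustrating why the surrogate is essentially lossless in high-reliability regimes.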
Building on this, the researchers designed an asymptotic fully polynomial-time approximation scheme (AFPTAS) that returns a surrogate-feasible query plan within a (1 + ε) factor of the surrogate optimum for any desired accuracy ε > 0. Algorithm 1 discretizes the continuous tilting parameters into a finite grid, then uses dynamic programming over rounded discrimination weights to find a near-optimal plan. This algorithm is polynomial in the number of models, log(1/α_min), and 1/ε for a fixed number of labels, making it practical for real-world applications where error tolerances are small and costs matter.
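The dynamic-programming core of the scheme can be illustrated on a single accuracy constraint. Assume each query to model m contributes a discrimination weight w_m (think -log of its Chernoff coefficient) at cost c_m, and the plan must accumulate total weight at least log(1/α). Rounding weights to a grid turns this into a knapsack-cover DP. This is a simplified single-pair sketch under those assumptions; the paper's Algorithm 1 additionally handles all label pairs and the discretized tilting parameters.

```python
import math

def min_cost_plan(costs, weights, demand, step=0.05):
    """Min-cost integer query counts whose rounded weights cover `demand`."""
    buckets = math.ceil(demand / step)            # rounded weight target
    w_units = [max(1, round(w / step)) for w in weights]
    INF = float("inf")
    dp = [0.0] + [INF] * buckets                  # dp[b] = min cost to cover b units
    plan = [dict() for _ in range(buckets + 1)]   # plan[b] = query counts achieving dp[b]
    for b in range(1, buckets + 1):
        for m, (c, wu) in enumerate(zip(costs, w_units)):
            prev = max(0, b - wu)                 # one more query to model m
            if dp[prev] + c < dp[b]:
                dp[b] = dp[prev] + c
                plan[b] = dict(plan[prev])
                plan[b][m] = plan[b].get(m, 0) + 1
    return dp[buckets], plan[buckets]

# Two hypothetical models: cheap but weak vs. pricey but discriminative.
cost, queries = min_cost_plan(costs=[1.0, 3.0],
                              weights=[0.2, 0.9],
                              demand=math.log(1 / 1e-3))
print(cost, queries)  # 24.0 {1: 8}
```

The grid step controls the rounding error, which is where the (1 + ε) factor comes from: a finer grid tightens the approximation at the price of more DP states, consistent with the runtime being polynomial in 1/ε.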
The implications are significant for any domain using multiple LLMs for classification. In healthcare, hospitals could optimize diagnostic model queries to reduce API fees while maintaining reliability across conditions. Streaming platforms might better allocate queries to intent-classification models, balancing latency and accuracy. Legal document-review pipelines could cut costs without compromising text classification quality. The framework provides a principled alternative to trial-and-error heuristics, enabling organizations to design query plans upfront based on model performance and cost data.
However, the approach has limitations. It assumes an offline setting where the query plan is fixed before any outputs are observed, which may not suit adaptive scenarios where decisions evolve with incoming data. The analysis relies on a uniform boundedness condition for log-likelihood ratios and requires every label pair to be distinguishable by at least one model, which might not hold if some models are entirely uninformative for certain comparisons. Additionally, while the AFPTAS guarantees efficiency, its practical runtime depends on parameters like the number of labels and error tolerances, though it remains polynomial for fixed label counts.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.