In the world of artificial intelligence, categorizing text accurately is a fundamental challenge, especially when dealing with unstructured content like articles or social media posts. This task requires compressing rich, detailed information into a limited set of predefined categories, such as those in the Interactive Advertising Bureau (IAB) taxonomy, which spans 698 categories across four hierarchical levels. Single large language models (LLMs) often struggle with this, producing inconsistencies, hallucinations (generating categories that do not exist in the taxonomy), and category inflation (assigning too many labels). These issues can make outputs unreliable, so applications in areas like advertising, content moderation, and data indexing cannot depend on AI alone. The ensemble approach introduced here, called eLLM, addresses these weaknesses by pooling the strengths of multiple models, offering a more robust solution that could transform how text classification is automated.
The key finding from this research is that combining multiple LLMs into an ensemble significantly improves categorization performance. In experiments with ten state-of-the-art models, including Claude, Gemini, and GPT variants, the ensemble framework achieved up to a 65% increase in F1-score (the harmonic mean of precision and recall) compared to the best single model. For instance, while the top-performing individual LLM had an F1-score of 0.55, ensembles of just two models reached 0.73, and groups of ten models hit 0.92. The ensemble not only reduces errors such as hallucinations and category inflation but also approaches the consistency of human expert annotations, as demonstrated on a dataset of 8,660 human-labeled text samples. By leveraging collective decision-making, the eLLM system selects only categories with strong consensus among models, leading to more reliable and precise classifications.
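To make the reported metric concrete, here is a minimal sketch of multi-label F1 as the harmonic mean of precision and recall over category sets. The function name and signature are illustrative, not taken from the paper:

```python
def f1_score(predicted, actual):
    """Multi-label F1: harmonic mean of precision and recall over label sets."""
    predicted, actual = set(predicted), set(actual)
    if not predicted or not actual:
        return 0.0
    tp = len(predicted & actual)      # correctly predicted categories
    precision = tp / len(predicted)   # fraction of predictions that are right
    recall = tp / len(actual)         # fraction of true labels recovered
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Category inflation hurts precision: a spurious extra label drops F1
# from 1.0 to 2/3 even though the true category was found.
f1_score(["Fiction", "Science"], ["Fiction"])
```

This illustrates why the metric penalizes both hallucinated extra categories (lower precision) and missed ones (lower recall), which is exactly the failure pair the ensemble targets.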
The methodology behind this breakthrough treats each LLM as an independent expert and uses a collective decision-making (CDM) algorithm to combine their predictions. Researchers evaluated models under uniform zero-shot conditions, meaning no task-specific training was used, and applied structured prompts to guide categorization through the IAB taxonomy's hierarchical levels. The CDM framework calculates a relevance score for each category based on factors such as how often it appears across models (popularity), its depth in the taxonomy (importance), and its semantic proximity to other categories. Only categories whose scores exceed a consensus threshold, optimally set at 0.65 after testing, are included in the final output. This process mimics how a committee of human experts might reach a decision, aggregating diverse perspectives to cancel out individual errors and improve overall accuracy.
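The scoring step described above can be sketched roughly as follows. The weighting scheme, weights, and function names here are assumptions for illustration; the paper's exact formula also folds in semantic proximity between categories, which is omitted for brevity:

```python
from collections import Counter

def consensus_categories(model_outputs, depth, threshold=0.65,
                         w_pop=0.7, w_depth=0.3):
    """Hypothetical CDM scoring: combine how often a category appears
    across models (popularity) with its taxonomy depth (importance),
    keeping only categories above the consensus threshold."""
    n_models = len(model_outputs)
    max_depth = max(depth.values())
    counts = Counter(cat for labels in model_outputs for cat in set(labels))
    consensus = {}
    for cat, count in counts.items():
        popularity = count / n_models                # fraction of models proposing it
        importance = depth.get(cat, 1) / max_depth   # deeper IAB levels are more specific
        score = w_pop * popularity + w_depth * importance
        if score >= threshold:
            consensus[cat] = round(score, 3)
    return consensus

# Nine of ten models agree on one category; a lone dissenting vote
# falls below the 0.65 threshold and is discarded.
outputs = [["Fiction"]] * 9 + [["Science"]]
consensus_categories(outputs, depth={"Fiction": 2, "Science": 2})
```

The key design idea is that a category must be both widely proposed and reasonably specific to survive, so one model's hallucination cannot reach the output on its own.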
Results from the study provide clear evidence of the ensemble's superiority. In one example, a text passage about a fictional story was categorized by ten LLMs, with individual outputs varying widely: some models suggested categories like 'Religion & Spirituality' or 'Science,' while others stuck to 'Fiction' and 'Young Adult Literature.' After applying the CDM algorithm, the ensemble consensus correctly identified only 'Fiction' and 'Young Adult Literature,' matching the human expert benchmark exactly. Across the full dataset, the gains were substantial: a two-model ensemble improved F1-score by 33%, and a ten-model ensemble boosted it by 67%. Tables in the paper detail how larger ensembles consistently outperformed smaller ones, with F1-scores rising from 0.73 for two models to 0.92 for ten, while also reducing hallucination rates and category inflation, making the outputs more trustworthy and better aligned with real-world needs.
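The worked example above can be reproduced with even a bare popularity vote. Only the category names come from the paper's example; the per-model outputs below are invented to illustrate the mechanism, and the simple vote is a stand-in for the full relevance scoring:

```python
from collections import Counter

def ensemble_vote(model_outputs, threshold=0.65):
    """Keep only categories proposed by at least `threshold` of the models
    (a simplified stand-in for the paper's full CDM relevance score)."""
    n = len(model_outputs)
    counts = Counter(cat for labels in model_outputs for cat in set(labels))
    return sorted(cat for cat, c in counts.items() if c / n >= threshold)

# Illustrative outputs for ten models (invented):
outputs = (
    [["Fiction", "Young Adult Literature"]] * 7
    + [["Fiction", "Religion & Spirituality"]]
    + [["Fiction", "Science"]]
    + [["Young Adult Literature"]]
)
print(ensemble_vote(outputs))  # ['Fiction', 'Young Adult Literature']
```

The stray 'Religion & Spirituality' and 'Science' votes each appear in only one of ten models (10%), far below the 65% consensus bar, so they are filtered out while the two majority categories survive.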
The implications of this research are far-reaching for practical applications. In fields such as programmatic advertising, academic indexing, and regulatory compliance, where accurate text categorization is crucial, the eLLM framework could enable fully automated pipelines that reduce reliance on costly human labeling. By achieving near-human performance, ensembles offer a scalable solution that maintains high precision and recall, potentially cutting operational costs and increasing efficiency. In content moderation, for example, this could mean faster and more consistent filtering of inappropriate material, while in data governance it might improve how organizations organize and retrieve information. The ability to combine models from different architectures and training backgrounds also points toward collaborative AI, where teamwork among algorithms achieves better outcomes than any single system could alone.
Despite these advantages, the approach has limitations, primarily related to computational cost. Running multiple LLMs in parallel increases token processing expenses, with pricing varying from $0.04 to $30.00 per million tokens, as noted in the paper's cost analysis. This could make large-scale deployments expensive, though the authors suggest that declining inference costs over time may mitigate this issue. Additionally, the study focused on the IAB taxonomy, which, while comprehensive, has only 698 categories and may not capture all nuances of broader taxonomies like DMOZ with over 750,000 nodes. Future work could explore dynamic ensemble compositions or cost-performance optimizations to make the system more accessible, but for now, the trade-off between accuracy and expense remains a key consideration for real-world adoption.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.