Small AI Models Match Giants in Software Analysis

Software development teams face a critical challenge: analyzing thousands of requirements documents to ensure projects meet specifications while protecting sensitive company information. New research reveals that small, locally deployable AI models can perform this classification task nearly as accurately as massive commercial models, offering significant privacy and cost advantages without sacrificing performance.

The study compared eight AI models on their ability to classify software requirements into categories like functional versus non-functional requirements. The researchers found that while large language models (LLMs) like GPT-5 and Claude-4 showed slightly higher overall accuracy, small language models (SLMs) with 7-8 billion parameters achieved comparable results—despite being 100 to 300 times smaller than their commercial counterparts.

The methodology employed a standardized testing approach across three established software engineering datasets: PROMISE, Reclass, and SecReq. Each model was evaluated using chain-of-thought prompting combined with few-shot examples, running each test three times and using majority voting for final classifications. The SLMs were deployed locally on standard research hardware, while LLMs were accessed through commercial APIs, mirroring real-world deployment scenarios.
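The prompting and voting steps described above can be sketched as follows. This is a minimal illustration, not the study's actual code: the example requirements, the prompt wording, and the names `FEW_SHOT`, `build_prompt`, and `majority_vote` are all assumptions introduced here, and the model call itself is left out.

```python
from collections import Counter

# Hypothetical few-shot examples; the study's real prompts are not reproduced here.
FEW_SHOT = [
    ("The system shall export reports as PDF.", "functional"),
    ("Pages shall load within two seconds.", "non-functional"),
]


def build_prompt(requirement):
    """Assemble a few-shot prompt that asks for step-by-step reasoning."""
    parts = [
        "Classify each requirement as functional or non-functional.",
        "Reason step by step, then give the label.",
    ]
    for text, label in FEW_SHOT:
        parts.append(f"Requirement: {text}\nLabel: {label}")
    # The requirement to classify goes last, with the label left open.
    parts.append(f"Requirement: {requirement}\nLabel:")
    return "\n\n".join(parts)


def majority_vote(labels):
    """Return the label predicted most often across repeated runs."""
    if not labels:
        raise ValueError("need at least one prediction")
    return Counter(labels).most_common(1)[0][0]


# Three hypothetical runs of one model on the same requirement.
runs = ["functional", "non-functional", "functional"]
final_label = majority_vote(runs)
```

With three runs and two possible labels, one label always has a strict majority, so no tie-breaking rule is needed in the binary case.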

Results showed that LLMs achieved an average F1 score of 0.818 across all datasets, while SLMs reached 0.793, a gap of 0.025 (about 2.5 points on a 100-point scale) that statistical analysis found was not significant. More importantly, SLMs demonstrated specific strengths in critical areas. On the Reclass dataset, models like Qwen2-7B and Falcon3-7B achieved recall scores of 0.96, higher than any of the LLMs, meaning they missed fewer relevant instances and produced fewer false negatives. The top-performing SLM, Llama-3-8B, reached F1 scores of 0.76, 0.78, and 0.88 across the three datasets, close to the best LLM scores of 0.81, 0.77, and 0.89 respectively and marginally ahead on the second.
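For readers less familiar with these metrics, the sketch below shows how recall and F1 are computed for a binary classifier, and works out the gap between the two reported averages. The label lists are toy data invented for illustration, and the function name `precision_recall_f1` is an assumption; only the values 0.818 and 0.793 come from the study.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall falls as false negatives (missed positives) rise.
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy data (not from the study): 1 = relevant requirement, 0 = not relevant.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

# Reported average F1 scores from the study.
llm_f1, slm_f1 = 0.818, 0.793
absolute_gap = llm_f1 - slm_f1        # difference in F1 points
relative_gap = absolute_gap / llm_f1  # difference as a fraction of the LLM score
```

A recall of 0.96 means that only about 4 in every 100 truly relevant requirements were missed, which matters when an overlooked requirement is costlier than a false alarm.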

This finding matters because software requirements often contain proprietary business information that companies cannot risk exposing to external AI services. The ability to deploy capable AI models locally addresses fundamental privacy and security concerns while maintaining classification accuracy. For organizations handling sensitive software specifications—from healthcare systems to financial applications—this represents a practical solution to leverage AI capabilities without compromising data confidentiality.

The study acknowledges several limitations. The analysis focused exclusively on binary classification tasks, leaving open questions about performance on more complex requirements engineering activities. Additionally, the relatively small sample of eight models may have limited the statistical power to detect subtle performance differences. The research also did not control for execution speed: SLMs took approximately 400 seconds per task compared to 138-300 seconds for commercial models, though this gap reflects infrastructure differences rather than inherent model capabilities.

These findings challenge the assumption that larger models necessarily deliver superior performance for specialized tasks. The research demonstrates that for software requirements classification—a critical step in ensuring project success—smaller, more deployable models offer a viable alternative that balances accuracy with practical considerations of privacy, cost, and customization.

About the Author

Guilherme A.