Large language models (LLMs) are transforming how computers interpret human emotion in text, but they often struggle with accuracy, consistency, and cost. Researchers have developed a framework that combines the outputs of several LLMs to improve sentiment analysis, achieving up to 6% higher accuracy while making it clearer how each prediction was reached. This addresses key limitations of AI applications in finance and social media, where decisions depend on reliable sentiment interpretation.
The key finding is that the Bayesian Network LLM Fusion (BNLF) framework enhances sentiment classification by combining predictions from models like FinBERT, RoBERTa, and BERTweet. As shown in the paper, this approach boosts accuracy to 78.6% on a combined test set, outperforming individual models and traditional ensemble methods such as majority voting and probability averaging. For example, FinBERT alone achieved 72.1% accuracy, while BNLF's integration led to consistent gains across diverse datasets, including financial news and informal tweets.
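For readers unfamiliar with the two baseline ensembles mentioned above, the following is a minimal sketch of majority voting and probability averaging over three classifiers. The function names, the tie-breaking rule, and the example probabilities are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter

LABELS = ["negative", "neutral", "positive"]


def majority_vote(predictions):
    """Return the most frequent label.

    Ties are broken in favour of the label that appears first in
    `predictions` (an arbitrary choice made for this sketch).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def probability_average(prob_rows):
    """Average per-class probabilities across models.

    Each row is one model's probabilities in LABELS order.
    Returns the winning label and the averaged distribution.
    """
    n_models = len(prob_rows)
    avg = [sum(row[i] for row in prob_rows) / n_models for i in range(len(LABELS))]
    return LABELS[avg.index(max(avg))], avg


# Hypothetical outputs from three models for a single sentence.
voted = majority_vote(["positive", "neutral", "positive"])
averaged_label, averaged_probs = probability_average([
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
])
print(voted, averaged_label, averaged_probs)
```

Both baselines treat every model as equally trustworthy on every input, which is the assumption a learned probabilistic fusion step is meant to relax.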
Methodologically, the researchers employed a late-fusion strategy: each LLM processes the text independently and outputs one of three sentiment predictions (negative, neutral, or positive). These outputs are then fused by a Bayesian network, which models the dependencies and uncertainties among the predictions. The network structure, illustrated in Figure 1 of the paper, includes nodes for the input text, the individual model predictions, and a final probabilistic sentiment node. The network's parameters were learned from training data using standard algorithms, with hyperparameters such as a maximum parent count of 2 to keep the model simple and interpretable.
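To make the idea of probabilistic late fusion concrete, here is a small sketch under strong simplifying assumptions. It uses a naive Bayes structure, in which the true sentiment is the only parent of each model's prediction node, and learns the conditional tables from counts with Laplace smoothing. This is not the structure from Figure 1 of the paper, which also involves the input text; the function names `fit_fusion` and `fuse` and the toy training rows are invented for illustration.

```python
from collections import defaultdict

LABELS = ["negative", "neutral", "positive"]


def fit_fusion(rows, alpha=1.0):
    """Learn P(y) and P(pred_m | y) for each model m from labelled rows.

    rows: list of (true_label, (pred_model_1, pred_model_2, ...)).
    alpha: Laplace smoothing constant (assumed value).
    """
    n_models = len(rows[0][1])
    k = len(LABELS)
    label_counts = {y: 0 for y in LABELS}
    pair_counts = [defaultdict(float) for _ in range(n_models)]
    for y, preds in rows:
        label_counts[y] += 1
        for m, p in enumerate(preds):
            pair_counts[m][(y, p)] += 1
    prior = {y: (label_counts[y] + alpha) / (len(rows) + alpha * k) for y in LABELS}
    cpts = [
        {
            (y, p): (pair_counts[m][(y, p)] + alpha) / (label_counts[y] + alpha * k)
            for y in LABELS
            for p in LABELS
        }
        for m in range(n_models)
    ]
    return prior, cpts


def fuse(prior, cpts, preds):
    """Return the posterior P(y | preds) as a dict over LABELS."""
    scores = {}
    for y in LABELS:
        score = prior[y]
        for m, p in enumerate(preds):
            score *= cpts[m][(y, p)]
        scores[y] = score
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}


# Toy training rows (made up): true label, then three model predictions.
rows = [
    ("positive", ("positive", "positive", "neutral")),
    ("positive", ("positive", "neutral", "positive")),
    ("negative", ("negative", "negative", "neutral")),
    ("negative", ("neutral", "negative", "negative")),
    ("neutral", ("neutral", "neutral", "neutral")),
    ("neutral", ("neutral", "positive", "neutral")),
]
prior, cpts = fit_fusion(rows)
posterior = fuse(prior, cpts, ("positive", "positive", "neutral"))
print(posterior)
```

Unlike a hard vote, the output is a full distribution over the three classes, so disagreement among the models shows up as a flatter posterior rather than being discarded.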
The results show that BNLF improves overall accuracy while keeping performance reasonably balanced across sentiment classes. At the class level, BNLF achieved an F1-score of 0.850 for positive sentiment, 0.639 for negative, and 0.667 for neutral, so positive sentiment is classified most reliably while the other two classes remain harder. The framework did particularly well on challenging datasets such as Twitter Financial News Sentiment (TFNS), where it reached 75.3% accuracy compared with FinBERT's 73.2%. The inference analyses in Scenarios 1 and 2 of the paper demonstrate how BNLF adjusts its certainty to the corpus type: for instance, it shows higher confidence in negative sentiment for formal financial texts and more uncertainty for informal tweets.
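The class-level F1-scores quoted above are the harmonic mean of precision and recall computed separately for each label. The sketch below shows that calculation on a handful of made-up labels; the data and the function name `per_class_f1` are illustrative and do not reproduce the paper's evaluation.

```python
def per_class_f1(y_true, y_pred, labels):
    """Compute F1 = 2*TP / (2*TP + FP + FN) separately for each label."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # Define F1 as 0.0 when the label never occurs and is never predicted.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy labels, invented for this example.
y_true = ["positive", "positive", "negative", "neutral", "neutral", "negative"]
y_pred = ["positive", "neutral", "negative", "neutral", "positive", "negative"]
f1 = per_class_f1(y_true, y_pred, ["negative", "neutral", "positive"])
print(f1)
```

Reporting the score per class, rather than a single accuracy figure, is what reveals whether a model is trading off one sentiment class against another.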
In context, this work matters because it makes AI sentiment analysis more practical and trustworthy for real-world use. By leveraging medium-sized models (e.g., 110M–135M parameters), BNLF reduces computational demands, making it feasible for resource-constrained environments without sacrificing performance. This is especially relevant in finance, where rapid sentiment assessment of news and social media can inform trading decisions, and in social platforms, where understanding public opinion requires handling sarcasm and slang. The framework's interpretability, through influence strength diagrams like Figure 8, helps users trace how predictions are formed, aligning with principles of transparent AI.
Limitations from the paper include the framework's reliance on discrete sentiment classifications rather than continuous confidence scores, which may restrict nuance in uncertain cases. Additionally, the current implementation does not dynamically learn network structures from data, potentially missing complex dependencies. Future work could expand to multilingual corpora or temporal analysis for evolving sentiments, but for now, BNLF offers a scalable step toward more reliable and explainable AI in text analysis.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.