AIResearch
Ethics

AI Generates Fake Social Media Data That Works Like Real

A new method uses AI explanations to create synthetic emotional text that matches real data performance for sentiment analysis, but with a hidden trade-off in linguistic diversity.

AI Research
March 26, 2026
4 min read

As social media platforms restrict data access and costs soar, researchers face a critical problem: how to train AI systems to understand public sentiment without enough real examples. A new study offers a promising solution by using AI's own explanations to generate synthetic emotional text that performs as well as real data for emotion recognition tasks. This approach could help organizations analyze public opinion, track social movements, and understand political sentiment without the prohibitive costs of acquiring authentic social media data.

The researchers discovered that when guided by interpretability techniques, synthetic data can match the performance of real data expansion for emotion classification. Their SHAP-guided approach, which uses Shapley Additive Explanations to identify emotion-relevant keywords, achieved F1 scores between 0.48 and 0.53 across all data increments from 1,000 to 2,000 samples, on par with real data expansion. Most notably, the approach excelled for underrepresented emotion classes like optimism, which comprised only 8.8% of the data, consistently outperforming both real data expansion and naive generation. For the majority anger class (42% of the data), all strategies showed stable performance with marginal differences, indicating that adequate training data reduces sensitivity to augmentation quality.
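The study's headline metric, F1 under class imbalance, can be illustrated with a short sketch. The toy data and class weights below only echo the anger/optimism split described above; the paper used XGBoost, for which scikit-learn's gradient-boosting classifier stands in here.

```python
# Minimal sketch: macro and per-class F1 on an imbalanced 3-class toy
# problem (~42% majority, ~9% minority, mirroring the study's split).
# Synthetic features stand in for tweet representations.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=2000, n_classes=3, n_informative=8,
    weights=[0.42, 0.49, 0.09], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)

macro = f1_score(y_test, pred, average="macro")   # single headline number
per_class = f1_score(y_test, pred, average=None)  # exposes minority classes
print(f"macro-F1: {macro:.2f}, per-class: {per_class.round(2)}")
```

Reporting per-class F1 alongside the macro average is what makes minority-class effects like the optimism result visible at all; a macro score alone can mask them.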

The methodology combines traditional machine learning with large language model generation in a carefully designed framework. The researchers first identified the most reliable classifier among six traditional ML algorithms, selecting XGBoost for its consistent high performance and stability. They then used SHAP analysis on seed data to extract emotion-specific keywords through a differential TF-IDF scoring system that identifies terms more frequent in target emotions than in others. These keywords were filtered by SHAP importance scores to exclude frequent but non-predictive words, then incorporated into LLM generation prompts alongside real tweet exemplars. The system instructed the LLM to naturally include positive keywords and avoid negative ones, creating synthetic emotional text that mimics authentic social media patterns while emphasizing discriminative features.

The results reveal a complex relationship between synthetic data quality and seed data quantity. With a 1,000-sample baseline, SHAP-guided augmentation achieved performance parity with real data expansion, while naive generation showed consistent degradation from 0.48 to 0.45 F1. This effectiveness diminished with smaller seed sizes, however: at 500 samples, SHAP-guided generation initially matched real data but plateaued at higher increments, and at 100 samples it offered only marginal improvement over naive generation while remaining consistently below real data. The linguistic analysis uncovered a fundamental trade-off: synthetic text exhibited reduced vocabulary richness, with Type-Token Ratios of 0.133 for SHAP-guided and 0.143 for naive generation versus 0.241 for real data, while showing higher lexical overlap with real data across all emotions.
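The diversity metric behind those numbers, the Type-Token Ratio (TTR), is simply unique words divided by total words. The sample strings below are invented for illustration; only the direction of the gap (real above synthetic) mirrors the study's finding.

```python
# Sketch: Type-Token Ratio over a corpus, with whitespace tokenization.
# Repetitive synthetic-style text scores lower than varied real-style text.
def type_token_ratio(texts: list[str]) -> float:
    tokens = [tok for t in texts for tok in t.lower().split()]
    return len(set(tokens)) / len(tokens)

real = [
    "honestly can't believe how wild today got",
    "my cat knocked over the plant again lol",
]
synthetic = [
    "feeling hopeful today",
    "feeling hopeful and happy today",
]

print(f"real TTR: {type_token_ratio(real):.3f}")
print(f"synthetic TTR: {type_token_ratio(synthetic):.3f}")
```

One caveat worth knowing: raw TTR falls as corpus size grows, so comparisons like the study's are only meaningful between corpora of similar size.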

The implications extend beyond technical performance to practical applications in sentiment analysis and AI development. This approach addresses fairness concerns by enhancing representation for minority emotion classes, potentially improving emotion recognition systems' ability to detect less common but important emotional states in public discourse. However, the reduced linguistic diversity of synthetic data, characterized by fewer personal expressions and less temporal complexity than authentic posts, means it cannot fully replace real data for capturing social media's evolving linguistic nuances. The method provides a viable solution for data-constrained scenarios but requires minimum seed data thresholds and continued real data acquisition to prevent overfitting to repetitive patterns.

Several limitations constrain the current approach's applicability. The study found that interpretability-guided generation requires minimally sufficient seed data to extract meaningful emotional patterns, with performance advantages diminishing below 500 samples. The synthetic text's reduced lexical diversity and underrepresentation of personal voice patterns may limit generalization across evolving social media contexts and demographic variations. Additionally, the research focused on traditional ML classifiers rather than transformer-based architectures, and the approach's effectiveness across social media platforms with differing linguistic norms remains untested. These constraints establish clear guardrails for responsible synthetic data use in emotion recognition.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn