Large language models have become go-to tools for information-seeking tasks, providing coherent answers that eliminate the need to sift through numerous documents. However, these models can produce factual errors—information that contradicts reliable sources like Wikipedia—which are particularly critical in domains such as health, law, finance, and security. These errors are difficult for users to detect because responses appear plausible, contain no obvious contradictions, and are expressed confidently. While various approaches exist to enhance factuality, such as using external knowledge and improving training techniques, evaluating long-form generation remains challenging, especially for rare entities where models' knowledge is weaker.
Researchers have developed a new dataset called RIDIC that systematically tests how well language models handle facts about entities with different levels of popularity. The dataset contains 3,000 entities across three domains: rivers, natural disasters, and car models, each categorized into head (most popular), torso, and tail (least popular) tiers based on Wikipedia pageview statistics. This approach builds on previous work like FActScore but extends it in size, domains, and languages, allowing for a more comprehensive evaluation of long-form factual accuracy in both English and Chinese. The key finding is that even frontier models hallucinate when generating information about rare entities, with factuality scores dropping significantly for tail-tier items compared to head-tier ones.
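The head/torso/tail split can be illustrated with a short sketch. The function below ranks entities by pageviews and bins them into thirds; the actual tier cutoffs and popularity signals used by the dataset may differ, and the entity names and counts here are purely illustrative.

```python
# Illustrative sketch: bin entities into head/torso/tail tiers by pageviews.
# The real dataset's thresholds and popularity metrics may differ.

def assign_tiers(entities, head_frac=1/3, tail_frac=1/3):
    """Sort entities by pageviews (descending) and split into three tiers."""
    ranked = sorted(entities, key=lambda e: e["pageviews"], reverse=True)
    n = len(ranked)
    head_cut = int(n * head_frac)          # top slice -> "head"
    tail_cut = n - int(n * tail_frac)      # bottom slice -> "tail"
    tiers = {}
    for i, entity in enumerate(ranked):
        if i < head_cut:
            tiers[entity["name"]] = "head"
        elif i < tail_cut:
            tiers[entity["name"]] = "torso"
        else:
            tiers[entity["name"]] = "tail"
    return tiers

# Hypothetical pageview counts, for illustration only.
sample = [
    {"name": "Amazon River", "pageviews": 120_000},
    {"name": "Rhine", "pageviews": 40_000},
    {"name": "Vjosa", "pageviews": 1_200},
]
print(assign_tiers(sample))
```

With this toy sample, the most-viewed entity lands in the head tier and the least-viewed in the tail tier, mirroring the dataset's tiering idea.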
The methodology involves a flexible pipeline for generating datasets with controlled entity popularity distributions. First, entities are collected from Wikidata using SPARQL queries for specific classes, such as rivers or natural disasters. Then, popularity metrics are calculated from Wikipedia data, including pageviews, incoming hyperlinks, edits, and page length. Entities are sampled according to the desired popularity and location characteristics, with additional criteria, such as excluding Wikipedia stubs, to ensure sufficient evidence for evaluation. Finally, evidence is collected from Wikipedia pages, search results, and linked pages to verify the facts in LLM responses. The pipeline is multilingual, supporting languages such as English and Chinese, and the code and data are publicly available for reproducibility.
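The first pipeline step above can be sketched as a SPARQL query builder. This is a minimal, hedged example of how one might query Wikidata for instances of a class (Q4022 is Wikidata's "river" class); the paper's actual queries, filters, and endpoint usage are not shown in this summary and may differ.

```python
# Sketch of the entity-collection step: build a SPARQL query that selects
# all items that are instances of a given Wikidata class. Sending it to the
# Wikidata Query Service endpoint is left out to keep the example offline.

def build_entity_query(wikidata_class, limit=100):
    """Return a SPARQL query string for instances of `wikidata_class`."""
    return f"""
    SELECT ?item ?itemLabel WHERE {{
      ?item wdt:P31 wd:{wikidata_class} .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en,zh". }}
    }}
    LIMIT {limit}
    """

# Q4022 is the Wikidata class for "river".
print(build_entity_query("Q4022", limit=10))
```

Property `wdt:P31` ("instance of") is the standard way to enumerate class members on Wikidata; richer queries would also traverse subclasses via `wdt:P279`.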
Results from evaluating three LLMs—Llama-3-8B, Qwen2.5-7B, and GPT-5—on the RIDIC dataset show clear patterns. Factuality scores, measured as the ratio of supported facts to total facts extracted, are strongly correlated with entity popularity. For English generations, scores preserve the Head > Torso > Tail ordering across all models and domains. For example, Llama's factuality drops from 0.58 on head rivers to 0.25 on tail rivers, a more than twofold decrease. GPT-5 performs better overall but still shows gaps, with scores ranging from 0.74-0.88 on head entities to 0.50-0.63 on tail entities. In Chinese, all models show lower factuality scores than in English, with totals ranging from 0.21 to 0.42, indicating persistent challenges in non-English evaluation.
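The factuality metric itself is simple once facts have been extracted and verified. The sketch below computes the supported-facts ratio; the hard parts—decomposing a generation into atomic facts and checking each against evidence, typically done with an LLM in FActScore-style pipelines—are abstracted into a list of booleans here.

```python
# Sketch of the factuality score: fraction of extracted atomic facts
# that are supported by the collected evidence.

def factuality_score(fact_labels):
    """fact_labels: list of booleans, True if a fact is supported."""
    if not fact_labels:
        return 0.0
    return sum(fact_labels) / len(fact_labels)

# Hypothetical example: 7 of 12 extracted facts are supported.
print(round(factuality_score([True] * 7 + [False] * 5), 2))  # → 0.58
```

A response about a head entity with mostly supported facts scores high; for a tail entity, where the model's parametric knowledge is thin, more facts fail verification and the score drops.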
Additional analysis reveals that factuality varies by domain, with disasters and cars receiving higher scores than rivers, and that incorporating richer evidence (such as search results and linked pages) can sometimes harm accuracy, especially in Chinese. The study also found that smaller LLMs struggle more with long-tail facts due to limited parametric capacity, and that vocabulary richness in generations decreases for less popular entities. These findings have important implications for real-world applications, as they highlight the risks of relying on LLMs for accurate information about niche topics, particularly in multilingual contexts where reference data may be scarce.
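One common way to quantify the vocabulary-richness effect mentioned above is the type-token ratio (distinct words divided by total words); whether the paper uses this exact metric is an assumption, so the sketch below is illustrative only.

```python
# Illustrative metric: type-token ratio (TTR) as a proxy for vocabulary
# richness. A lower TTR means more repetitive wording, which the study
# observes for generations about less popular entities.

def type_token_ratio(text):
    """Distinct lowercase tokens divided by total tokens (0.0 if empty)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

print(type_token_ratio("the river flows through the valley"))
print(type_token_ratio("the river the river the river"))
```

The second, more repetitive sentence yields a lower ratio, matching the intuition that models fall back on generic, repeated phrasing when they know little about an entity.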
Despite its contributions, the research has several limitations. Experiments were conducted on only three LLMs and two languages, limiting generalizability. Factuality evaluations for non-English languages are less reliable due to scarce high-quality reference data and lower LLM performance in these languages. The dataset excludes Wikipedia stubs and short articles, modestly biasing it toward popular entities and underrepresenting the long tail. Using English Wikipedia pageviews as a popularity signal may introduce bias, and reliance on Wikipedia as a primary source limits evidence coverage. Future work should address these issues by improving multilingual evidence collection, refining ambiguity resolution, and incorporating more diverse knowledge sources.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.