Assessing AI Systems Without Ground Truth Is Now Possible

TL;DR

A new framework measures AI risk by comparing systems to each other, skipping elusive correct answers and enabling safer use in critical applications.

As artificial intelligence systems increasingly handle critical tasks from healthcare to business operations, organizations face a fundamental challenge: how to evaluate whether a new AI system is safer than what it replaces when there's no definitive 'right answer' to measure against. A new framework called MARIA (MArginal RIsk Assessment without Ground Truth) offers a practical solution by shifting focus from absolute performance to relative risk comparison with existing systems.

The key finding is that AI systems can be evaluated for safety and reliability by comparing their behavior to established baselines rather than chasing unattainable ground truth. Researchers developed three complementary assessment methods that measure how an AI system changes risk levels when introduced into existing workflows. This approach acknowledges that many real-world systems, from document review to healthcare decisions, operate without clear 'correct' outcomes, which makes traditional evaluation methods impractical.

The methodology employs three types of proxy metrics that don't require ground truth labels. Predictability measures examine whether the system behaves consistently under repetition and controlled perturbations, using tests like self-consistency across multiple runs and input stability with paraphrased versions of the same content. Capability assessment evaluates the underlying AI model's competence through standardized benchmarks and operational efficiency metrics. Dominance evaluation uses structured interactions like persuasion duels and prediction games to test behavioral robustness and adaptability.
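To make the predictability idea concrete, here is a minimal Python sketch of the two tests named above: self-consistency across repeated runs and input stability under paraphrase. It is an illustration under stated assumptions, not the framework's actual implementation. The function names, the pairwise-agreement formula, and the `tolerance` parameter are all hypothetical choices made for this example.

```python
from itertools import combinations
from statistics import mean
from typing import Callable, Sequence


def self_consistency(scores: Sequence[float], tolerance: float = 0.0) -> float:
    """Fraction of run pairs whose scores agree within `tolerance`.

    `scores` holds the outputs of repeated runs on the same input.
    Pairwise agreement is an assumed metric chosen for illustration.
    """
    pairs = list(combinations(scores, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return mean(1.0 if abs(a - b) <= tolerance else 0.0 for a, b in pairs)


def input_stability(
    score_fn: Callable[[str], float],
    original: str,
    paraphrases: Sequence[str],
) -> float:
    """Mean absolute score shift between an input and its paraphrases.

    `score_fn` is a hypothetical stand-in for the system under test.
    Lower values mean the system is more stable under rewording.
    """
    if not paraphrases:
        raise ValueError("need at least one paraphrase")
    base = score_fn(original)
    return mean(abs(score_fn(p) - base) for p in paraphrases)
```

Neither function needs a ground truth label: both compare the system only against its own outputs. The same two numbers can be computed for the existing baseline (for example, human reviewers scoring the same items twice), so the comparison stays relative.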

In a case study evaluating AI-assisted document review, the framework revealed important insights. The AI system showed high internal consistency, with self-consistency scores comparable to those of human reviewers, but its scoring behavior differed enough that it was not interchangeable with human evaluators. When integrated into human review workflows, the AI system changed how often a third reviewer had to be triggered to resolve discrepancies, indicating a shift in process stability. The system also showed mild distributional differences that could advantage certain applicant groups, highlighting potential fairness concerns that traditional evaluation might miss.
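The third-reviewer observation can be expressed as a marginal comparison between two workflows. The Python sketch below assumes a rule the article does not spell out: a third reviewer is triggered whenever two reviewers' scores differ by more than a fixed threshold. That rule, the function names, and the `threshold` parameter are hypothetical and serve only to illustrate the comparison.

```python
from typing import Sequence, Tuple

ScorePair = Tuple[float, float]


def trigger_rate(pairs: Sequence[ScorePair], threshold: float) -> float:
    """Share of items whose two scores differ by more than `threshold`.

    Assumed rule for illustration: such a gap triggers a third reviewer.
    """
    if not pairs:
        raise ValueError("need at least one scored item")
    triggered = sum(1 for a, b in pairs if abs(a - b) > threshold)
    return triggered / len(pairs)


def marginal_trigger_change(
    baseline_pairs: Sequence[ScorePair],
    candidate_pairs: Sequence[ScorePair],
    threshold: float,
) -> float:
    """Candidate workflow's trigger rate minus the baseline's.

    Positive: the candidate (e.g. human + AI) escalates more often.
    Negative: it escalates less often than the baseline (e.g. human + human).
    """
    return trigger_rate(candidate_pairs, threshold) - trigger_rate(
        baseline_pairs, threshold
    )
```

The result says nothing about which reviewer was right. It reports only whether introducing the AI system raises or lowers the escalation load relative to current practice, which is the kind of relative question the framework is built around.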

The framework matters because it enables organizations to make informed decisions about AI deployment in real-world scenarios where ground truth is unavailable, delayed, or too expensive to obtain. For government agencies reviewing grant applications, healthcare systems making treatment decisions, or businesses evaluating vendors, this approach provides actionable guidance about whether adopting AI increases, decreases, or maintains risk levels compared to current practices. It shifts the conversation from whether an AI system is 'correct' to how it changes the risk profile of existing operations.

The framework has limitations. It depends on the assumption that existing baselines represent acceptable risk levels, which may not always hold. It also cannot fully rule out gaming of the evaluation metrics, nor does it address every potential bias in automated scoring systems. As with any evaluation method, results remain context-dependent and require careful interpretation based on the specific application domain and risk priorities.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
