AIResearch
Ethics

AI Learns When to Stop Accepting Risk

New research reveals optimal decision policies for AI systems facing selective data gaps, with implications for lending, hiring, and criminal justice applications.

AI Research
November 14, 2025
3 min read

When artificial intelligence systems make high-stakes decisions about loans, job candidates, or bail applications, they face a fundamental challenge: they only observe outcomes for the people they accept, creating blind spots that can lock in suboptimal policies. A new study from IBM Research derives and proves the optimal decision strategies for these selective label scenarios, offering a clearer path for AI systems to balance learning with immediate rewards.

The researchers discovered that the best policy depends critically on whether the system prioritizes long-term learning or immediate performance. For AI systems maximizing discounted total reward—where future gains are weighted less heavily—the optimal approach is a threshold policy that becomes increasingly strict as more data is collected. This means the system starts with lenient acceptance criteria but gradually raises the bar, eventually settling on the theoretically optimal threshold once it has sufficient information. In contrast, for systems focused on undiscounted average reward over infinite horizons, the optimal policy always accepts individuals with positive probability, maintaining constant exploration regardless of accumulated knowledge.

The methodology treats the selective labels problem as a partially observable Markov decision process (POMDP), in which the AI system maintains a belief state about the population's success probability. This belief takes the form of a beta distribution that updates with each observed outcome, allowing the system to track its uncertainty about the true success rate. The researchers derive dynamic programming recursions that specify the optimal acceptance probability at each belief state, showing that for discounted reward maximization the policy reduces to a simple threshold rule: accept only when the expected value of continuing exceeds that of stopping.
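The beta-distribution belief update described above is standard Bayesian bookkeeping and can be sketched directly; the function names below are illustrative, not from the paper.

```python
# The belief about the population success rate is Beta(alpha, beta).
# Each observed outcome of an accepted individual shifts one parameter.

def update_belief(alpha: float, beta: float, success: bool) -> tuple[float, float]:
    """Bayesian update of a Beta(alpha, beta) belief after one observed outcome."""
    return (alpha + 1, beta) if success else (alpha, beta + 1)

def expected_success_rate(alpha: float, beta: float) -> float:
    """Posterior mean of the success probability under Beta(alpha, beta)."""
    return alpha / (alpha + beta)

# Starting from a uniform prior Beta(1, 1), observe two successes and a failure:
a, b = 1.0, 1.0
for outcome in (True, True, False):
    a, b = update_belief(a, b, outcome)
print(expected_success_rate(a, b))  # Beta(3, 2) has mean 3/5 = 0.6
```

Because only accepted individuals generate outcomes, rejected individuals leave the belief unchanged, which is exactly the selective-labels blind spot.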

The analysis revealed concrete mathematical properties of these optimal policies. The value function—representing expected future rewards—was proven to be non-decreasing and convex in the estimated success probability, while the optimal threshold becomes more stringent as the number of observations increases. Computational experiments with parameters c=0.8 (cost of failure relative to success reward) and discount factors γ=0.99 and 0.95 demonstrated that the policy approaches the theoretical optimum as data accumulates, with the approximation requiring only modest computational resources (less than one second on a standard laptop for 1000 iterations).
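The recursion behind these experiments can be approximated with a short finite-horizon dynamic program over the beta-belief state. The reward parameters c = 0.8 and γ = 0.99 come from the paper's experiments; the horizon truncation, the treatment of rejection as stopping with value zero, and the function names are assumptions of this sketch.

```python
from functools import lru_cache

C = 0.8        # failure cost relative to the success reward (paper's c)
GAMMA = 0.99   # discount factor (one of the paper's settings)
HORIZON = 200  # finite truncation of the infinite-horizon recursion (assumed)

@lru_cache(maxsize=None)
def value(alpha: int, beta: int, t: int) -> float:
    """Approximate value of holding belief Beta(alpha, beta) with t stages left.

    Accepting yields +1 on success, -C on failure, and updates the belief;
    rejecting is modeled here as stopping with value 0.
    """
    if t == 0:
        return 0.0
    p = alpha / (alpha + beta)  # posterior mean success probability
    accept = (p * (1.0 + GAMMA * value(alpha + 1, beta, t - 1))
              + (1 - p) * (-C + GAMMA * value(alpha, beta + 1, t - 1)))
    return max(0.0, accept)

def should_accept(alpha: int, beta: int) -> bool:
    """Threshold rule: accept exactly when the value of continuing is positive."""
    return value(alpha, beta, HORIZON) > 0.0
```

Under a uniform prior Beta(1, 1) the posterior mean 0.5 exceeds the myopic break-even c/(1+c) ≈ 0.44, so the sketch accepts; after many observed failures (say Beta(1, 100)) the value of continuing collapses and it stops, mirroring the proven threshold behavior.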

These findings matter because they provide principled guidance for AI systems making real-world decisions with incomplete information. In lending, this could help financial institutions optimize loan approval rates while learning about repayment probabilities. For hiring managers, it suggests how to balance the need to evaluate candidates against the risk of poor hires. In criminal justice, it offers insights into bail decisions where outcomes are only observed for released defendants. The threshold policy approach avoids the ethical concerns of stochastic decision-making while still enabling effective learning.

The research acknowledges several limitations. The current analysis assumes a homogeneous population without distinguishing features, leaving open how to extend these results to more realistic scenarios with multiple subgroups. The approximation error for finite computational horizons remains unquantified, and the undiscounted average reward case provides limited guidance on specific policy selection beyond the requirement for positive acceptance probabilities. Future work will need to address how these optimal policies translate to settings with fairness constraints across different demographic groups.

Original Source

Read the complete research paper on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn