In high-stakes fields such as aviation, automotive, and nuclear systems, safety assurance cases are critical for demonstrating that a system is safe to operate. These structured arguments link evidence to safety claims, but a persistent challenge has been quantifying how much confidence we should place in them. Too often, confidence assessments rely on informal judgments or overly complex probabilistic models that can mislead decision-makers. A new methodology, detailed in a technical report from SRI International, offers a simpler, systematic approach to measuring probabilistic confidence in assurance cases, specifically within the Assurance 2.0 framework. This development is significant because it gives engineers and regulators a tool for making more rational trade-offs between the cost of gathering evidence and the level of confidence needed for different risk levels, without falling into the trap of false precision that has plagued past efforts.
The researchers, Robin Bloomfield and John Rushby, propose a methodology that calculates confidence compositionally, moving from the bottom of an assurance argument, where evidence and assumptions reside, up to the top claim. Unlike previous approaches that used elementary calculations such as the product of confidences or the sum of doubts, which often produced overly conservative or insensitive results, the new methodology tailors the calculation to the type of reasoning used in each argument step. For example, in a decomposition block where subclaims address different concerns, it distinguishes between diversity arguments, where subclaims eliminate concerns in independent ways, and partitioned arguments, where subclaims cover disjoint sets of concerns. This allows for more accurate and meaningful confidence estimates that reflect the actual structure of the argument.
The methodology builds on the Assurance 2.0 framework, which separates logical assessment, dialectical examination, and quantitative confidence evaluation. Confidence is expressed as a subjective probability, and the methodology uses only elementary probabilistic constructions, such as Fréchet bounds, adapted specifically for assurance arguments. For evidence nodes, confidence is estimated as the conditional probability of a useful claim given the measured evidence, multiplied by confidence in any sideclaims. For reasoning steps with multiple subclaims, different formulas apply based on the argument type: in diversity cases, confidence is calculated using the product of doubts, while in partitioned cases, it uses a weighted average. The methodology also handles more complex scenarios, such as cumulative arguments where subclaims build on each other, by applying conditional probability chains or Bayesian Belief Networks for intricate dependencies.
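The compositional calculations described above can be sketched in a few lines of Python. This is an illustrative reading of the report, not code from it: the function names and API are hypothetical, and only the formulas themselves (sideclaim discounting for evidence nodes, product of doubts for diversity steps, weighted averaging for partitioned steps) follow the paper's description.

```python
# Hypothetical sketch of the compositional confidence calculations
# described in the text. All names are illustrative; the formulas
# follow the paper's account of evidence, diversity, and partitioned
# argument steps.

def evidence_confidence(p_claim_given_evidence, sideclaim_confidences=()):
    """Evidence node: conditional probability of the claim given the
    measured evidence, discounted by confidence in any sideclaims."""
    conf = p_claim_given_evidence
    for s in sideclaim_confidences:
        conf *= s
    return conf

def diversity_confidence(subclaim_confidences):
    """Diversity step: subclaims eliminate concerns independently,
    so doubts (1 - confidence) multiply."""
    doubt = 1.0
    for c in subclaim_confidences:
        doubt *= (1.0 - c)
    return 1.0 - doubt

def partitioned_confidence(subclaim_confidences, weights):
    """Partitioned step: subclaims cover disjoint sets of concerns,
    so overall confidence is a weighted average."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * c for w, c in zip(weights, subclaim_confidences))
```

Because each function maps subclaim confidences to a parent-claim confidence, they can be nested to propagate confidence from the leaves of an argument tree up to the top claim, mirroring the bottom-up composition the methodology prescribes.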
Examples in the paper show that the methodology produces plausible confidence values that align with intuitive expectations. For instance, in a diversity argument combining testing and static analysis with 95% and 90% confidence respectively, it yields 99.5% confidence in the parent claim when independence is assumed, dropping to about 89.5% if confidence in the sideclaim about diversity is only 90%. In partitioned arguments, such as one addressing three hazards with weights of 60%, 30%, and 10% and subclaim confidences of 90%, 95%, and 80%, it calculates 90.5% confidence before adjusting for sideclaims. The paper also demonstrates how the methodology avoids counterexamples identified by critics such as Graydon and Holloway, for example cases where many subclaims overwhelm a weak one, by ensuring logical soundness is assessed separately before quantitative analysis.
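The worked numbers above follow from elementary arithmetic, which a short script can reproduce. This is only a check of the figures quoted in this summary under the stated independence assumption, not the paper's own calculation code.

```python
# Reproducing the worked examples quoted above.

# Diversity argument: testing (95%) and static analysis (90%),
# assumed independent, so doubts multiply.
diversity = 1.0 - (1.0 - 0.95) * (1.0 - 0.90)
print(round(diversity, 4))        # 0.995 -> 99.5%

# Discounting by 90% confidence in the diversity sideclaim.
with_sideclaim = diversity * 0.90
print(round(with_sideclaim, 4))   # 0.8955 -> ~89.5%

# Partitioned argument: three hazards weighted 60/30/10 with
# subclaim confidences of 90%, 95%, and 80%: a weighted average.
partitioned = 0.6 * 0.90 + 0.3 * 0.95 + 0.1 * 0.80
print(round(partitioned, 4))      # 0.905 -> 90.5%
```

Note how the diversity formula rewards independent subclaims (99.5% exceeds either input), while the partitioned formula cannot exceed the best-covered hazard's confidence, which matches the intuition behind each argument type.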
The implications of this work are practical for industries where safety is paramount. By providing a systematic way to quantify confidence, the methodology helps in evaluating cost-confidence trade-offs for different risk levels, such as those defined by Design Assurance Levels in aviation or Safety Integrity Levels in automotive standards. It enables engineers to identify weak points in an argument, compare alternative assurance strategies, and ensure a balanced distribution of confidence across the entire case. Moreover, it supports graduated assurance approaches, where lower-risk items may require less rigorous evidence, by offering a numerical basis for such decisions. This can lead to more efficient resource allocation while maintaining safety standards.
However, the methodology has limitations, as noted in the paper. It relies on subjective probability estimates, which can vary among assessors, and its accuracy depends on correctly identifying the type of argument step and the independence or diversity of subclaims. In cases with complex dependencies among subclaims, the basic formulas may not suffice, necessitating the use of Bayesian Belief Networks, which require additional expertise and modeling effort. Additionally, the methodology assumes that residual concerns have been addressed and classified as negligible or manageable before quantitative assessment, which may not always be straightforward in practice. Despite these limitations, the researchers argue that their approach, by complementing rather than replacing logical and dialectical assessments, offers a valuable tool for enhancing the rigor and transparency of safety assurance in critical systems.