New AI Method Makes Network Failures Easy to Read

TL;DR

A new quorum design separates network tiers so operators can instantly see which parts stay functional during outages, no complex probing needed.

In distributed systems such as cloud networks or interplanetary communication links, it can be hard to tell which parts of the system are still operational after a failure. Traditional consensus protocols often treat all nodes as interchangeable, which masks whether a failure stems from an entire network tier being unreachable or from a few nodes crashing within a tier. A new approach, detailed in a recent paper, addresses this by mapping a quorum construction called a crumbling wall onto physically tiered networks, making failure modes legible. Operators can then identify which tiers retain global consensus capability during outages, such as Mars conjunction blackouts in space networks or planned maintenance in terrestrial systems, without runtime probing. The design separates inter-tier obligation, which ensures safety across network layers, from intra-tier replication, which handles local durability; flat quorums conflate the two.

The key finding is that this topology-aware quorum design lets three out of four tiers maintain global consensus during hard blackouts, as demonstrated in a 10-node topology spanning Earth, Low Earth Orbit (LEO), the Moon, and Mars. Specifically, during a Mars conjunction blackout in which Mars is disconnected, the Earth, LEO, and Moon tiers retain 100% success rates for global Phase 1 consensus, while only Mars loses it. The researchers confirmed this using a discrete-event simulator called Eidolon, with results showing that liveness failures crumble from the top of the wall, meaning the disconnected tier is the only one affected. Consensus latencies at each tier align with physics: Earth averages 183 ms, LEO 131 ms, and Moon 5.1 seconds, reflecting the speed-of-light round-trip times to Earth. This legibility property lets an operator check which tiers are functional with a simple O(tiers) procedure based on the wall structure and connectivity state, rather than by enumerating quorum subsets.
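To make the O(tiers) check concrete, here is a minimal Python sketch. It is not taken from the paper: the function name `functional_tiers`, the per-tier `reachable` flags, and the rule that a tier is functional only when it and every tier below it are reachable are assumptions based on the description above.

```python
from typing import Dict, List

# Rows of the wall from bottom (fastest) to top (slowest), as described in the article.
WALL: List[str] = ["Earth", "LEO", "Moon", "Mars"]


def functional_tiers(wall: List[str], reachable: Dict[str, bool]) -> Dict[str, bool]:
    """Return, for each tier, whether it can still run global Phase 1.

    Assumption (hypothetical model): a tier is functional only if it and
    every tier below it in the wall are reachable. One pass over the rows
    is enough, so the cost is O(number of tiers).
    """
    status: Dict[str, bool] = {}
    chain_intact = True  # True while every row seen so far is reachable
    for tier in wall:
        chain_intact = chain_intact and reachable.get(tier, False)
        status[tier] = chain_intact
    return status


# Mars conjunction blackout: only the top row of the wall is cut off.
blackout = {"Earth": True, "LEO": True, "Moon": True, "Mars": False}
print(functional_tiers(WALL, blackout))
```

Under this model the Mars blackout leaves Earth, LEO, and Moon functional, which matches the result reported above. A break in a middle row would also take down every row above it.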

The methodology composes two existing ideas: Flexible Paxos, which decouples Phase 1 and Phase 2 quorums for consensus safety, and crumbling-wall quorum systems, which use asymmetric structures for high availability. The researchers mapped the rows of a crumbling wall to physical latency tiers, with Earth as the bottom row (fastest) and Mars as the top (slowest). For each tier, a Phase 1 quorum requires at least one node from that tier and from every tier below it, which guarantees intersection with Phase 2 quorums anchored at Earth. This construction was verified exhaustively using TLA+ model checking over the full 10-node topology, with no intersection failures found. Experiments simulated various scenarios, including Mars blackouts and sparse network topologies, with Mars delays ranging from 186 to 1342 seconds and blackout durations up to 1800 seconds, using 50 seeds per data point for statistical reliability.
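The quorum rule can be illustrated with a small brute-force check in Python. This is a sketch, not the paper's TLA+ specification. The article confirms five Earth nodes and ten nodes in total, but the split of the other five across LEO, Moon, and Mars below is a hypothetical assumption, as are the node names. Phase 2 is modeled as all five Earth nodes.

```python
from itertools import combinations

TIERS = ["Earth", "LEO", "Moon", "Mars"]  # bottom row to top row

# Hypothetical node layout: only "five Earth nodes, ten in total" is from the article.
NODES = {
    "Earth": ["e1", "e2", "e3", "e4", "e5"],
    "LEO": ["l1", "l2"],
    "Moon": ["m1", "m2"],
    "Mars": ["r1"],
}

# Phase 2 quorum anchored at Earth (modeled here as all five Earth nodes).
PHASE2_QUORUM = frozenset(NODES["Earth"])


def is_phase1_quorum(subset, tier):
    """True if subset has at least one node from `tier` and from every tier below it."""
    rows = TIERS[: TIERS.index(tier) + 1]
    return all(any(node in subset for node in NODES[row]) for row in rows)


def check_intersection():
    """Enumerate every node subset and collect Phase 1 quorums that miss Phase 2."""
    all_nodes = [node for tier in TIERS for node in NODES[tier]]
    failures = []
    for size in range(1, len(all_nodes) + 1):
        for combo in combinations(all_nodes, size):
            subset = frozenset(combo)
            for tier in TIERS:
                if is_phase1_quorum(subset, tier) and not (subset & PHASE2_QUORUM):
                    failures.append((tier, subset))
    return failures


print(len(check_intersection()))  # number of intersection failures found
```

Every Phase 1 quorum in this model contains at least one Earth node, so it always overlaps the Earth-anchored Phase 2 quorum and the enumeration finds no failures.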

Analysis reveals that quorum geometry alone changes outcomes: a flat quorum construction requiring all tiers yields 0% success during the blackout, while the crumbling wall achieves 100% for Earth-initiated consensus. In a sparse topology where LEO has links to only 3 of 5 Earth ground stations, LEO drops to 0% success during the blackout, demonstrating that wall obligations and network reachability are independent constraints that compose to determine liveness. The paper also explores crash tolerance by relaxing Phase 2 requirements from all five Earth nodes to k-of-five, showing that with two Earth crashes, the global construction maintains 98% success, whereas a standard Earth-local quorum drops below 50%. Additionally, the wall imposes a leadership cost gradient: Earth has 4.6 times more valid Phase 1 quorums than Mars, which gives Earth leaders more survivable crash patterns in Multi-Paxos elections. Symmetric grid quorums cannot express this feature.
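The k-of-five relaxation has a cost that follows from quorum intersection. The article does not say how the paper adjusts Phase 1, so the Python sketch below shows only the standard Flexible Paxos condition: if Phase 2 accepts any k of n Earth nodes, a Phase 1 quorum needs at least n - k + 1 Earth nodes to overlap every Phase 2 quorum. The helper names and node labels are hypothetical.

```python
from itertools import combinations


def min_phase1_earth(n_earth: int, k: int) -> int:
    """Fewest Earth nodes a Phase 1 quorum needs to overlap every k-of-n
    Phase 2 quorum (pigeonhole argument: n - k + 1)."""
    if not 1 <= k <= n_earth:
        raise ValueError("k must be between 1 and n_earth")
    return n_earth - k + 1


def intersects_all(phase1_earth: frozenset, earth: list, k: int) -> bool:
    """Brute-force check that phase1_earth overlaps every size-k Earth subset."""
    return all(phase1_earth & frozenset(q) for q in combinations(earth, k))


EARTH = ["e1", "e2", "e3", "e4", "e5"]  # hypothetical node names

for k in range(1, 6):
    need = min_phase1_earth(len(EARTH), k)
    print(f"Phase 2 = {k}-of-5 -> Phase 1 needs at least {need} Earth node(s)")
```

With k = 5, the all-Earth Phase 2 described earlier, a single Earth node suffices in Phase 1. Lowering k lets Phase 2 tolerate 5 - k Earth crashes but raises the Earth obligation in Phase 1 accordingly.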

The implications extend beyond interplanetary networks to terrestrial applications such as edge-cloud systems, wherever structured latency asymmetry exists. For example, in a 3-tier deployment with cloud, metro edge, and remote sites, the wall predicts that during WAN maintenance that disconnects the remote site, only that site loses global consensus, while the cloud and edge tiers remain functional. This separation principle provides a formal foundation for practices already used informally in distributed systems, improving fault tolerance and operational clarity. Limitations include the design-level nature of the simulations, which abstract away real-world factors such as orbital dynamics and variable link quality, and the focus on crash-stop failures, which leaves Byzantine behavior and storage issues unaddressed. Future work could explore scoped write leases or empirical validation in terrestrial topologies to further refine the approach.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
