Distributed systems power everything from cloud computing to online banking, but they are notoriously fragile. When one component crashes, the entire application can fail. Researchers at the University of Utah have developed a new programming language, Chorex, that brings a novel solution to this decades-old problem: it automatically restarts failed actors and resets the system to a safe state, making distributed applications inherently more robust. This approach directly tackles one of the 'eight fallacies of distributed computing'—the mistaken belief that network topology doesn't change—by building fault tolerance directly into the programming model.
Chorex introduces a choreographic programming model to the Elixir language. Instead of writing separate, error-prone code for each component in a distributed system, developers write a single 'choreography' that describes the global interactions between all actors. The key innovation is the 'checkpoint/rescue' block. When programmers wrap a section of code in a checkpoint block, Chorex automatically saves the state of every actor. If any actor crashes—for example, due to a division by zero error—a runtime monitor detects the failure, spawns a new process, restores its state from the checkpoint, and instructs all actors to execute the alternative 'rescue' block. This allows the system to recover and continue without manual intervention, as demonstrated in a minimal example where 'Alice' crashes but is restarted to successfully exchange messages with 'Bob'.
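The checkpoint/rescue idea can be illustrated with a small simulation. The sketch below is written in Python rather than Elixir and is not Chorex syntax; the names `Actor`, `run_with_checkpoint`, `body`, and `rescue` are hypothetical. It also simplifies by rebuilding every actor from the snapshot, whereas the article describes the monitor restarting only the crashed actor.

```python
import copy

class Actor:
    """A toy actor: just a name and a dictionary of local state."""
    def __init__(self, name):
        self.name = name
        self.state = {}

def run_with_checkpoint(actors, body, rescue):
    """Run body(actors); if it raises, restore saved state and run rescue(actors)."""
    # Save a deep copy of every actor's state before entering the block.
    snapshot = {name: copy.deepcopy(a.state) for name, a in actors.items()}
    try:
        return body(actors)
    except Exception:
        # Simulate the restart: replace each actor with a fresh one
        # whose state comes from the checkpoint.
        for name in list(actors):
            fresh = Actor(name)
            fresh.state = copy.deepcopy(snapshot[name])
            actors[name] = fresh
        return rescue(actors)

def body(actors):
    actors["alice"].state["draft"] = "half-written"  # partial work before the crash
    actors["alice"].state["ratio"] = 1 / 0           # Alice crashes here
    actors["bob"].state["inbox"] = ["never sent"]    # never reached

def rescue(actors):
    # Alternative path: the restarted Alice sends a fallback message to Bob.
    actors["bob"].state["inbox"] = ["hello from restarted alice"]
    return "recovered"

actors = {"alice": Actor("alice"), "bob": Actor("bob")}
actors["alice"].state["counter"] = 7
result = run_with_checkpoint(actors, body, rescue)
```

After the run, Alice's state is back to what it was at the checkpoint (the half-written draft is gone) and Bob holds the message produced by the rescue path.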
The methodology relies on a sophisticated compilation strategy and runtime supervision. The Chorex compiler, implemented as an Elixir macro, translates choreographies into sets of stateless functions for each actor, targeting Elixir's GenServer behavior for message handling. This enables out-of-band communication for recovery messages. A runtime monitor process supervises all actors; if one crashes, the monitor restarts it, restores its checkpointed state (including control stack, variable bindings, and message inbox), and broadcasts the new actor's address to the others. The system uses CIV tokens, unique identifiers containing session data and source-code location, to ensure message integrity. Crucially, Chorex integrates with standard Elixir tooling, providing compile-time errors for mismatches and IDE autocomplete for required actor functions, as shown in Figure 1 of the paper.
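To give a feel for how a token that carries session data and a source location can protect message integrity, here is a minimal Python sketch. It is an illustration only: the `Token` layout, the `Mailbox` class, and the example session and location strings are assumptions, not the data structures Chorex actually uses.

```python
from collections import namedtuple

# Hypothetical token layout: a session identifier plus a source-code location.
Token = namedtuple("Token", ["session", "location"])

class Mailbox:
    """A toy inbox that hands out a message only when its token matches."""
    def __init__(self):
        self.pending = []

    def deliver(self, token, payload):
        self.pending.append((token, payload))

    def receive(self, expected):
        """Return the payload whose token matches; leave other messages queued."""
        for i, (token, payload) in enumerate(self.pending):
            if token == expected:
                del self.pending[i]
                return payload
        return None

box = Mailbox()
# A message from a different session arrives first and must not be consumed.
box.deliver(Token("session-2", "demo.ex:14"), "stale message")
box.deliver(Token("session-1", "demo.ex:14"), "price = 42")
got = box.receive(Token("session-1", "demo.ex:14"))
```

Because the receiver matches on the whole token, a message from another session or another point in the program stays queued instead of being mistaken for the one the actor is waiting on.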
Performance benchmarks reveal the overhead of this fault-tolerance mechanism. In a 'State Machine' test based on a TCP server, the checkpointing version ran with only 1.01x overhead compared to a version without checkpoints, and the crashing/recovery version had 1.04x overhead. More computationally intensive tests showed higher costs: a 'Mini Blockchain' hashing loop saw 6.76x overhead for checkpointing and 4.71x for recovery, while a recursive 'Nest-10k' microbenchmark had 3.48x and 1.96x overhead, respectively. The paper notes that deep recursive calls inside checkpoint blocks increase memory usage, but the monitor saves stack deltas to mitigate this. Compile times scale linearly with actor count, taking about 11 seconds for 100 actors.
The implications are significant for building reliable distributed software. Chorex has been used to implement real-world protocols such as Secure Remote Password (SRP) authentication and a TCP socket server, demonstrating practical utility. In the SRP example, the choreography clearly shows which values cross between client and server, making it easier to verify that secrets like passwords never get transmitted. For the TCP server, the top-level choreography can spawn multiple instances of an inner handler choreography for each client connection, with clean exit behavior if anything goes wrong. This approach reduces a classic source of bugs: ensuring that separately implemented components correctly follow a global protocol.
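Both benefits, deriving each side from one description and auditing what crosses the wire, can be shown with a toy model. In this Python sketch a choreography is reduced to a list of (sender, receiver, value) steps. The step names are loosely modelled on SRP and are illustrative only, and `project` and `leaks` are hypothetical helpers, not part of Chorex.

```python
# Each step is (sender, receiver, value_name): a toy stand-in for a choreography.
steps = [
    ("client", "server", "username"),
    ("server", "client", "salt"),
    ("client", "server", "public_A"),
    ("server", "client", "public_B"),
    ("client", "server", "proof_M1"),
]

def project(steps, actor):
    """Derive one actor's local sequence of sends and receives."""
    local = []
    for sender, receiver, value in steps:
        if sender == actor:
            local.append(("send", receiver, value))
        elif receiver == actor:
            local.append(("recv", sender, value))
    return local

def leaks(steps, secrets):
    """List any secret value names that appear in a transmission."""
    sent = {value for _sender, _receiver, value in steps}
    return sorted(sent & set(secrets))

client_view = project(steps, "client")
server_view = project(steps, "server")
found = leaks(steps, {"password", "private_a"})
```

Since both views come from the same list, every send on one side has a matching receive on the other, and checking that no secret is transmitted is a single scan over the global description.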
However, the research acknowledges limitations. Chorex currently requires programmers to manually specify which actors need 'knowledge-of-choice' notifications in conditional branches, though missing annotations trigger compile-time errors. The checkpoint/rescue mechanism does not save messages that arrive during a checkpoint block if a crash occurs, though the paper argues this is not problematic because all such messages originate from within the same block. Future work includes extending Elixir's emerging type system to check choreographies, implementing features like full out-of-order execution and census polymorphism (abstraction over participant count), and exploring verification techniques. The authors also note that Elixir's macro system made pattern-matching on syntax cumbersome, suggesting room for improvement in metaprogramming tools.
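The knowledge-of-choice requirement can be approximated by a simple check. The Python sketch below is a deliberately coarse model with hypothetical names (`actors_in`, `missing_notifications`): it treats any actor that takes part in either branch as needing a notification, which is stricter than the usual rule that only actors whose behavior differs between branches must be told.

```python
def actors_in(steps):
    """Collect every actor that sends or receives in a list of steps."""
    return {actor for sender, receiver, _value in steps for actor in (sender, receiver)}

def missing_notifications(decider, notified, then_steps, else_steps):
    """Return actors involved in a branch that were never told which branch ran."""
    involved = actors_in(then_steps) | actors_in(else_steps)
    return sorted(involved - {decider} - set(notified))

# The buyer decides; the two branches involve different participants.
then_steps = [("buyer", "seller", "address"), ("seller", "shipper", "order")]
else_steps = [("buyer", "seller", "cancel")]

missing = missing_notifications("buyer", ["seller"], then_steps, else_steps)
```

Here the shipper is reported as missing: it only acts in one branch, so without a notification it cannot know whether to wait for an order. A compile-time check in this spirit turns a forgotten annotation into an error rather than a hung process.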
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.