Formally-Guided Self-Healing Software: Guaranteeing Trust and Resilience in Mission-Critical Systems

I. Executive Summary

Safety-critical systems, such as those used in aerospace, maritime, and defense domains, demand exceptional reliability and robustness. For instance, the Ariane 5 rocket failure in 1996 due to a software bug led to a catastrophic explosion shortly after launch, costing millions of dollars. Similarly, the Therac-25 radiation therapy machine's software defects in the 1980s resulted in multiple patient fatalities. Such examples underscore the dire consequences of undetected software issues in these domains. Software defects in such systems can lead to catastrophic failures, including loss of life, mission failure, and significant financial repercussions. This white paper proposes a novel research initiative focused on developing automated program repair techniques tailored for safety-critical systems. By leveraging advanced formal verification, refinement types, runtime monitoring, self-healing mechanisms, and human-in-the-loop feedback, this research uniquely addresses the intricate challenges in safety-critical systems. These approaches ensure that repairs are not only precise and compliant with strict operational and safety standards but also adaptive to dynamic and adversarial environments. Unlike traditional methods reliant on extensive manual intervention, this framework integrates conflict-driven learning, active learning, and automated reasoning to provide scalable, robust solutions tailored to the unique demands of safety-critical environments.

II. Background and Motivation

Modern safety-critical systems are characterized by increasing software complexity, stringent performance requirements, and the necessity for real-time operation. Despite rigorous testing and validation processes, latent software bugs persist due to the undecidability nature of formal verification and challenges in fully exploring dynamic operational conditions. For example, in 2015, a latent bug in the autopilot software of a commercial aircraft led to incorrect altitude adjustments, requiring immediate pilot intervention to avoid a potential collision. Such scenarios highlight the risks inherent in undetected software defects.

High system complexity: Complex interactions among software components make comprehensive testing challenging, leading to vulnerabilities in integration points and overlooked edge cases.
Dynamic environments: Safety-critical systems often operate in unpredictable conditions, exposing previously untested code paths and increasing the likelihood of latent bugs triggering during deployment.
Legacy code bases: Many systems rely on outdated or poorly documented code, complicating maintenance and increasing the risk of introducing new defects during updates.

Traditional debugging and repair processes are manual, time-intensive, and error-prone, exacerbating delays and escalating costs. The introduction of automated program repair mechanisms that leverage refinement-type-driven specifications, runtime monitoring, self-healing, and human-in-the-loop feedback offers a transformative approach. By enabling scalable, precise, and adaptive solutions tailored to safety-critical environments, these mechanisms significantly mitigate the risks posed by software defects and operational uncertainties.

III. Proposed Approach

The overall architecture of the proposed system.

This initiative will explore a comprehensive strategy for automated program repair in safety-critical systems, integrating refinement-type-driven specifications, conflict-driven self-healing mechanisms, and human-in-the-loop enhancements tailored for real-world operational challenges:

Model-Based Specification & Formal Verification.

We begin by creating a rigorous, model-based description of the UAV’s software components, such as flight control logic, sensor processing, and communication protocols. Each component will be expressed using refinement types, which enhance type systems by encoding rich logical constraints. Refinement types are particularly suitable for specifying and enforcing properties such as permissible pitch, roll, and altitude ranges or ensuring that only authenticated commands can alter UAV trajectories. These constraints can be statically checked during design time, offering guarantees about software behavior. For example, flight envelope constraints are expressed as refinements of numeric types, ensuring compliance with safety-critical properties at both compile-time and runtime.

By using refinement types, we can instantiate and verify critical properties with precision and scalability. For instance, constraints like "altitude > 0" or "pitch within [-30, 30] degrees" are encoded directly within the type system, ensuring their enforcement throughout the UAV’s lifecycle. These refinement types form the basis for formal verification, where tools such as SMT solvers and theorem provers validate functional correctness and security guarantees. By structuring the software as a set of verifiable contracts defined via refinement types, we preclude common error classes (e.g., buffer overflows or invalid sensor reads) and streamline certification efforts.

Example 1: Flight Envelope Constraints for UAVs

For a tactical autonomous UAV, violating flight envelope constraints (e.g., exceeding maximum roll or pitch angles) can lead to loss of control or structural failure. Using refinement types, we ensure:

The UAV never exceeds ±60° roll angle at low altitude.
The UAV never descends below a safe altitude threshold.

Using Liquid Haskell (a refinement-type-based verification tool), we express these properties:

type Altitude = { v: Double | v > 100.0 }  -- UAV must always maintain above 100 meters
type RollAngle = { v: Double | v >= -60.0 && v <= 60.0 }  -- Roll angle constraint

data UAV = UAV {
    altitude :: Altitude,
    roll :: RollAngle
}