Chapter 7. Principled failure handling


You have seen that resilience requires distributing and compartmentalizing systems. Distribution is the only way to avoid being knocked out by a single failure, whether its cause is hardware, software, or human; compartmentalization isolates the distributed units from each other so that the failure of one does not spread to the others. The conclusion was that, to restore proper function after a failure, you need to delegate the responsibility for reacting to it to a supervisor.

The importance of ownership already appeared in the decomposition of a system according to divide et regna, where it was expressed as the difference between a descendant module and a dependency. Descendants own a piece of the parent’s functionality, whereas foreign functions are incorporated only by reference. The resulting ownership hierarchy gives the supervision structure for the modules.
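
The distinction between a descendant and a dependency can be illustrated with a minimal sketch. All names here (Worker, Supervisor, process, and the list used as a log) are hypothetical and chosen only for illustration; this is not the API of any particular framework, and replacing the failed descendant with a fresh instance is just one possible reaction a supervisor might choose.

```python
class Worker:
    """A descendant module: it implements a piece of its parent's functionality."""

    def handle(self, value):
        # Assumption for illustration: negative input stands in for any failure.
        if value < 0:
            raise ValueError(f"cannot process {value}")
        return value * 2


class Supervisor:
    """A parent module that owns its descendant and reacts to its failures."""

    def __init__(self, log):
        # Dependency: foreign functionality, incorporated only by reference.
        # The supervisor neither creates nor replaces it.
        self.log = log
        # Descendant: created here, so the supervisor owns it.
        self.worker = Worker()

    def process(self, value):
        try:
            return self.worker.handle(value)
        except Exception as failure:
            # The failure is contained here instead of spreading further.
            self.log.append(f"worker failed: {failure}")
            # Ownership allows the parent to replace the failed descendant.
            self.worker = Worker()
            return None
```

In this sketch the caller of process never sees the exception: the supervisor records the failure through its dependency, replaces the descendant it owns, and continues to serve later requests.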

7.1. Ownership means commitment

7.2. Ownership implies lifecycle control

7.3. Resilience on all levels

7.4. Summary