Chapter 12. Fault tolerance and recovery patterns


In this chapter, you will learn how to incorporate the possibility of failure into the design of your application. We will demonstrate the patterns on the concrete use case of building a resilient computation engine that allows batch job submissions and their execution on elastically provisioned hardware resources. We will build on what you learned in chapters 6 and 7, so you may want to revisit them.

We will start by considering a single component and its failure and recovery strategies, and then build up more complex systems by means of hierarchical composition as well as client–server relationships. In particular, we will discuss the following patterns:

  • The Simple Component pattern (a.k.a. the single responsibility principle)
  • The Error Kernel pattern
  • The Let-It-Crash pattern
  • The Circuit Breaker pattern

12.1. The Simple Component pattern

A component shall do only one thing, but do it in full.

This pattern applies wherever a system performs multiple functions, or where the functions it performs are so complex that they need to be broken into separate components. An example is a text editor that includes spell checking. These are two separate functions: editing can be done without spell checking, and spelling can be checked on the finished text without any editing capabilities. At the same time, neither function is trivial, so each is substantial enough to be a component in its own right.
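
To make the separation concrete, here is a minimal sketch of the text editor example, written in Python purely for illustration. The names TextBuffer and SpellChecker, along with their methods, are hypothetical and chosen for this sketch only. The aim is to show that each component does exactly one thing and that neither needs to know how the other works.

```python
import re


class TextBuffer:
    """Editing only: holds text and applies insertions and deletions."""

    def __init__(self, text: str = "") -> None:
        self._text = text

    def insert(self, position: int, fragment: str) -> None:
        # Splice the fragment into the text at the given character offset.
        self._text = self._text[:position] + fragment + self._text[position:]

    def delete(self, start: int, end: int) -> None:
        # Remove the characters in the half-open range [start, end).
        self._text = self._text[:start] + self._text[end:]

    def text(self) -> str:
        return self._text


class SpellChecker:
    """Spell checking only: works on any string and needs no editor."""

    def __init__(self, dictionary: set[str]) -> None:
        # Assumption for this sketch: a word is correct if it appears
        # in the supplied dictionary, compared case-insensitively.
        self._dictionary = {word.lower() for word in dictionary}

    def misspelled(self, text: str) -> list[str]:
        words = re.findall(r"[A-Za-z']+", text)
        return [w for w in words if w.lower() not in self._dictionary]


# The two components are combined only at the point of use.
buffer = TextBuffer("helo world")
checker = SpellChecker({"hello", "world"})

before = checker.misspelled(buffer.text())  # check the unedited text
buffer.insert(3, "l")                       # edit without involving the checker
after = checker.misspelled(buffer.text())   # check the finished text
```

In this sketch the only contact point between the two components is a plain string, so either one could be replaced, tested, or restarted without touching the other.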

12.2. The Error Kernel pattern

12.3. The Let-It-Crash pattern

12.4. The Circuit Breaker pattern

12.5. Summary