This chapter covers
- Failure and failure tolerance
- Failure detection and mitigation
- Application- and platform-level failures
- Transient, intermittent, and permanent failures
- An ideal failure-handling strategy
Now that we have defined the notion of system models, discussed widely used ones such as synchronous and asynchronous distributed systems, and explored the concepts of order, physical time, and logical time, we can turn to failure, failure tolerance, and failure handling: in short, ways to think about failure. While reading this chapter, keep in mind that the primary objective of thinking about failure is to ensure failure tolerance, that is, the guarantee that a distributed system functions in a well-defined manner even when failures occur.
The topic of failure, failure tolerance, and failure handling in distributed computing is broad, encompassing a significant body of theoretical and practical work. To provide a well-rounded perspective, this chapter is divided into two main sections: the first explores thinking about failure in theoretical terms; the second explores thinking about failure in practical terms.
3.1 In theory
Informally, a failure is an unwanted but possible state transition of a system. On failure, the system transitions from a good state to a bad state. Failure tolerance is the ability of a system to behave in a well-defined manner when the system is in a bad state.
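To make the informal definition concrete, the following is a minimal sketch of a toy system in Python. Everything in it is a hypothetical illustration rather than part of any real library: the Counter class, its invariant (the value is never negative), the corrupt method that stands in for a failure, and the policy of refusing reads in a bad state are all assumptions chosen for the example.

```python
from enum import Enum


class State(Enum):
    GOOD = "good"
    BAD = "bad"


class Counter:
    """Hypothetical toy system: a counter whose value must never be negative."""

    def __init__(self) -> None:
        self.value = 0

    def state(self) -> State:
        # The system is in a good state exactly when the invariant holds.
        return State.GOOD if self.value >= 0 else State.BAD

    def increment(self) -> None:
        # Wanted transition: a good state leads to another good state.
        self.value += 1

    def corrupt(self) -> None:
        # Failure: an unwanted but possible transition into a bad state.
        self.value = -1

    def read(self) -> int:
        # Failure tolerance: behavior in the bad state is still well defined.
        # The assumed policy here is to refuse the read rather than
        # return a corrupted value.
        if self.state() is State.BAD:
            raise RuntimeError("counter is in a bad state")
        return self.value
```

In this sketch, the failure is the transition performed by corrupt, and failure tolerance is the fact that read has a specified outcome (raising an error) even after that transition. Other policies, such as repairing the value before returning it, would be equally well defined.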