chapter three

3 Failure tolerance

This chapter covers

Failure and failure tolerance
Failure detection and mitigation
Application- and platform-level failures
Transient, intermittent, and permanent failures
An ideal failure-handling strategy

Having defined the notion of system models, discussed widely used system models including synchronous and asynchronous distributed systems, and explored the concepts of order, physical time, and logical time, we can now turn to failure, failure tolerance, and failure handling: in short, ways to think about failure. As you read this chapter, keep in mind that the primary objective of thinking about failure is to ensure failure tolerance, the guarantee that a distributed system functions in a well-defined manner even when failures occur.

The topic of failure, failure tolerance, and failure handling in distributed computing is broad, encompassing a significant body of theoretical and practical work. This chapter is therefore divided into two main sections: the first explores thinking about failure in theoretical terms, and the second explores thinking about failure in practical terms.

3.1 In theory

Informally, a failure is an unwanted but possible state transition of a system. On failure, the system transitions from a good state to a bad state. Failure tolerance is the ability of a system to behave in a well-defined manner when the system is in a bad state.
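
To make the informal definition concrete, the following minimal Python sketch models a system as a small state machine. The state names, the event names (`work`, `crash`, `recover`), and the transition table are hypothetical choices made for illustration only; they are not part of any formal model introduced here. The `crash` event stands for an unwanted but possible transition from a good state to a bad state, and the defined `recover` event stands for one possible form of well-defined behavior in the bad state.

```python
from enum import Enum


class State(Enum):
    GOOD = "good"
    BAD = "bad"


# Hypothetical transition table: (current state, event) -> next state.
# "crash" is the unwanted but possible transition (a failure).
# "recover" is one assumed example of well-defined behavior in a bad state.
TRANSITIONS = {
    (State.GOOD, "work"): State.GOOD,
    (State.GOOD, "crash"): State.BAD,
    (State.BAD, "recover"): State.GOOD,
}


def step(state: State, event: str) -> State:
    """Apply one event; (state, event) pairs without an entry are rejected."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"no defined behavior for {event!r} in state {state.name}"
        ) from None


def run(events: list[str], start: State = State.GOOD) -> list[State]:
    """Return the sequence of states visited, including the start state."""
    trace = [start]
    for event in events:
        trace.append(step(trace[-1], event))
    return trace


if __name__ == "__main__":
    # The system works, fails, and then behaves in a defined way again.
    for state in run(["work", "crash", "recover"]):
        print(state.name)
```

In this sketch, the system counts as failure tolerant only to the extent that every event it may encounter in the `BAD` state has a defined outcome; an event with no entry in the table is exactly the kind of undefined behavior that failure tolerance is meant to rule out.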

3.2 Types of failure tolerance

3.2.1 Masking failure tolerance

3.2.2 Nonmasking failure tolerance

3.2.3 Fail-safe failure tolerance

3.2.4 None of the above

3.3 In practice

3.3.1 System model

3.3.2 Failure handling

3.3.3 Failure classification

3.3.4 Failure detection

3.3.5 Failure mitigation

3.3.6 Putting everything together

Summary