5 Failure tolerance

This chapter covers

Failure and failure tolerance
Failure detection and failure mitigation
Application-level and platform-level failures
Transient, intermittent, and permanent failures
An ideal failure-handling strategy

Now that we have defined system models, discussed widely used ones such as synchronous and asynchronous distributed systems, and explored order, physical time, and logical time, we can turn to failure, failure tolerance, and failure handling: in short, how to think about failure.

While reading this chapter, keep in mind that the primary objective of thinking about failure is to ensure failure tolerance, which refers to the guarantee that a distributed system functions in a well-defined manner even when failures occur.

Note

The terms fault, error, and failure are used with considerable ambiguity. What one author calls a fault, another may call an error or a failure. Although some authors attempt to distinguish between these terms, there is no universally accepted definition. To reduce confusion, this book simply uses the term failure. As a result, we also use less common derived terms, such as failure tolerance instead of the more common fault tolerance.

5 Failure tolerance

This chapter covers

Note

5.1 In theory

5.1.1 Types of failure tolerance

5.2 In practice

5.2.1 System model

5.2.2 Failure handling

5.2.3 Failure classification

5.2.4 Failure detection

5.2.5 Failure mitigation

5.2.6 Putting everything together

5.3 Summary
