5 Failure tolerance
This chapter covers
- Failure and failure tolerance
- Failure detection and failure mitigation
- Application-level and platform-level failures
- Transient, intermittent, and permanent failures
- An ideal failure-handling strategy
Now that we have defined the notion of system models, discussed widely used system models such as synchronous and asynchronous distributed systems, and examined the concepts of order, physical time, and logical time, we can turn to failure, failure tolerance, and failure handling: in short, how to think about failure.
While reading this chapter, keep in mind that the primary objective of thinking about failure is to ensure failure tolerance, which refers to the guarantee that a distributed system functions in a well-defined manner even when failures occur.
Note
The terms fault, error, and failure are subject to considerable ambiguity. What one author calls a fault, another may call an error or a failure. Although some authors attempt to distinguish between these terms, there is no universally accepted definition. To reduce confusion, this book uses the single term failure throughout. As a consequence, we also use some less common terms, such as failure tolerance instead of the more widespread fault tolerance.