5 Failure tolerance

 

This chapter covers

  • Failure and failure tolerance
  • Failure detection and failure mitigation
  • Application-level and platform-level failures
  • Transient, intermittent, and permanent failures
  • An ideal failure-handling strategy

Now that we have defined system models, discussed widely used ones such as synchronous and asynchronous distributed systems, and explored the concepts of order, physical time, and logical time, we can turn to failure, failure tolerance, and failure handling: in short, how to think about failure.

While reading this chapter, keep in mind that the primary objective of thinking about failure is to ensure failure tolerance, which refers to the guarantee that a distributed system functions in a well-defined manner even when failures occur.

Note

The terms fault, error, and failure are used inconsistently. What one author calls a fault, another may call an error or a failure. Although some authors attempt to distinguish between these terms, there is no universally accepted definition. To reduce confusion, this book uses the single term failure throughout. As a result, we also use less common terms, such as failure tolerance instead of the more common fault tolerance.

5.1 In theory

 
 
 

5.1.1 Types of failure tolerance

 
 
 

5.2 In practice

 

5.2.1 System model

 
 
 
 

5.2.2 Failure handling

 

5.2.3 Failure classification

 

5.2.4 Failure detection

 

5.2.5 Failure mitigation

 
 

5.2.6 Putting everything together

 
 
 

5.3 Summary

 