16 Designing for fault tolerance

 

This chapter covers

  • What fault tolerance is and why you need it
  • Using redundancy to remove single points of failure
  • Improving fault tolerance by retrying on failure
  • Using idempotent operations to retry on failure
  • AWS service guarantees

Failure is inevitable: hard disks, networks, power, and so on all fail from time to time. But failures do not have to affect the users of your system.

A fault-tolerant system gives your users the highest quality of service. No matter what fails inside your system, users are not affected and can continue to go about their work, consume entertaining content, buy goods and services, or have conversations with friends. A few years ago, achieving fault tolerance was expensive and complicated, but with AWS, providing fault-tolerant systems is becoming an affordable standard. Nevertheless, building fault-tolerant systems is one of the most demanding disciplines in cloud computing and can be challenging at first.

16.1 Using redundant EC2 instances to increase availability

16.1.1 Redundancy can remove a single point of failure

16.1.2 Redundancy requires decoupling

16.2 Considerations for making your code fault tolerant

16.2.1 Let it crash, but also retry

16.2.2 Idempotent retry makes fault tolerance possible

16.3 Building a fault-tolerant web application: Imagery

16.3.1 The idempotent state machine

16.3.2 Implementing a fault-tolerant web service

16.3.3 Implementing a fault-tolerant worker to consume SQS messages

16.3.4 Deploying the application

Summary