Chapter 16. Designing for fault tolerance

This chapter covers

  • What is fault-tolerance and why do you need it?
  • Using redundancy to remove single points of failure
  • Retrying on failure
  • Using idempotent operations to achieve retry on failure
  • AWS service guarantees

Failure is inevitable: hard disks, networks, power, and so on all fail from time to time. Fault tolerance deals with that problem. A fault-tolerant architecture is built for failure: if a failure occurs, the system isn’t interrupted and continues to handle requests. If there is a single point of failure within your architecture, it is not fault-tolerant. You can achieve fault-tolerance by introducing redundancy into your system and by decoupling the parts of your architecture so that one side does not rely on the uptime of the other.
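To make the idea of redundancy concrete, here is a minimal Python sketch of a caller that no longer depends on a single replica. It is an illustration only, not code from this chapter: the names call_with_failover, AllReplicasFailed, broken_replica, and healthy_replica are hypothetical, and the replicas are plain functions standing in for redundant servers.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when every redundant replica failed to serve the request."""


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Try each replica in order and return the first successful result.

    With a single replica this is a single point of failure; with several,
    one failing replica no longer interrupts the caller.
    """
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    raise AllReplicasFailed(errors)


# Hypothetical replicas: one is down, the other is healthy.
def broken_replica() -> str:
    raise ConnectionError("replica unreachable")


def healthy_replica() -> str:
    return "200 OK"


if __name__ == "__main__":
    # The request is still served even though the first replica failed.
    print(call_with_failover([broken_replica, healthy_replica]))
```

Passing only broken_replica would raise AllReplicasFailed, which is the single-point-of-failure case described above. The sketch also assumes the request is safe to send to a second replica after the first attempt failed, which is where idempotent operations come in.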

The services provided by AWS offer different types of failure resilience:

  • No guarantees (single point of failure)—No requests are served in case of failure.
  • High availability—In case of failure, it takes some time until requests are served as before.
  • Fault-tolerance—In case of failure, requests are served as before without any availability issues.

16.1. Using redundant EC2 instances to increase availability

16.2. Considerations for making your code fault-tolerant

16.3. Building a fault-tolerant web application: Imagery

Summary