Chapter 13. Designing for fault-tolerance


This chapter covers

  • What fault-tolerance is and why you need it
  • Using redundancy to remove single points of failure
  • Retrying on failure
  • Using idempotent operations to make retrying on failure safe
  • AWS service guarantees

Failure is inevitable for hard disks, networks, power, and so on. Fault-tolerance deals with that problem. A fault-tolerant system is built for failure. If a failure occurs, the system isn’t interrupted, and it continues to handle requests. If your system has a single point of failure, it’s not fault-tolerant. You can achieve fault-tolerance by introducing redundancy into your system and by decoupling the parts of your system in such a way that one side doesn’t rely on the uptime of the other.
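To make the idea of redundancy concrete, here is a minimal sketch in Python. It is not taken from any AWS API: the names `call_with_failover`, `broken_server`, and `healthy_server` are hypothetical, and each "replica" is just a function standing in for a redundant server. The point is only that a request still succeeds as long as one replica is healthy.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when every redundant replica failed to handle the request."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Send the request to each replica in turn and return the first success."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            # This replica is down; remember the error and try the next one.
            errors.append(exc)
    # Only reached when no replica could serve the request.
    raise AllReplicasFailed(f"{len(errors)} replicas failed")


def broken_server(request: str) -> str:
    """Hypothetical replica that is always unreachable."""
    raise ConnectionError("server unreachable")


def healthy_server(request: str) -> str:
    """Hypothetical replica that handles every request."""
    return f"handled {request}"


if __name__ == "__main__":
    # The first replica fails, but the caller still gets an answer.
    print(call_with_failover([broken_server, healthy_server], "GET /"))
```

With a single replica in the list, the same failure would surface to the caller, which is exactly what a single point of failure means.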

The most convenient way to make your system fault-tolerant is to compose it of fault-tolerant blocks. If all blocks are fault-tolerant, the system is fault-tolerant as well. Many AWS services are fault-tolerant by default. If possible, use them. Otherwise, you’ll need to handle failures yourself.

Unfortunately, one important service isn’t fault-tolerant by default: EC2. A single virtual server isn’t fault-tolerant, so a system that uses EC2 isn’t fault-tolerant by default. But AWS provides the building blocks to deal with that issue. The solution consists of auto-scaling groups, Elastic Load Balancing (ELB), and SQS.

It’s important to differentiate among services that guarantee the following:

13.1. Using redundant EC2 instances to increase availability

13.2. Considerations for making your code fault-tolerant

13.3. Architecting a fault-tolerant web application: Imagery

13.4. Summary