Chapter 13. Designing for fault-tolerance
This chapter covers
- What fault-tolerance is and why you need it
- Using redundancy to remove single points of failure
- Retrying on failure
- Using idempotent operations to make retrying on failure safe
- AWS service guarantees
Failure is inevitable for hard disks, networks, power, and so on. Fault-tolerance deals with that problem. A fault-tolerant system is built for failure. If a failure occurs, the system isn’t interrupted, and it continues to handle requests. If your system has a single point of failure, it’s not fault-tolerant. You can achieve fault-tolerance by introducing redundancy into your system and by decoupling the parts of your system in such a way that one side doesn’t rely on the uptime of the other.
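The effect of redundancy can be illustrated with a minimal sketch. It assumes two interchangeable backends that can handle the same request; the names `call_with_failover`, `broken_backend`, and `healthy_backend` are hypothetical and exist only for this illustration, and no AWS service is involved.

```python
from typing import Callable, Sequence


class AllBackendsFailed(Exception):
    """Raised when every redundant backend failed to handle the request."""


def call_with_failover(backends: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each redundant backend in order; return the first successful response."""
    errors = []
    for backend in backends:
        try:
            return backend(request)
        except ConnectionError as exc:
            # This backend is down; remember the error and try the next one.
            errors.append(exc)
    raise AllBackendsFailed(f"{len(errors)} backend(s) failed for request {request!r}")


def broken_backend(request: str) -> str:
    # Hypothetical backend that simulates an outage.
    raise ConnectionError("backend unreachable")


def healthy_backend(request: str) -> str:
    # Hypothetical backend that is up and answers the request.
    return f"handled {request}"
```

With only `broken_backend` in the list, that backend is a single point of failure and the call raises `AllBackendsFailed`. Adding `healthy_backend` as a second, redundant backend lets the same request succeed even though the first backend is still down.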
The most convenient way to make your system fault-tolerant is to compose the system of fault-tolerant blocks. If all blocks are fault-tolerant, the system is fault-tolerant as well. Many AWS services are fault-tolerant by default. If possible, use them. Otherwise, you’ll have to deal with failures yourself.
Unfortunately, one important service isn’t fault-tolerant by default: EC2. A single virtual server isn’t fault-tolerant, so a system that uses EC2 isn’t fault-tolerant by default. But AWS provides the building blocks to deal with that issue. The solution consists of auto-scaling groups, Elastic Load Balancing (ELB), and Simple Queue Service (SQS).
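To see why a queue helps decouple the parts of a system, consider the following sketch. It uses a toy in-memory queue as a stand-in for a message queue such as SQS; it is not the SQS API, and the names `InMemoryQueue`, `produce`, and `consume_all` are assumptions made for this example.

```python
from collections import deque


class InMemoryQueue:
    """Toy stand-in for a message queue; this is not the SQS API."""

    def __init__(self):
        self._messages = deque()

    def send(self, body):
        self._messages.append(body)

    def receive(self):
        # Return the next message, or None if the queue is empty.
        return self._messages.popleft() if self._messages else None

    def __len__(self):
        return len(self._messages)


def produce(queue, jobs):
    """The producer only needs the queue to be available, not the consumer."""
    for job in jobs:
        queue.send(job)


def consume_all(queue):
    """The consumer drains whatever accumulated while it was offline."""
    processed = []
    while (job := queue.receive()) is not None:
        processed.append(job.upper())  # placeholder for real work
    return processed
```

The producer can call `produce` while no consumer is running; the jobs wait in the queue until a consumer starts and calls `consume_all`. Neither side relies on the uptime of the other, only on the queue.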
It’s important to differentiate among services that guarantee the following: