Chapter 13. Designing for fault-tolerance

This chapter covers

What fault-tolerance is and why you need it
Using redundancy to remove single points of failure
Retrying on failure
Using idempotent operations to achieve retry on failure
AWS service guarantees

Failure is inevitable for hard disks, networks, power, and so on. Fault-tolerance deals with that problem. A fault-tolerant system is built for failure. If a failure occurs, the system isn’t interrupted, and it continues to handle requests. If your system has a single point of failure, it’s not fault-tolerant. You can achieve fault-tolerance by introducing redundancy into your system and by decoupling the parts of your system in such a way that one side doesn’t rely on the uptime of the other.
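To get a feel for why redundancy helps, here is a rough back-of-the-envelope sketch. It assumes that the redundant components fail independently of each other, which real hardware sharing a rack or a power supply may not do, and it uses a made-up availability of 99% per component. The function name is hypothetical and only serves the illustration.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components.

    Assumption: failures are independent, so the system is down only
    when every copy is down at the same time.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    # 0.99 is an illustrative figure, not a number quoted by AWS.
    for copies in (1, 2, 3):
        print(copies, round(combined_availability(0.99, copies), 6))
```

Under these assumptions every additional copy shrinks the chance of a complete outage, which is the intuition behind removing single points of failure.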

The most convenient way to make your system fault-tolerant is to compose the system of fault-tolerant blocks. If all blocks are fault-tolerant, the system is fault-tolerant as well. Many AWS services are fault-tolerant by default. If possible, use them. Otherwise you’ll need to deal with the consequences.

Unfortunately, one important service isn’t fault-tolerant by default: EC2. A single virtual server isn’t fault-tolerant, so a system that runs on EC2 isn’t fault-tolerant by default either. But AWS provides the building blocks to deal with that issue: auto-scaling groups, Elastic Load Balancing (ELB), and SQS.
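The following minimal sketch illustrates the idea of decoupling through a queue. It is an in-process stand-in built on Python’s standard `queue` module, not the SQS API. The names `process_all` and `make_flaky_handler` and the `max_attempts` parameter are hypothetical. A job whose handler fails is put back on the queue so that it can be picked up again, and results are stored by job ID so that processing the same job twice would not change the outcome.

```python
import queue


def process_all(jobs, handler, max_attempts=3):
    """Drain `jobs`, putting a job back on the queue when its handler fails."""
    done = {}
    attempts = {}
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return done
        attempts[job] = attempts.get(job, 0) + 1
        try:
            # Keyed by job ID: writing the same result twice is harmless.
            done[job] = handler(job)
        except Exception:
            if attempts[job] < max_attempts:
                jobs.put(job)  # make the job available for another attempt


def make_flaky_handler():
    """Handler that fails the first time it sees each job (simulated outage)."""
    seen = set()

    def handler(job):
        if job not in seen:
            seen.add(job)
            raise RuntimeError("simulated instance failure")
        return job.upper()

    return handler


if __name__ == "__main__":
    q = queue.Queue()
    for name in ("a", "b"):
        q.put(name)
    print(process_all(q, make_flaky_handler()))
```

The producer only needs the queue to be available. Whether a particular consumer is up at that moment doesn’t matter, because a failed job simply waits for the next attempt.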

It’s important to differentiate among services that guarantee the following:

This chapter covers

13.1. Using redundant EC2 instances to increase availability

13.2. Considerations for making your code fault-tolerant

13.3. Architecting a fault-tolerant web application: Imagery

13.4. Summary