Chapter 16. Designing for fault tolerance
This chapter covers
- What is fault tolerance and why do you need it?
- Using redundancy to remove single points of failure
- Retrying on failure
- Using idempotent operations to make retrying on failure safe
- AWS service guarantees
Failure is inevitable: hard disks, networks, power, and so on all fail from time to time. Fault tolerance deals with that problem. A fault-tolerant architecture is built for failure: if a failure occurs, the system isn’t interrupted and continues to handle requests. If there is a single point of failure within your architecture, it is not fault-tolerant. You can achieve fault tolerance by introducing redundancy into your system and by decoupling the parts of your architecture so that one side does not rely on the uptime of the other.
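To make the idea of redundancy concrete, here is a minimal sketch in Python. It is not tied to any AWS API: the `Backend` class and the `serve` function are hypothetical names invented for this illustration, and a real system would usually put a load balancer or queue in front of the redundant copies rather than loop over them in client code.

```python
class Backend:
    """Hypothetical stand-in for one redundant copy of a service."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        # Simulate an outage by raising instead of answering.
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def serve(request, backends):
    """Try each redundant backend in turn; fail only if all of them fail."""
    errors = []
    for backend in backends:
        try:
            return backend.handle(request)
        except ConnectionError as exc:
            errors.append(str(exc))
    raise RuntimeError("all backends failed: " + "; ".join(errors))


# Backend "a" is down, but the request is still served by backend "b".
backends = [Backend("a", healthy=False), Backend("b")]
result = serve("GET /", backends)
print(result)  # b handled GET /
```

With only one backend in the list, the same outage would surface as an error to the caller. That is the single point of failure the redundant copy removes.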
Services differ in how well they cope with failure. Three levels can be distinguished:
- No guarantees (single point of failure): No requests are served in case of failure.
- High availability: In case of failure, it takes some time until requests are served as before.
- Fault tolerance: In case of failure, requests are served as before without any availability issues.
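The difference between the three levels can be sketched as a toy timeline. The model below is an illustration only: the function `served`, the level names, and the tick-based clock are assumptions made for this example. It also assumes that in the "no guarantees" case nobody repairs the system within the observed window.

```python
def served(level, failure_at, recovery_ticks, total_ticks):
    """Return, for each tick, whether requests are served at the given level."""
    timeline = []
    for tick in range(total_ticks):
        if tick < failure_at or level == "fault_tolerant":
            # Before the failure, or with fault tolerance: always served.
            timeline.append(True)
        elif level == "high_availability":
            # Served again once the recovery time has passed.
            timeline.append(tick >= failure_at + recovery_ticks)
        else:
            # "no_guarantees": nothing is served after the failure.
            timeline.append(False)
    return timeline


# A failure at tick 2, with a recovery time of 2 ticks, observed for 6 ticks.
for level in ("no_guarantees", "high_availability", "fault_tolerant"):
    print(level, served(level, failure_at=2, recovery_ticks=2, total_ticks=6))
```

In this model, only the fault-tolerant timeline contains no interruption. The highly available one has a gap exactly as long as the recovery time.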