13 Achieving High Availability: Availability zones, auto-scaling and CloudWatch

This chapter covers

Recovering a failed virtual machine with a CloudWatch alarm
Using auto-scaling to guarantee your virtual machines keep running
Understanding availability zones in an AWS region
Analyzing disaster-recovery requirements

Imagine you run an online shop. During the night, the hardware running your virtual machine fails. Until you get to work the next morning, your users can’t access your web shop. During the 8-hour downtime, they search for an alternative and stop buying from you. That’s a disaster for any business. Now imagine a highly available web shop. Just a few minutes after the hardware fails, the system recovers, restarts itself on new hardware, and your e-commerce website is back online, all without any human intervention. Your users can continue to shop on your site. In this chapter, we’ll teach you how to build a highly available system like that, based on EC2 instances.

Virtual machines are not highly available by default; the potential for system failure is always present. The following scenarios could cause an outage of your virtual machine:

13.1 Recovering from EC2 instance failure with CloudWatch

13.1.1 How does a CloudWatch alarm recover an EC2 instance?

13.2 Recovering from a data center outage with an Auto Scaling group

13.2.1 Availability zones: groups of isolated data centers

13.2.2 Recovering a failed virtual machine to another availability zone with the help of auto-scaling

13.2.3 Pitfall: recovering network-attached storage

13.2.4 Pitfall: network interface recovery

13.2.5 Insights into availability zones

13.3 Architecting for high availability

13.3.1 RTO and RPO comparison for a single EC2 instance

13.3.2 AWS services come with different high availability guarantees

13.4 Summary