Chapter 3. Expecting failure: fault tolerance in CoreOS
This chapter covers
- Monitoring and fault tolerance in CoreOS
- Getting your first complex service running
- Application architecture in the context of CoreOS
If you work in infrastructure or operations in any capacity, you understand the importance of monitoring systems. When the alarms go off, it’s time to figure out what’s happened. You might also have taken a crack at automating some of the most common fixes, or mitigated failures with disaster-recovery failover switches, multicasting, or a variety of other ways to react when something goes wrong. You probably also understand that technology always finds a way to break. Hardware, software, connectivity, the power grid: these are all things that wake us up in the middle of the night. If you’ve been working in operations for a while, you probably have the sense that although automating fault tolerance is possible, it’s usually risky and difficult to maintain.
CoreOS tries to solve this problem. By providing generic abstractions for the state of your application distributed over a cluster, it makes the implementation details of automating fault tolerance much clearer and more reusable. Containers already abstract the runtime from any particular machine; the next logical benefit is to make that runtime portable across a network, thus decoupling any container from the failure of its host.
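To make the idea of decoupling a container from its host a little more concrete, here’s a minimal sketch in Python. It’s a toy model, not how CoreOS is implemented: the `Cluster` class, its `schedule` and `fail_host` methods, and the host and container names are all hypothetical and exist only for illustration. The point is that once the placement of containers lives in shared cluster state rather than on any one machine, reacting to a host failure becomes a generic operation.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model of shared cluster state (hypothetical, for illustration only)."""
    hosts: set[str]
    # Maps container name -> host name it currently runs on.
    placement: dict[str, str] = field(default_factory=dict)

    def _least_loaded(self) -> str:
        """Pick the healthy host running the fewest containers."""
        if not self.hosts:
            raise RuntimeError("no healthy hosts left in the cluster")
        load = {host: 0 for host in self.hosts}
        for host in self.placement.values():
            if host in load:
                load[host] += 1
        # Sort host names first so ties are broken deterministically.
        return min(sorted(load), key=load.get)

    def schedule(self, container: str) -> str:
        """Place a container on the least-loaded healthy host."""
        host = self._least_loaded()
        self.placement[container] = host
        return host

    def fail_host(self, host: str) -> list[str]:
        """Remove a failed host and reschedule whatever was running on it."""
        self.hosts.discard(host)
        orphaned = sorted(c for c, h in self.placement.items() if h == host)
        for container in orphaned:
            del self.placement[container]
        for container in orphaned:
            self.schedule(container)
        return orphaned


# Assumed three-machine cluster with made-up host and container names.
cluster = Cluster(hosts={"core-01", "core-02", "core-03"})
for name in ("web", "db", "cache"):
    cluster.schedule(name)

# Simulate losing one machine; its containers move to the surviving hosts.
moved = cluster.fail_host("core-01")
```

Nothing in `fail_host` knows what the containers actually do. That’s the kind of reuse a shared view of cluster state is meant to buy you: the recovery logic is written once, against the abstraction, rather than once per service.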