Chapter 3. Expecting failure: fault tolerance in CoreOS

This chapter covers

Monitoring and fault tolerance in CoreOS
Getting your first complex service running
Application architecture in the context of CoreOS

If you work in infrastructure or operations in any capacity, you understand the importance of monitoring systems. When the alarms go off, it’s time to figure out what happened. You might also have taken a crack at automating some of the most common fixes, or mitigated failures with disaster-recovery failover switches, multicasting, or a variety of other ways to react to failure. You probably also understand that technology always finds a way to break. Hardware, software, connectivity, the power grid: these are all things that wake us up in the middle of the night. If you’ve been working in operations for a while, you probably have the sense that although automating fault tolerance is possible, it’s usually risky and difficult to maintain.

CoreOS tries to solve this problem. It provides generic abstractions for the state of your application distributed over a cluster, which makes the implementation details of automating fault tolerance much clearer and more reusable. Once containers have abstracted the runtime from any particular machine, the next logical benefit is to make that runtime portable across a network, thus decoupling any container from the failure of its host.

This chapter covers

3.1. The current state of monitoring

3.2. Service scheduling and discovery

3.3. Breaking things

3.4. Application architectures and CoreOS

3.5. Summary