9 What could possibly go wrong?

This chapter covers

  • Enumerating potential failures
  • Exploring options for recovering from failures
  • Implementing task health checks to recover from task crashes

At the beginning of chapter 4, where we started implementing our worker, we discussed the scenario of running a web server that serves static pages. In that scenario, we considered how to deal with our site growing in popularity: to keep serving our growing user base, the site needed to be resilient to failures. The solution, we said, was to run multiple instances of our web server. In other words, we decided to scale horizontally, a common scaling pattern. By scaling the number of web servers, we ensure that a failure in any single instance does not bring the site down completely, leaving it unavailable to our users.

In this chapter, we’re going to modify this scenario slightly. Instead of serving static web pages, we’re going to serve an API. This API is very simple: it takes a POST request with a body, and it returns a response with the same body. In other words, it simply echoes the request in the response.

With that minor change to our scenario, this chapter will reflect on what we've built thus far and discuss a number of failure scenarios, both in our orchestrator and in the tasks running on it. Then we will implement several mechanisms for handling a subset of those failures.

9.1 Overview of our new scenario

9.2 Failure scenarios

9.2.1 Application startup failure

9.2.2 Application bugs

9.2.3 Task startup failures due to resource problems

9.2.4 Task failures due to Docker daemon crashes and restarts

9.2.5 Task failures due to machine crashes and restarts

9.2.6 Worker failures

9.2.7 Manager failures

9.3 Recovery options

9.3.1 Recovering from application failures

9.3.2 Recovering from environmental failures

9.3.3 Recovering from task-level failures