Chapter 6. Fault tolerance with Supervisors
This chapter covers
- Using the OTP Supervisor behavior
- Working with Erlang Term Storage (ETS)
- Using Supervisors with normal processes and other OTP behaviors
- Implementing a basic worker-pool application
In the previous chapter, you built a naïve Supervisor made from primitives provided by the Elixir language: monitors, links, and processes. You should now have a good understanding of how Supervisors work under the hood.
After teasing you in the previous chapter, in this chapter I’ll finally show you how to use the real thing: the OTP Supervisor behavior. The sole responsibility of a Supervisor is to observe its attached child processes and take some action when one of them goes down.
The OTP version offers a few more bells and whistles than your previous implementation of a Supervisor. Take restart strategies, for example, which dictate how a Supervisor should restart the children if something goes wrong. Supervisor also offers options for limiting the number of restarts within a specific timeframe; this is especially useful for preventing infinite restarts.
To really understand Supervisors, it’s important to try them for yourself. Therefore, instead of boring you with every single Supervisor option, I’ll walk you through building the worker-pool application shown in its full glory (courtesy of the Observer application) in figure 6.1.
In the figure, Pooly.Supervisor is the top-level Supervisor. It supervises another Supervisor (PoolsSupervisor) and a GenServer (Pooly.Server). PoolsSupervisor in turn supervises three other PoolSupervisors, each with a unique name. Each PoolSupervisor supervises a worker Supervisor (represented by its process id) and a GenServer. Finally, the workers do the grunt work. If you’re wondering what the GenServers are for, they’re primarily needed to maintain state for the Supervisor at the same level. For example, the GenServer next to each PoolSupervisor helps maintain the state for that PoolSupervisor.
You’re going to build a worker pool over the course of two chapters. What is a worker pool? It’s something that manages a pool (surprise!) of workers. You might use a worker pool to manage access to a scarce resource. It could be a pool of Redis connections, web-socket connections, or even GenServer workers.
For example, suppose you spawn 1 million processes, and each process needs a connection to the database. It’s impractical to open 1 million database connections. To get around this, you can create a pool of database connections. Each time a process needs a database connection, it will issue a request to the pool. Once the process is done with the database connection, it’s returned to the pool. In effect, resource allocation is delegated to the worker-pool application.
The worker-pool application you’ll build is not trivial. If you’re familiar with the Poolboy library, much of its design has been adapted for this example. (No worries if you haven’t heard of or used Poolboy; it isn’t a prerequisite.)
This will be a rewarding exercise because it will get you thinking about concepts and issues that wouldn’t arise in simpler examples. You’ll get hands-on with the Supervisor API, too. As such, this example is slightly more challenging than the previous examples. Some of the code/design may not be obvious, but that’s mostly because you don’t have the benefit of hindsight. But fret not—I’ll guide you every step of the way. All I ask is that you work through the code by typing it on your computer; enlightenment will be yours by the end of chapter 7!
You’ll evolve the design of Pooly through four versions. This chapter covers the fundamentals of Supervisor and starts you building a basic version (version 1) of Pooly. Chapter 7 is completely focused on building Pooly’s various features. Table 6.1 lists the characteristics of each version of Pooly.
Table 6.1. The changes that Pooly will undergo across four versions

Version | Characteristics
---|---
1 | Supports a single pool. Supports a fixed number of workers. No recovery when consumer and/or worker processes fail.
2 | Supports a single pool. Supports a fixed number of workers. Recovery when consumer and/or worker processes fail.
3 | Supports multiple pools. Supports a variable number of workers.
4 | Supports multiple pools. Supports a variable number of workers. Variable-sized pool allows for worker overflow. Queuing for consumer processes when all workers are busy.
To give you an idea how the design will evolve, figure 6.2 illustrates versions 1 and 2, and figure 6.3 illustrates versions 3 and 4. Rectangles represent Supervisors, ovals represent GenServers, and circles represent the worker processes. From the figures, it should be obvious why it’s called a supervision tree.
Before we get into the actual coding, it’s instructive to see how to use Pooly. This section uses version 1.
In order to start a pool, you must give it a pool configuration that provides the information needed for Pooly to initialize the pool:
pool_config = [ mfa: {SampleWorker, :start_link, []}, size: 5 ]
This tells the pool to create five SampleWorkers. To start the pool, do this:
Pooly.start_pool(pool_config)
In Pooly lingo, checking out a worker means requesting and getting a worker from the pool. The return value is a pid of an available worker:
worker_pid = Pooly.checkout
Once a consumer process has a worker_pid, the process can do whatever it wants with it. What happens if no more workers are available? For now, :noproc is returned. You’ll have more sophisticated ways of handling this in later versions.
Once a consumer process is done with the worker, the process must return it to the pool, also known as checking in the worker. Checking in a worker is straightforward:
Pooly.checkin(worker_pid)
Sometimes it’s helpful to get some information about the pool:
Pooly.status
For now, this returns a tuple such as {3, 2}. This means there are three free workers and two busy ones. That concludes our short tour of the API.
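Before moving on, here’s how those calls fit together in a typical consumer session (a sketch, using the pool configuration shown above):

pool_config = [mfa: {SampleWorker, :start_link, []}, size: 5]
{:ok, _sup} = Pooly.start_pool(pool_config)   # start a pool of five SampleWorkers

worker_pid = Pooly.checkout                   # grab a free worker
# ... do some work with worker_pid ...
Pooly.checkin(worker_pid)                     # hand the worker back when done

Pooly.status                                  # e.g. {5, 0} once everything is checked back in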
Go to your favorite directory and create a new project with mix:
% mix new pooly
Note
The source code for the different versions of this project has been split into branches. For example, to check out version 3, cd into the project folder and do a git checkout version-3.
mix and the --sup option
You may be aware that mix includes an option called --sup. This option generates an OTP application skeleton including a supervision tree. If this option is left out, the application is generated without a Supervisor and application callback. For example, you may be tempted to create Pooly like so:

% mix new pooly --sup
But because you’re just learning, you’ll opt for the flagless version.
The first version of Pooly will support only a single pool of fixed workers. There will also be no recovery handling when either the consumer or the worker process fails. By the end of this version, Pooly will look like figure 6.4.
As you can see, the application consists of a top-level Supervisor (Pooly.Supervisor) that supervises two other processes: a GenServer process (Pooly.Server) and a worker Supervisor (Pooly.WorkerSupervisor). Recall from chapter 5 that Supervisors can themselves be supervised because Supervisors are processes.
How do I begin?
Whenever I’m designing an Elixir program that may have many supervision hierarchies, I always make a sketch first. That’s because (as you’ll find out soon) there are quite a few things to keep straight. Probably more so than in other languages, you must have a rough design in mind, which forces you to think slightly ahead.
Figure 6.5 illustrates how Pooly version 1 works. When it starts, only Pooly.Server is attached to Pooly.Supervisor. When the pool is started with a pool configuration, Pooly.Server first verifies that the pool configuration is valid. After that, it sends a :start_worker_supervisor message to Pooly.Supervisor. This message instructs Pooly.Supervisor to start Pooly.WorkerSupervisor. Finally, Pooly.WorkerSupervisor is told to start a number of worker processes based on the size specified in the pool configuration.
You’ll first create a worker Supervisor. This Supervisor is in charge of monitoring all the spawned workers in the pool. Create worker_supervisor.ex in lib/pooly. Just like a GenServer behavior (or any other OTP behavior, for that matter), you use the Supervisor behavior like this:
defmodule Pooly.WorkerSupervisor do
  use Supervisor
end
Listing 6.1 defines the good old start_link/1 function that serves as the main entry point when creating a Supervisor process. This start_link/1 function is a wrapper function that calls Supervisor.start_link/2, passing in the module name and the arguments.
As with GenServer, after the call to Supervisor.start_link/2, you implement the corresponding init/1 callback function. The arguments passed to Supervisor.start_link/2 are then passed to the init/1 callback.
Listing 6.1. Validating and destructuring arguments (lib/pooly/worker_supervisor.ex)
defmodule Pooly.WorkerSupervisor do
  use Supervisor

  #######
  # API #
  #######

  def start_link({_,_,_} = mfa) do            #1
    Supervisor.start_link(__MODULE__, mfa)
  end

  #############
  # Callbacks #
  #############

  def init({m,f,a}) do                        #2
    # ...
  end
end
You first declare that start_link takes a three-element tuple: the module, a function, and a list of arguments of the worker process. Notice the beauty of pattern matching at work here. Saying {_,_,_} = mfa essentially does two things. First, it asserts that the input argument must be a three-element tuple. Second, the input argument is referenced by mfa. You could have written it as {m,f,a}. But because you aren’t using the individual elements, you pass along the entire tuple using mfa.
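To see what the {_,_,_} = mfa match buys you, here’s a quick iex sketch (not part of Pooly):

iex> {_, _, _} = {SampleWorker, :start_link, []}
{SampleWorker, :start_link, []}

iex> {_, _, _} = {SampleWorker, :start_link}
** (MatchError) no match of right hand side value: {SampleWorker, :start_link}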
mfa is then passed along to the init/1 callback. This time, you need to use the individual elements of the tuple, so you assert that the expected input argument is {m,f,a}. The init/1 callback is where the actual initialization occurs.
Let’s take a closer look at the init/1 callback in the next listing, where most of the interesting bits happen in a Supervisor.
Listing 6.2. Initializing the Supervisor (lib/pooly/worker_supervisor.ex)
defmodule Pooly.WorkerSupervisor do

  #############
  # Callbacks #
  #############

  def init({m,f,a} = x) do
    worker_opts = [restart:  :permanent,          #1
                   function: f]                   #2

    children = [worker(m, a, worker_opts)]        #3

    opts = [strategy:     :simple_one_for_one,    #4
            max_restarts: 5,                      #4
            max_seconds:  5]                      #4

    supervise(children, opts)                     #5
  end
end
Let’s decipher the listing. In order for a Supervisor to initialize its children, you must give it a child specification. A child specification (covered briefly in chapter 5) is a recipe for the Supervisor to spawn its children.
The child specification is created with Supervisor.Spec.worker/3. The Supervisor.Spec module is imported by the Supervisor behavior by default, so there’s no need to supply the fully qualified version.
The return value of the init/1 callback must be a supervisor specification. In order to construct a supervisor specification, you use the Supervisor.Spec.supervise/2 function.
supervise/2 takes two arguments: a list of children and a keyword list of options. In listing 6.2, these are represented by children and opts, respectively. Before you get into defining children, let’s discuss the second argument to supervise/2.
The example defines the following options to supervise/2:
opts = [strategy: :simple_one_for_one, max_restarts: 5, max_seconds: 5]
You can set a few options here. The most important is the restart strategy, which we’ll look at next.
Restart strategies dictate how a Supervisor restarts a child/children when something goes wrong. In order to define a restart strategy, you include a strategy key. There are four kinds of restart strategies:
- :one_for_one
- :one_for_all
- :rest_for_one
- :simple_one_for_one
Let’s take a quick look at each of them.
With :one_for_one, if a process dies, only that process is restarted. None of the other processes are affected.

With :one_for_all, just like the Three Musketeers, if any process dies, all the processes in the supervision tree die along with it. After that, all of them are restarted. This strategy is useful if all the processes in the supervision tree depend on each other.

With :rest_for_one, if one of the processes dies, the processes that were started after it are terminated. After that, the process that died and the rest of the child processes are restarted. Think of it like a row of dominoes falling in one direction.
The previous three strategies are used to build a static supervision tree. This means the workers are specified up front via the child specification.
In :simple_one_for_one, you specify only one entry in the child specification. Every child process that’s spawned from this Supervisor is the same kind of process.
The best way to think about the :simple_one_for_one strategy is like a factory method (or a constructor in OOP languages), where the workers that are produced are alike. :simple_one_for_one is used when you want to dynamically create workers.
The Supervisor starts out with no workers; workers are then dynamically attached to it. Next, let’s look at the other options that allow you to fine-tune the behavior of Supervisors.
max_restarts and max_seconds translate to the maximum number of restarts the Supervisor can tolerate within a maximum number of seconds before it gives up and terminates. Why have these options? The main reason is that you don’t want your Supervisor to infinitely restart its children when something is genuinely wrong (such as a programmer error). Therefore, you may want to specify a threshold at which the Supervisor should give up. Note that by default, max_restarts and max_seconds are set to 3 and 5 respectively. In listing 6.2, you specify that the Supervisor should give up if there are more than five restarts within five seconds.
It’s now time to learn how to define children. In the example code, the children are specified in a list:
children = [worker(m, a, worker_opts)]
What does this tell you? It says that this Supervisor has one child, or one kind of child in the case of a :simple_one_for_one restart strategy. (It doesn’t make sense to define multiple workers when in general you don’t know how many workers you want to spawn when using a :simple_one_for_one restart strategy.)
The worker/3 function creates a child specification for a worker, as opposed to its sibling supervisor/3. This means if the child isn’t a Supervisor, you should use worker/3. If you’re supervising a Supervisor, then use supervisor/3. You’ll use both variants shortly.
Both variants take the module, arguments, and options. The first two are exactly what you’d expect. The third argument is more interesting.
When you leave out the options
children = [worker(m, a)]
Elixir will supply the following options by default:
[id: module, function: :start_link, restart: :permanent, shutdown: 5000, modules: [module]]
function should be obvious: it’s the f of mfa. Sometimes a worker’s main entry point is some function other than start_link; this is the place to specify the custom function to be called.
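For example, a worker whose entry point is a hypothetical MyWorker.start/1 rather than start_link could be declared like this (MyWorker and args are placeholders):

# Supervisor.Spec.worker/3 with a custom entry-point function
children = [worker(MyWorker, [args], function: :start)]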
You’ll use two restart values throughout the Pooly application:
- :permanent—The child process is always restarted.
- :temporary—The child process is never restarted.
In worker_opts, you specify :permanent. This means any crashed worker is always restarted.
To test this, you need a sample worker. Create sample_worker.ex in lib/pooly and fill it with the code in the following listing.
Listing 6.3. Worker used to test Pooly (lib/pooly/sample_worker.ex)
defmodule SampleWorker do
  use GenServer

  def start_link(_) do
    GenServer.start_link(__MODULE__, :ok, [])
  end

  def stop(pid) do
    GenServer.call(pid, :stop)
  end

  def handle_call(:stop, _from, state) do
    {:stop, :normal, :ok, state}
  end
end
SampleWorker is a simple GenServer that does little except provide functions that control its lifecycle. Start an iex session inside the project (iex -S mix), and take the worker Supervisor for a spin:

iex> {:ok, worker_sup} = Pooly.WorkerSupervisor.start_link({SampleWorker, :start_link, []})
Now you can create a child:
iex> Supervisor.start_child(worker_sup, [[]])
The return value is a two-element tuple that looks like {:ok, #PID<0.132.0>}.
Add a few more children to the Supervisor. Next, let’s see all the children that the worker Supervisor is supervising, using Supervisor.which_children/1:
iex> Supervisor.which_children(worker_sup)
The result is a list that looks like this:
[{:undefined, #PID<0.98.0>, :worker, [SampleWorker]},
 {:undefined, #PID<0.101.0>, :worker, [SampleWorker]}]
You can also count the number of children:
iex> Supervisor.count_children(worker_sup)
The return result should be self-explanatory:
%{active: 2, specs: 1, supervisors: 0, workers: 2}
Now to see the Supervisor in action! Create another child, but this time, save a reference to it:
iex> {:ok, worker_pid} = Supervisor.start_child(worker_sup, [[]])
Supervisor.which_children(worker_sup) should look like this:
[{:undefined, #PID<0.98.0>, :worker, [SampleWorker]},
 {:undefined, #PID<0.101.0>, :worker, [SampleWorker]},
 {:undefined, #PID<0.103.0>, :worker, [SampleWorker]}]
Now, stop the worker:

iex> SampleWorker.stop(worker_pid)
Let’s inspect the state of the worker Supervisor’s children:
iex(8)> Supervisor.which_children(worker_sup)
[{:undefined, #PID<0.98.0>, :worker, [SampleWorker]},
 {:undefined, #PID<0.101.0>, :worker, [SampleWorker]},
 {:undefined, #PID<0.107.0>, :worker, [SampleWorker]}]
Whoo-hoo! The Supervisor automatically restarted the stopped worker! I still get a warm, fuzzy feeling whenever a Supervisor restarts a failed child automatically. Getting something similar in other languages usually requires a lot more work. Next, we’ll look at implementing Pooly.Server.
In this section, you’ll work on the brains of the application. In general, you want to leave the Supervisor with as little logic as possible because less code means a smaller chance of things breaking.
Therefore, you’ll introduce a GenServer process that will handle most of the interesting logic. The server process must communicate with both the top-level Supervisor and the worker Supervisor. One way is to use named processes, as shown in figure 6.6.
In this case, both processes can refer to each other by their respective names. But a more general solution is to have the server process contain a reference to the top-level Supervisor and the worker Supervisor as part of its state (see figure 6.7). Where will the server get references to both supervisors? When the top-level Supervisor starts the server, the Supervisor can pass its own pid to the server. This is exactly what you’ll do when you get to the implementation of the top-level Supervisor.
Now, because the server has a reference to the top-level Supervisor, the server can tell it to start a child using the Pooly.WorkerSupervisor module. The server will pass in the relevant bits of the pool configuration and Pooly.WorkerSupervisor will handle the rest.
The server process also maintains the state of the pool. You already know that the server has to store references to the top-level Supervisor and the worker Supervisor. What else should it store? For starters, it needs to store details about the pool, such as what kind of workers to create and how many of them. The pool configuration provides this information.
The server accepts a pool configuration that comes in a keyword list. In this version, an example pool configuration looks like this:
[mfa: {SampleWorker, :start_link, []}, size: 5]
As I mentioned earlier, the key mfa stands for module, function, and list of arguments of the pool of worker(s) to be created. size is the number of worker processes to create.
Enough jibber-jabber[1]— let’s see some code! Create a file called server.ex, and place it in lib/pooly.
1 This was written with the voice of Mr. T in mind.
For now, you’ll make Pooly.Server a named process, which means you can reference the server process using the module name (Pooly.Server.status instead of Pooly.Server.status(pid)). The next listing shows how this is done.
Listing 6.4. Starting the server process with sup and pool_config (lib/pooly/server.ex)
defmodule Pooly.Server do
  use GenServer
  import Supervisor.Spec

  #######
  # API #
  #######

  def start_link(sup, pool_config) do
    GenServer.start_link(__MODULE__, [sup, pool_config], name: __MODULE__)
  end
end
The server process needs both the reference to the top-level Supervisor process and the pool configuration, which you pass in as [sup, pool_config]. Now you need to implement the init/1 callback. The init/1 callback has two responsibilities: validating the pool configuration and initializing the state, as all good init callbacks do.
A valid pool configuration looks like this:
[mfa: {SampleWorker, :start_link, []}, size: 5]
This is a keyword list with two keys, mfa and size. Any other key will be ignored. As the function goes through the pool-configuration keyword list, the state is gradually built up, as shown in the next listing.
Listing 6.5. Setting up the server state (lib/pooly/server.ex)
defmodule Pooly.Server do
  use GenServer

  defmodule State do                                  #1
    defstruct sup: nil, size: nil, mfa: nil
  end

  #############
  # Callbacks #
  #############

  def init([sup, pool_config]) when is_pid(sup) do    #2
    init(pool_config, %State{sup: sup})
  end

  def init([{:mfa, mfa}|rest], state) do              #3
    init(rest, %{state | mfa: mfa})
  end

  def init([{:size, size}|rest], state) do            #4
    init(rest, %{state | size: size})
  end

  def init([_|rest], state) do                        #5
    init(rest, state)
  end

  def init([], state) do                              #6
    send(self, :start_worker_supervisor)              #7
    {:ok, state}
  end
end
This listing sets up the state of the server. First you declare a struct that serves as a container for the server’s state. Next is the callback that runs when GenServer.start_link/3 is invoked.

The init/1 callback receives the pid of the top-level Supervisor along with the pool configuration. It then calls init/2, which is given the pool configuration along with a new state that contains the pid of the top-level Supervisor.

Each element in a keyword list is represented by a two-element tuple, where the first element is the key and the second element is the value. For now, you’re interested in remembering the mfa and size values of the pool configuration. If you want to add more fields to the state, you add more function clauses with the appropriate pattern. You ignore any options that you don’t care about.

Finally, once you’ve gone through the entire list, you expect that the state has been initialized. Remember that one of the valid return values of init/1 is {:ok, state}. Because init/1 calls init/2, and the empty-list case is the last function clause invoked, it should return {:ok, state}.

What about the curious-looking send(self, :start_worker_supervisor) line? Once you reach the empty-list clause, you’re confident that the state has been built. That’s when you can start the worker Supervisor that you implemented previously. The server process is sending a message to itself. Because send/2 returns immediately, the init/1 callback isn’t blocked. You don’t want init/1 to time out, do you?
The number of init/1 functions can look overwhelming, but don’t fret. Individually, each function is as small as it gets. Without pattern matching in the function arguments, you’d need to write a large conditional to capture all the possibilities.
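For comparison, a single-clause version would have to branch on each key itself. A rough sketch (not used in Pooly) might look like this:

def init(pool_config, state) do
  new_state =
    Enum.reduce(pool_config, state, fn {key, value}, acc ->
      case key do
        :mfa  -> %{acc | mfa: value}
        :size -> %{acc | size: value}
        _     -> acc                     # ignore options you don't care about
      end
    end)

  send(self, :start_worker_supervisor)
  {:ok, new_state}
end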
When the server process sends a message to itself using send/2, the message is handled using handle_info/2, as shown in the next listing.
Listing 6.6. Callback handler to start the worker Supervisor (lib/pooly/server.ex)
defmodule Pooly.Server do

  defmodule State do
    # the State struct grows to track the worker Supervisor and the workers
    defstruct sup: nil, worker_sup: nil, size: nil, workers: nil, mfa: nil
  end

  #############
  # Callbacks #
  #############

  def handle_info(:start_worker_supervisor, state = %{sup: sup, mfa: mfa, size: size}) do
    {:ok, worker_sup} = Supervisor.start_child(sup, supervisor_spec(mfa))   #1
    workers = prepopulate(size, worker_sup)                                 #2
    {:noreply, %{state | worker_sup: worker_sup, workers: workers}}         #3
  end

  #####################
  # Private Functions #
  #####################

  defp supervisor_spec(mfa) do
    opts = [restart: :temporary]
    supervisor(Pooly.WorkerSupervisor, [mfa], opts)                         #4
  end
end
There’s quite a bit going on in this listing. Because the state of the server process contains the top-level Supervisor pid (sup), you invoke Supervisor.start_child/2 with that pid and a Supervisor specification. The return value is a tuple whose second element is the pid of the newly created worker Supervisor (worker_sup). You then use worker_sup to start size number of workers. Finally, you update the state with the worker Supervisor pid and the newly created workers.

The Supervisor specification consists of a worker Supervisor as a child. Notice that instead of
worker(Pooly.WorkerSupervisor, [mfa], opts)
you use the Supervisor variant:
supervisor(Pooly.WorkerSupervisor, [mfa], opts)
Here, you pass restart: :temporary in the Supervisor specification’s options. This means the top-level Supervisor won’t automatically restart the worker Supervisor. That seems a bit odd. Why? The reason is that you want to do more than simply have the Supervisor restart the child. Because you want custom recovery rules, you turn off the Supervisor’s default behavior of automatically restarting a downed child with restart: :temporary.
Note that this version doesn’t deal with worker recovery if crashes occur. The later versions will fix this. Let’s deal with prepopulating workers next.
Given a size option in the pool configuration, the worker Supervisor can prepopulate itself with a pool of workers. The prepopulate/2 function in the following listing takes a size and the worker Supervisor pid and builds a list of size number of workers.
Listing 6.7. Prepopulating the worker Supervisor (lib/pooly/server.ex)
defmodule Pooly.Server do

  #####################
  # Private Functions #
  #####################

  defp prepopulate(size, sup) do
    prepopulate(size, sup, [])
  end

  defp prepopulate(size, _sup, workers) when size < 1 do
    workers
  end

  defp prepopulate(size, sup, workers) do
    prepopulate(size-1, sup, [new_worker(sup) | workers])   #1
  end

  defp new_worker(sup) do
    {:ok, worker} = Supervisor.start_child(sup, [[]])       #2
    worker
  end
end
The new_worker/1 function in listing 6.7 is worth a look. Here, you use Supervisor.start_child/2 again to spawn the worker processes. Instead of passing in a child specification, you pass in a list of arguments.
The two flavors of Supervisor.start_child/2
There are two flavors of Supervisor.start_child/2. The first takes a child specification:

Supervisor.start_child(sup, child_spec)
The other flavor takes a list of arguments:

Supervisor.start_child(sup, [arg1, arg2, ...])
Which flavor should you use? Pooly.WorkerSupervisor uses a :simple_one_for_one restart strategy. This means the child specification has already been predefined, which means the first flavor is out—the second one is what you want.

The second version lets you pass additional arguments to the worker. Under the hood, the arguments defined in the child specification when creating Pooly.WorkerSupervisor are concatenated with the list passed to Supervisor.start_child/2, and the result is then passed along to the worker process during initialization.
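Concretely, for SampleWorker the child-spec argument list is [], so the concatenation works out like this (a sketch, reusing names from the listings above):

# Child spec defined in Pooly.WorkerSupervisor:
children = [worker(SampleWorker, [], worker_opts)]

# Dynamically starting a child appends the extra arguments,
# so this call...
Supervisor.start_child(worker_sup, [[]])

# ...ends up invoking apply(SampleWorker, :start_link, [] ++ [[]]),
# which is SampleWorker.start_link([]).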
The return result of new_worker/1 is the pid of the newly created worker. You haven’t yet implemented a way to get a worker out of a pool or put a worker back into the pool. These two actions are also known as checking out and checking in a worker, respectively. But before you do that, we need to take a brief detour and talk about ETS.
Just enough ETS
In this chapter and the next, you’ll use Erlang Term Storage (ETS). This sidebar will give you just enough background to understand the ETS-related code in this chapter and the next.
ETS is in essence a very efficient in-memory database built specially to store Erlang/Elixir data. It can store large amounts of data without breaking a sweat, and data access is done in constant time. It comes free with Erlang, which means you access it from Elixir through the :ets module.
CREATING A NEW ETS TABLE
You create a table using :ets.new/2. Let’s create a table to store my Mum’s favorite artists, their date of birth, and the genre in which they perform:

iex> :ets.new(:mum_faves, [])
12308
The most basic form takes an atom representing the name of the table and an empty list of options. The return value of :ets.new/2 is a table ID, which is akin to a pid. The process that created the ETS table is called the owner process. In this case, the iex process is the owner. The most common options are related to the ETS table’s type, its access rights, and whether it’s named.
ETS TABLE TYPES
ETS tables come in four flavors:
- :set—The default. It has the characteristics of the set data structure you may have learned about in CS101: unordered, with each unique key mapping to an element.
- :ordered_set—A sorted version of :set.
- :bag—Rows with the same keys are allowed, but the rows must be different.
- :duplicate_bag—Same as :bag but without the row-uniqueness restriction.
In this chapter and the next, you’ll use :set, which essentially means you don’t have to specify the table type in the list of options. If you wanted to be specific, you’d create the table like so:

iex> :ets.new(:mum_faves, [:set])
ACCESS RIGHTS
Access rights control which processes can read from and write to the ETS table. There are three options:
- :protected—The owner process has full read and write permissions. All other processes can only read from the table. This is the default.
- :public—There are no restrictions on reading and writing.
- :private—Only the owner process can read from and write to the table.
You’ll use :private tables in this chapter because you’ll be storing pool-related data that other pools have no business knowing about. Let’s say my Mum is shy about her eclectic music tastes, and she wants to make the table private:

iex> :ets.new(:mum_faves, [:set, :private])
NAMED TABLES
When you created the ETS table, you supplied an atom. This is slightly misleading, because you can’t use :mum_faves to refer to the table without supplying the :named_table option. Therefore, to use :mum_faves instead of an unintelligible reference like 12308, you can do this:

iex> :ets.new(:mum_faves, [:set, :private, :named_table])
Note that if you try to run this line again, you’ll get an ArgumentError, because a name must uniquely identify an ETS table.
INSERTING AND DELETING DATA
You insert data using the :ets.insert/2 function. The first argument is the table identifier (the number or the name), and the second is the data. The data comes in the form of a tuple, where the first element is the key and the second can be any arbitrarily nested term. Here are a few of Mum’s favorites:
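# a minimal sketch; the artists are the ones referenced later in this sidebar
iex> :ets.insert(:mum_faves, {"Michael Bolton", 1953, :pop})
true
iex> :ets.insert(:mum_faves, {"Jim Reeves", 1923, :country})
true
iex> :ets.insert(:mum_faves, {"Justin Bieber", 1994, :pop})
true
iex> :ets.insert(:mum_faves, {"Cyndi Lauper", 1953, :pop})
true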
You can look at what’s in the table using :ets.tab2list/1:

iex> :ets.tab2list(:mum_faves)
Note that the return result is a list, and the elements in the list are unordered. All right, I lied. My Mum isn’t really a Justin Bieber fan.[a] Let’s rectify this:

iex> :ets.delete(:mum_faves, "Justin Bieber")
true
a She isn’t a Cyndi Lauper fan, either, but I was listening to “Girls Just Want to Have Fun” while writing this.
LOOKING UP DATA
A table is of no use if you can’t retrieve data. The simplest way to do that is to use the key. What’s Michael Bolton’s birth year? Let’s find out:
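# assuming the sample entries sketched above
iex> :ets.lookup(:mum_faves, "Michael Bolton")
[{"Michael Bolton", 1953, :pop}]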
Why is the result a list? Recall that ETS supports other types, such as :duplicate_bag, which allows for duplicated rows. Therefore, the most general data structure to represent this is the humble list.
What if you want to search by the year instead? You can use :ets.match/2:
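# with the sample entries sketched above (Bolton and Lauper were both born in 1953)
iex> :ets.match(:mum_faves, {:"$1", 1953, :"$2"})
[["Michael Bolton", :pop], ["Cyndi Lauper", :pop]]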
You pass in a pattern, which looks slightly strange at first. Because you’re only querying using the year, you use :"$N" as a placeholder, where N is an integer. This corresponds to the order in which the elements in each matching result are presented. Let’s swap the placeholders:
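# same assumed data, placeholders swapped
iex> :ets.match(:mum_faves, {:"$2", 1953, :"$1"})
[[:pop, "Michael Bolton"], [:pop, "Cyndi Lauper"]]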
You can clearly see that the genre comes before the artist name. What if you only cared about returning the artist? You can use an underscore to omit the genre:
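# same assumed data, genre omitted with an underscore
iex> :ets.match(:mum_faves, {:"$1", 1953, :_})
[["Michael Bolton"], ["Cyndi Lauper"]]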
There’s much more to learn about ETS, but this is all the information you need to understand the ETS bits of the code in this book.
When a consumer process checks out a worker from the pool, you need to handle a few key logistical issues:
- What is the pid of the consumer process?
- Which worker pid is the consumer process using?
The consumer process needs to be monitored by the server because if it dies, the server process must know about it and take recovery action. Once again, you aren’t implementing the recovery code yet; you’re just laying the groundwork.

You also need to know which worker is assigned to which consumer process, so that you can pinpoint which consumer process used which worker pid. The next listing shows the implementation of checking out workers.
Listing 6.8. Checking out a worker (lib/pooly/server.ex)
defmodule Pooly.Server do

  #######
  # API #
  #######

  def checkout do
    GenServer.call(__MODULE__, :checkout)
  end

  #############
  # Callbacks #
  #############

  def handle_call(:checkout, {from_pid, _ref}, %{workers: workers, monitors: monitors} = state) do   #1
    case workers do                                    #2
      [worker|rest] ->
        ref = Process.monitor(from_pid)                #3
        true = :ets.insert(monitors, {worker, ref})    #4
        {:reply, worker, %{state | workers: rest}}

      [] ->
        {:reply, :noproc, state}
    end
  end
end
You use an ETS table to store the monitors. The implementation of the callback function is interesting. There are two cases to handle: either you have workers left that can be checked out, or you don’t. In the latter case, you return {:reply, :noproc, state}, signifying that no processes are available. In most examples about GenServers, you see that the from parameter is ignored:
def handle_call(:checkout, _from, state) do
  # ...
end
In this instance, from is very useful. Note that from is a two-element tuple consisting of the client pid and a tag (a reference). Here, you care only about the pid of the client. You use the client’s pid (from_pid) and have the server process monitor it. Then you take the resulting reference and add it to the ETS table. Finally, the state is updated with one less worker.
You now need to update the init/1 callback, as shown in the next listing, because you’ve introduced a new monitors field to store the ETS table.
Listing 6.9. Storing a reference to the ETS table (lib/pooly/server.ex)
defmodule Pooly.Server do

  #############
  # Callbacks #
  #############

  def init([sup, pool_config]) when is_pid(sup) do
    monitors = :ets.new(:monitors, [:private])                 #1
    init(pool_config, %State{sup: sup, monitors: monitors})    #1
  end
end
The reverse of checking out a worker is (wait for it) checking in a worker. The implementation shown in the next listing is the reverse of listing 6.8.
Listing 6.10. Checking in a worker (lib/pooly/server.ex)
defmodule Pooly.Server do

  #######
  # API #
  #######

  def checkin(worker_pid) do
    GenServer.cast(__MODULE__, {:checkin, worker_pid})
  end

  #############
  # Callbacks #
  #############

  def handle_cast({:checkin, worker}, %{workers: workers, monitors: monitors} = state) do
    case :ets.lookup(monitors, worker) do
      [{pid, ref}] ->
        true = Process.demonitor(ref)
        true = :ets.delete(monitors, pid)
        {:noreply, %{state | workers: [pid|workers]}}

      [] ->
        {:noreply, state}
    end
  end
end
Given a worker pid (worker), the entry is searched for in the monitors ETS table. If an entry isn’t found, nothing is done. If an entry is found, then the consumer process is de-monitored, the entry is removed from the ETS table, and the workers field of the server state is updated with the addition of the worker’s pid.
You want to have some insight into your pool. That’s simple enough to implement, as the following listing shows.
Listing 6.11. Getting the status of the pool (lib/pooly/server.ex)
defmodule Pooly.Server do

  #######
  # API #
  #######

  def status do
    GenServer.call(__MODULE__, :status)
  end

  #############
  # Callbacks #
  #############

  def handle_call(:status, _from, %{workers: workers, monitors: monitors} = state) do
    {:reply, {length(workers), :ets.info(monitors, :size)}, state}
  end
end
This gives you some information about the number of workers available and the number of checked out (busy) workers.
There’s one last piece to write before you can claim that version 1 is feature complete.[2] Create supervisor.ex in lib/pooly; this is the top-level Supervisor. The full implementation is shown in the next listing.
2 A rare occurrence in the software industry.
Listing 6.12. Top-level Supervisor (lib/pooly/supervisor.ex)
defmodule Pooly.Supervisor do
  use Supervisor

  def start_link(pool_config) do
    Supervisor.start_link(__MODULE__, pool_config)
  end

  def init(pool_config) do
    children = [
      worker(Pooly.Server, [self, pool_config])
    ]

    opts = [strategy: :one_for_all]

    supervise(children, opts)
  end
end
As you can see, the structure of Pooly.Supervisor is similar to Pooly.WorkerSupervisor. The start_link/1 function takes pool_config. The init/1 callback receives the pool configuration.
The children list consists of Pooly.Server. Recall that Pooly.Server.start_link/2 takes two arguments: the pid of the top-level Supervisor process (the one you’re working on now) and the pool configuration.

What about the worker Supervisor? Why aren’t you supervising it? Because the server process is the one that starts the worker Supervisor, it isn’t included here at first.
The restart strategy you use here is :one_for_all. Why not, say, :one_for_one? Think about it for a moment. What happens when the server crashes? It loses all of its state. When the server process restarts, the state is essentially a blank slate. Therefore, the state of the server is inconsistent with the actual pool state.
What happens if the worker Supervisor crashes? The pid of the worker Supervisor will be different, along with the worker processes. Once again, the state of the server is inconsistent with the actual pool state.
There’s a dependency between the server process and the worker Supervisor. If either goes down, it should take the other down with it—hence the :one_for_all restart strategy.
Create a file called pooly.ex in lib. You’ll be creating an OTP application, which serves as an entry point to Pooly. It will also contain convenience functions such as start_pool/1, so that clients can call Pooly.start_pool/1 instead of Pooly.Server.start_pool/1. First, add the code in the following listing to pooly.ex.
Listing 6.13. Pooly application (lib/pooly.ex)
defmodule Pooly do
  use Application

  def start(_type, _args) do
    pool_config = [mfa: {SampleWorker, :start_link, []}, size: 5]
    start_pool(pool_config)
  end

  def start_pool(pool_config) do
    Pooly.Supervisor.start_link(pool_config)
  end

  def checkout do
    Pooly.Server.checkout
  end

  def checkin(worker_pid) do
    Pooly.Server.checkin(worker_pid)
  end

  def status do
    Pooly.Server.status
  end
end
Pooly uses an OTP Application behavior. What you’ve done here is specify start/2, which is called first when Pooly is initialized. You predefine a pool configuration and a call to start_pool/1 out of convenience.
To take Pooly for a spin, first open mix.exs and modify application/0 so that Pooly starts as an OTP application:
defmodule Pooly.Mixfile do
  use Mix.Project

  def project do
    [app: :pooly,
     version: "0.0.1",
     elixir: "~> 1.0",
     build_embedded: Mix.env == :prod,
     start_permanent: Mix.env == :prod,
     deps: deps]
  end

  def application do
    [applications: [:logger],
     mod: {Pooly, []}]              #1
  end

  defp deps do
    []
  end
end
Save the file, then start an iex session with the project loaded:

% iex -S mix
Fire up Observer:
iex> :observer.start
Select the Applications tab and you’ll see something similar to figure 6.8.
Let’s start by killing a worker. (I hope you aren’t reading this book aloud!) You can do this by right-clicking a worker process and selecting Kill Process, as shown in figure 6.9.
The Supervisor spawns a new worker in the killed process’s place (see figure 6.10). More important, the crash/exit of a single worker doesn’t affect the rest of the supervision tree. In other words, the crash of that single worker is isolated to that worker and doesn’t affect anything else.
Now, what happens if you kill Pooly.Server? Once again, right-click Pooly.Server and select Kill Process, as shown in figure 6.11.
This time, all the processes are killed and the top-level Supervisor restarts all of its child processes (see figure 6.12). Why does killing Pooly.Server cause everything under the top-level Supervisor to die? The mere description of the effect should yield an important clue. What’s the restart strategy of the top-level Supervisor?
Let’s jolt your memory a little:
defmodule Pooly.Supervisor do
  def init(pool_config) do
    # ...
    opts = [strategy: :one_for_all]
    supervise(children, opts)
  end
end
The :one_for_all restart strategy explains why killing Pooly.Server brings down (and restarts) the rest of the children.
Exercises

1. What happens when you kill the WorkerSupervisor process in Observer? Can you explain why that happens?

2. Play around with the various shutdown and restart values. For example, in Pooly.WorkerSupervisor, try changing opts from

opts = [strategy: :simple_one_for_one, max_restarts: 5, max_seconds: 5]

to a much smaller restart budget, such as max_restarts: 0. Next, try changing worker_opts from

worker_opts = [restart: :permanent, function: f]

to restart: :temporary, and observe what happens when a worker exits. Remember to set the options back to their original values afterward.
Summary

In this chapter, you learned about the following:
- OTP Supervisor behavior
- Supervisor restart strategies
- Using ETS to store state
- How to construct Supervisor hierarchies, both static and dynamic
- The various Supervisor and child specification options
- Implementing a basic worker-pool application
You’ve seen how, by using different restart strategies, the Supervisor can dictate how its children restart. More important, depending again on the restart strategy, the Supervisor can isolate crashes to only the process affected.
Even though the first version of Pooly is simple, it allowed you to experiment with constructing both static and dynamic supervision hierarchies. In the former case, you declared in the supervision specification of Pooly.Supervisor that Pooly.Server is to be supervised. In the latter case, Pooly.WorkerSupervisor is only added to the supervision tree when Pooly.Server is initialized.
In the following chapter, you’ll continue to evolve the design of Pooly while adding more features. At the same time, you’ll explore more advanced uses of Supervisor.