12 Building a distributed system

This chapter covers

  • Working with distribution primitives
  • Building a fault-tolerant cluster
  • Network considerations

Now that you have a to-do HTTP server in place, it’s time to make it more reliable. To have a truly reliable system, you need to run it on multiple machines. A single machine represents a single point of failure because a machine crash leads to a system crash. In contrast, with a cluster of multiple machines, a system can continue providing service even when individual machines are taken down.

Moreover, clustering multiple machines lets you scale horizontally. When demand for the system increases, you can add more machines to the cluster to accommodate the extra load. This idea is illustrated in figure 12.1.

In figure 12.1, multiple nodes share the load. If a node crashes, its share of the load is spread across the surviving nodes, and you can continue to provide service. If the load increases, you can add more nodes to the cluster to absorb it. Clients access a well-defined endpoint and are unaware of internal cluster details.

Distributed systems offer significant benefits, and Elixir/Erlang gives you simple yet powerful distribution primitives. The central tools for distributed Erlang-based systems are processes and messages. You can send a message to another process in the same way regardless of whether it’s running in the same BEAM instance or in another instance on a remote machine.
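As a rough illustration of this idea, the following sketch sends a request to a process and waits for the reply. It runs on a single, non-distributed BEAM instance. The message shapes (`{:ping, caller}` and `{:pong, node_name}`) and the node name `node2@localhost` in the trailing comment are made up for this example. They aren't part of the to-do system.

```elixir
# A tiny echo process: it waits for one request and replies to the caller
# with the name of the node it is running on.
echo = fn ->
  receive do
    {:ping, caller} -> send(caller, {:pong, node()})
  end
end

# Local case: the process runs in the current BEAM instance.
pid = spawn(echo)
send(pid, {:ping, self()})

reply =
  receive do
    {:pong, node_name} -> {:pong, node_name}
  after
    1_000 -> :timeout
  end

IO.inspect(reply)

# Remote case (not run here): assuming a connected node with the hypothetical
# name :"node2@localhost", and assuming the function can be executed there,
# only the spawn call would change. The send and receive code stays the same.
#
#   remote_pid = Node.spawn(:"node2@localhost", echo)
#   send(remote_pid, {:ping, self()})
```

The sending side never inspects where the pid lives. The same `send/2` and `receive` code handles both a local and a remote process.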

12.1 Distribution primitives

12.1.1 Starting a cluster

12.1.2 Communicating between nodes

12.1.3 Process discovery

12.1.4 Links and monitors

12.1.5 Other distribution services

12.2 Building a fault-tolerant cluster

12.2.1 Cluster design

12.2.2 The distributed to-do cache

12.2.3 Implementing a replicated database

12.2.4 Testing the system

12.2.5 Detecting partitions

12.2.6 Highly available systems

12.3 Network considerations

12.3.1 Node names

12.3.2 Cookies

12.3.3 Hidden nodes

12.3.4 Firewalls

Summary