8 Fault tolerance basics

This chapter covers

Run-time errors
Errors in concurrent systems
Supervisors

Fault tolerance is a first-class concept in BEAM. The ability to develop reliable systems that can operate even when faced with run-time errors is what brought us Erlang in the first place.

The aim of fault tolerance is to acknowledge the existence of failures, minimize their impact, and, ultimately, recover without human intervention. In a sufficiently complex system, many things can go wrong. Occasional bugs will happen, components you’re depending on may fail, and you may experience hardware failures. A system may also become overloaded and fail to cope with an increased incoming request rate. Finally, if a system is distributed, you can experience additional issues, such as a remote machine becoming unavailable, perhaps due to a crash or a broken network link.

It’s hard to predict everything that can go wrong, so it’s better to face the harsh reality that anything can fail. Regardless of which part of the system happens to fail, it shouldn’t take down the entire system; you want to be able to provide at least some service. For example, if the database server becomes unreachable, you can still serve data from the cache. You might even queue incoming store requests and try to resolve them later, when the connection to the database is reestablished.
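To make the cache-and-queue idea concrete, here is a minimal sketch of degraded service, under stated assumptions. The `DegradedStore` module and the `db_get` and `db_put` functions are hypothetical names invented for this illustration; they stand in for a real database client and are assumed to return `{:error, reason}` when the database is unreachable. This is one possible shape, not a definitive implementation.

```elixir
defmodule DegradedStore do
  # Hypothetical illustration of serving from a cache and queueing writes.
  # Assumptions: `db_get.(key)` returns {:ok, value} or {:error, reason};
  # `db_put.(key, value)` returns :ok or {:error, reason}.

  # State holds the cached values and the writes that still need to reach
  # the database (newest first).
  def new, do: %{cache: %{}, pending: []}

  # Read: prefer the database; if it is unreachable, fall back to the cache.
  # Returns {:ok, value}, or :error when the key is in neither place.
  def get(state, key, db_get) do
    case db_get.(key) do
      {:ok, value} -> {:ok, value}
      {:error, _reason} -> Map.fetch(state.cache, key)
    end
  end

  # Write: always refresh the cache; queue the write if the database is down.
  def put(state, key, value, db_put) do
    state = %{state | cache: Map.put(state.cache, key, value)}

    case db_put.(key, value) do
      :ok -> state
      {:error, _reason} -> %{state | pending: [{key, value} | state.pending]}
    end
  end

  # Retry queued writes in arrival order; keep the ones that still fail.
  def flush(state, db_put) do
    still_pending =
      state.pending
      |> Enum.reverse()
      |> Enum.reject(fn {key, value} -> db_put.(key, value) == :ok end)
      |> Enum.reverse()

    %{state | pending: still_pending}
  end
end
```

A caller would invoke `flush/2` once the connection is reestablished. The sketch keeps everything in plain data and pure functions so the fallback logic is easy to follow; it says nothing yet about what happens when the code itself crashes.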

8.1 Run-time errors

8.1.1 Error types

8.1.2 Handling errors

8.2 Errors in concurrent systems

8.2.1 Linking processes

8.2.2 Monitors

8.3 Supervisors

8.3.1 Preparing the existing code

8.3.2 Starting the supervisor process

8.3.3 Child specification

8.3.4 Wrapping the supervisor

8.3.5 Using a callback module

8.3.6 Linking all processes

8.3.7 Restart frequency