chapter eight

8 The reference period

This chapter covers:

Understanding the difference between hot and cold backups
Picking the right reference period
Extracting state changes for analysis

Have you ever had a really bad day? You probably have, but January 31, 2017, was an exceptionally bad day for the people at GitLab, a company that develops a web-based solution for managing code changes. GitLab develops and maintains its own software, but it also hosts a lot of repositories for customers: GitLab is used by more than 100,000 organizations. On that January 31, the team experienced an increased load on their servers, which, because of some side effects, forced them to reset a backup database server (their secondary server). The problem was that the procedure and technique they used weren’t properly documented, and the lack of documentation caused some confusion. When the secondary server didn’t appear to fetch the data, an engineer decided to completely delete the data on it to make sure existing data wasn’t blocking replication. Then it happened: the engineer accidentally ran the command on the primary database server.

Boom! Data was lost from both the primary and secondary database servers. They lost all of their users' hosted data.

8.1 TL;DR

8.2 Availability

8.2.1 Ensuring or improving availability

8.2.2 Setting up the hot backup solution

8.3 Monitoring availability

8.3.1 Defining the quality controls reference period

8.3.2 Two quality controls for metrics

8.3.3 Recording state changes

8.4 Summary