6 Alert fatigue

This chapter covers

  • Using on-call best practices
  • Staffing for on-call rotations
  • Tracking on-call happiness
  • Providing ways to improve the on-call experience

When you launch a system into production, you’re often paranoid and ill-equipped to understand all the ways your system might break. You spend a lot of time creating alarms for every nightmare scenario you can think of. The problem is that these alarms generate a lot of noise in your alerting system, and that noise is quickly ignored and treated as part of the normal rhythm of the business. This pattern, called alert fatigue, can lead your team to serious burnout.

This chapter focuses on on-call life for teams and how best to set them up for success. I detail what a good on-call alert looks like, how to manage documentation for resolving issues, and how to structure daytime duties for team members who are on call for the week. Later in the chapter, I focus on more management-focused tasks, specifically tracking on-call load, staffing appropriately for on-call work, and structuring compensation.

6.1 War story

6.2 The purpose of on-call rotation

6.3 Defining on-call rotations

6.3.1 Time to acknowledge

6.3.2 Time to begin

6.3.3 Time to resolve

6.4 Defining alert criteria

6.4.1 Thresholds

6.4.2 Noisy alerts

6.5 Staffing on-call rotations

6.6 Compensating for being on call

6.6.1 Monetary compensation

6.6.2 Time off

6.6.3 Increased work-from-home flexibility

6.7 Tracking on-call happiness

6.7.1 Who is being alerted?

6.7.2 What level of urgency is the alert?

6.7.3 How is the alert being delivered?

6.7.4 When is the team member being alerted?