chapter six

6 Alert fatigue

This chapter covers

  • On-call best practices
  • Staffing for on-call rotations
  • Tracking on-call happiness
  • Providing ways to improve the on-call experience

When you launch a system into production, you’re often paranoid and ill-equipped to anticipate all the ways it might break. So you spend a lot of time creating alarms for every nightmare scenario you can think of. The problem is that those alarms generate a lot of noise in your alerting system, and that noise quickly gets ignored and treated as part of the normal rhythm of the business. This pattern is called alert fatigue, and it can lead your team to serious burnout.

6.1 War story

6.2 The purpose of on-call

6.3 Defining on-call rotations

6.3.1 Time to acknowledge

6.3.2 Time to begin

6.3.3 Time to resolve

6.4 Defining alert criteria

6.4.1 Thresholds

6.4.2 Noisy alerts

6.5 Staffing for on-call

6.6 How to compensate for on-call

6.6.1 Monetary compensation

6.6.2 Time off

6.6.3 Increased work from home flexibility

6.7 Tracking on-call happiness

6.7.1 Who is being alerted?

6.7.2 What level of urgency is the alert?

6.8 Other on-call tasks

6.8.1 On-call support projects

6.9 Summary