When you launch a system into production, you’re often paranoid and completely ill-equipped to understand all of the ways your system might break. You spend a lot of time creating alarms for all the nightmare scenarios you can think of. The problem is that this generates a lot of noise in your alerting system, and that noise quickly gets ignored and treated as part of the normal rhythm of the business. This pattern, called alert fatigue, can lead your team to serious burnout.
This chapter focuses on on-call life for teams and how best to set them up for success. I detail what a good on-call alert looks like, how to manage documentation on resolving issues, and how to structure daytime duties for team members who are on call for the week. Later in the chapter, I turn to tasks that are more management-focused, specifically tracking on-call load, staffing appropriately for on-call work, and structuring compensation.