3 Operational blindness

 

This chapter covers

  • Making changes in operations functions
  • Creating useful system metrics for your application
  • Creating useful logging habits

When you launch a system, you expect it to perform a set of tasks, in a certain order, with a few expected results. Sometimes you expect an error along the way, and you’ll need to perform some sort of cleanup around that error. But the complexity of getting the system to work in the best of times leaves a lot of room for improvement in the way the system performs in the worst of times.

The tools that would confirm that work is happening the way you expected often get omitted, leaving you with no clear view of what’s happening in your system. Instead, teams rely on easily obtained metrics that offer no real business context for how the system is performing. You have generic performance numbers, but you’re effectively blind from an operational viewpoint. This operational blindness prevents you from making good decisions about your system.

3.1 War stories

It’s the middle of the day when a notification goes off in the operations group. Almost in lockstep with that page, emails and instant messages begin firing off. People begin popping up from their desks, trying to figure out whether the notification reached only their computer or whether something larger is going on. The website is down. The external monitoring of the website failed its last three health checks, which triggered the alert.
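The alert in this story fires after three failed health checks in a row. As a rough illustration of that kind of rule, here is a minimal sketch in Python. The class name `HealthCheckMonitor`, its fields, and the sample results are hypothetical and not taken from any particular monitoring product. The threshold of three mirrors the story and is an assumption about how such a monitor might be configured.

```python
from dataclasses import dataclass


@dataclass
class HealthCheckMonitor:
    """Tracks consecutive failed checks and signals when an alert should fire.

    Hypothetical helper for illustration only.
    """

    # Assumption: three consecutive failures, as in the story above.
    failure_threshold: int = 3
    consecutive_failures: int = 0
    alerting: bool = False

    def record(self, check_passed: bool) -> bool:
        """Record one check result; return True only when a new alert fires."""
        if check_passed:
            # Any passing check resets the streak and clears the alert state.
            self.consecutive_failures = 0
            self.alerting = False
            return False

        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold and not self.alerting:
            # Fire once per outage rather than on every failed check.
            self.alerting = True
            return True
        return False


monitor = HealthCheckMonitor()
# Made-up sequence of check results: one pass, four failures, one recovery.
results = [True, False, False, False, False, True]
fired = [monitor.record(result) for result in results]
print(fired)
```

Notice what this rule can and cannot tell you. It reports that the site stopped answering, but nothing about why, or about what the business was doing at the time. That gap is the operational blindness this chapter is concerned with.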

3.2 Changing the scope of development and operations

3.3 Understanding the product

3.4 Creating operational visibility

3.4.1 Creating custom metrics

3.4.2 Deciding what to measure

3.4.3 Defining healthy metrics

3.4.4 Failure mode and effects analysis

3.5 Making logging useful

3.5.1 Log aggregation

3.5.2 What should I be logging?

3.5.3 The hurdles of log aggregation

Summary