chapter three

3 Operational blindness

This chapter covers

Changes in operations functions
Creating useful system metrics for your application
Creating useful logging habits

When you launch a system, you expect it to perform a set of tasks in a certain order and produce a few expected results. Sometimes you expect an error in the process, and you need to perform some sort of cleanup around that error. But the effort of getting the system to work in the best of times leaves little attention for how it performs in the worst of times. Tools that confirm the work is happening the way you expected get omitted, leaving you with no clear view of what’s happening in your system. Instead, teams rely on easily obtained metrics that offer no real business context about how the system is performing. You have generic performance numbers, but you’re effectively blind from an operational viewpoint. This operational blindness prevents you from making good decisions about your system.

3.1 War stories

3.2 Changing the scope of development and operations

3.3 Understanding the product

3.4 Creating operational visibility

3.4.1 Creating custom metrics

3.4.2 Defining healthy metrics

3.4.3 Failure Mode and Effects Analysis

3.5 Making logging useful

3.5.1 Log aggregation

3.5.2 What should I be logging?

3.5.3 The hurdles of log aggregation

3.6 Summary