chapter three

3 Observability

This chapter covers

Diagnosing system performance issues with the Utilization, Saturation, Errors (USE) method
Understanding basic system metrics we can use in chaos experiments
Using Linux tools like top, free, df, vmstat, sar and Berkeley Packet Filter (BPF) to check system metrics
Using the Prometheus time series database to gain continuous insight into system performance

Strap in. We’re about to tackle one of the more annoying situations you’ll face when practicing chaos engineering: the infamous “my app is slow” complaint. If the software in question went through all the stages of development and made it to production, chances are that it passed a decent number of tests and that multiple people signed off on it. If later on, for no obvious reason, it begins to slow down, we’re probably in for a long day at work. “My app is slow” is a much subtler problem than an ordinary “my app doesn’t work” and can be rather tricky to debug. In this chapter you’ll learn how to deal with one of the common causes of it: resource contention. We’ll cover the tools you need to detect and analyze this kind of issue.

3.1 The app is slow

3.2 The USE method

3.3 Resources

3.3.1 System overview

3.3.2 Block IO

3.3.3 Networking

3.3.4 RAM

3.3.5 CPU

3.3.6 OS

3.3.7 Other tools

3.4 Application

3.4.1 cProfile

3.4.2 BCC and Python

3.5 Automation: using time series

3.5.1 Prometheus & Grafana

3.6 Further reading

3.7 Summary