chapter three

3 Observability

This chapter covers

Diagnosing system performance issues with the USE method
Understanding basic system metrics used in chaos experiments
Using Linux tools to check system metrics
Using a time-series database to gain continuous insight into system performance

Strap in. We’re about to tackle one of the more annoying situations you’ll face when practicing chaos engineering: the infamous “my app is slow” complaint. If the piece of software in question went through all the stages of development and made it to production, chances are that it passed a decent number of tests and that multiple people signed off. If, later, for no obvious reason, the application begins to slow down, it tends to be a sign we’re in for a long day at work.

“My app is slow” is a much subtler problem than an ordinary “my app doesn’t work” and can be rather tricky to debug. In this chapter, you’ll learn how to deal with one of the common causes of slowness: resource contention. We’ll cover the tools needed to detect and analyze this kind of issue.

3.1 The app is slow

3.2 The USE method

3.3 Resources

3.3.1 System overview

3.3.2 Block I/O

3.3.3 Networking

3.3.4 RAM

3.3.5 CPU

3.3.6 OS

3.4 Application

3.4.1 cProfile

3.4.2 BCC and Python

3.5 Automation: Using time series

3.5.1 Prometheus and Grafana

3.6 Further reading

Summary