3 Observability
This chapter covers
- Diagnosing system performance issues with the Utilization, Saturation, Errors (USE) method
- Understanding basic system metrics we can use in chaos experiments
- Using Linux tools like top, free, df, vmstat, sar and the Berkeley Packet Filter (BPF) to check system metrics
- Using the time series database Prometheus to gain continuous insight into system performance
Strap in. We’re about to tackle one of the more annoying situations you’ll face when practicing chaos engineering: the infamous “my app is slow” complaint. If the piece of software in question went through all the stages of development and made it to production, chances are that it passed a decent number of tests and that multiple people signed off on it. If, later on, it begins to slow down for no obvious reason, that tends to be a sign we’re in for a long day at work. “My app is slow” is a much more subtle problem than an ordinary “my app doesn’t work” and can sometimes be rather tricky to debug. In this chapter you’ll learn how to deal with one of the common reasons for it: resource contention. We will cover the tools necessary to detect and analyze this kind of issue.