11 Building a monitoring system

This chapter covers

Understanding what signals to gather from running applications
Building a monitoring system to collect metrics
Learning how to use the collected signals to set up alerts
Observing the behavior of individual services and their interactions as a system

You’ve now set up an infrastructure to run your services and deployed multiple components that you can combine to provide functionality to your users. In this chapter and the next, we’ll consider how you can always know how those components are interacting and how the infrastructure is behaving. It’s fundamental to know as early as possible when something isn’t behaving as expected. In this chapter, we’ll focus on building a monitoring system so you can collect relevant metrics, observe the system’s behavior, and set up alerts that let you keep your systems running smoothly by acting preemptively. When you can’t be preemptive, you’ll at least be able to quickly pinpoint the areas that need your attention so you can address any issues. You should also instrument as much as possible: data you collect but don’t use today may turn out to be useful someday.

11.1 A robust monitoring stack

11.1.1 Good monitoring is layered

11.1.2 Golden signals

11.1.3 Types of metrics

11.1.4 Recommended practices

11.2 Monitoring SimpleBank with Prometheus and Grafana

11.2.1 Setting up your metric collection infrastructure

11.2.2 Collecting infrastructure metrics — RabbitMQ

11.2.3 Instrumenting SimpleBank’s place order

11.2.4 Setting up alerts

11.3 Raising sensible and actionable alerts

11.3.1 Who needs to know when something is wrong?

11.3.2 Symptoms, not causes

11.4 Observing the whole application

Summary