chapter seven

7 Cloud operations

This chapter covers

How to manage incidents
Health monitoring and alerts
Governance and usage tracking

So far in this book, we have focused on the happy path of our cloud-native setup: the signal sources, the telemetry, and the destinations where we store, query, visualize, and interact with the signals to understand and influence the system. In this chapter, we discuss an aspect of cloud-native solutions I call cloud operations, which spans several topics you will likely come across, especially in an operations role.

We start off with incidents: how to detect when something is not working the way it should, react to abnormal behavior, and learn from previous mistakes. Then, we focus on alerts, or alarms, that is, the automated process of checking for a condition and informing someone responsible for a service or a piece of infrastructure about it. (I’m using these terms interchangeably here, though an alert can be a triggered condition, and an alarm can potentially cover multiple alerts.) In the final part of this chapter, we talk about usage tracking, be that what your internal or external users access or the costs of the resources you’re using to provide a cloud-native app.
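To make the idea of an alert as "check a condition, then inform whoever is responsible" concrete, here is a minimal sketch in Python. It is not tied to any particular alerting tool; the `AlertRule` class, the `evaluate` function, the metric names, the thresholds, and the `team-checkout` owner are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AlertRule:
    """A hypothetical alert rule: a named condition plus who to inform."""
    name: str
    condition: Callable[[Dict[str, float]], bool]  # returns True when the alert should fire
    owner: str


def evaluate(rules: List[AlertRule], metrics: Dict[str, float]) -> List[str]:
    """Check every rule against the current metrics.

    Returns one notification message per triggered rule.
    """
    notifications = []
    for rule in rules:
        if rule.condition(metrics):
            notifications.append(f"notify {rule.owner}: {rule.name} triggered")
    return notifications


# Two illustrative rules; the thresholds are assumptions, not recommendations.
rules = [
    AlertRule("HighErrorRate", lambda m: m["error_rate"] > 0.05, "team-checkout"),
    AlertRule("HighLatency", lambda m: m["p99_latency_s"] > 1.0, "team-checkout"),
]

# Made-up sample values standing in for real telemetry.
sample = {"error_rate": 0.08, "p99_latency_s": 0.4}
for message in evaluate(rules, sample):
    print(message)
```

With the sample values above, only the error-rate rule fires. In a real setup, the condition would typically be a query against a metrics backend and the notification would go to a paging or chat system rather than being printed.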

7.1 Incident management

7.1.1 Health and performance monitoring

7.1.2 Handling the incident

7.1.3 Learning from the incident after the fact

7.2 Alerting

7.2.1 Prometheus alerting

7.2.2 Using Grafana for alerting

7.2.3 Cloud providers

7.3 Usage tracking

7.3.1 Users

7.3.2 Costs

Summary