12 Uncovering system-level failures and service communication problems

This chapter covers

  • Troubleshooting failures in multiservice Java systems
  • Investigating common pitfalls in REST, gRPC, and messaging
  • Untangling serialization and versioning problems between services
  • Diagnosing cascading failures, retry storms, and timeout mismatches

“Why is the payment service down?”
“Because the email service is slow.”
“What?”

In a system of services, failure is a team sport, and you may not even be invited to the game. One service times out, another starts retrying furiously, and suddenly, your logs are full of errors from a completely unrelated module. The challenge is that problems rarely stay local; they echo through the system, bouncing off APIs, queues, and unsuspecting services that were just minding their own business. By the time you join the debugging party, half the system is on fire, and no one remembers who lit the match.
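The "retrying furiously" part of that story is worth pausing on, because it is easy to reproduce: a naive client retries immediately on every failure, multiplying the load on an already slow dependency. A common remedy is capped exponential backoff between attempts. The sketch below is illustrative only; the class and method names are made up for this example and don't come from any particular framework.

```java
// A minimal sketch of why immediate retries amplify load, and how capped
// exponential backoff spaces attempts out instead. All names here are
// hypothetical, chosen for illustration.
public class RetryBackoff {

    // Naive retry: each failure triggers a new request back to back, so a
    // dependency that fails N times receives N+1 requests with no pause.
    static int naiveRequestCount(int failures) {
        return failures + 1;
    }

    // Exponential backoff: the delay doubles on each attempt and is capped
    // at maxMillis so a long outage doesn't produce absurd waits.
    static long backoffDelayMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis * (1L << Math.min(attempt, 20)); // shift guard avoids overflow
        return Math.min(delay, maxMillis);
    }

    public static void main(String[] args) {
        // With a 100 ms base and a 2 s cap, the delays grow 100, 200, 400,
        // 800, 1600, then stay pinned at 2000.
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println("attempt " + attempt + " waits "
                    + backoffDelayMillis(attempt, 100, 2000) + " ms");
        }
    }
}
```

Production libraries add jitter on top of this so that many clients recovering at once don't retry in lockstep; we return to that idea when we look at retry storms in section 12.3.2.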

Let me tell you about the time the user profile service refused to start. After extensive digging, we discovered that the service was waiting on a downstream dependency that had nothing to do with user profiles. That dependency was, in turn, waiting on a message from a service that was failing to deploy. This was the software equivalent of a group of friends refusing to order pizza until someone who wasn’t even invited to the party showed up.

12.1 Troubleshooting communication patterns: RPC and messaging

12.1.1 Working with trace IDs and spans

12.1.2 OpenTelemetry, Jaeger, Zipkin, and other utilities

12.2 Serialization mismatches and versioning problems

12.3 Understanding systemic failure modes

12.3.1 Cascading failures

12.3.2 Retry storms

12.3.3 Timeout mismatches

Summary