12 Uncovering system-level failures and service communication issues
This chapter covers
- Troubleshooting failures in multi-service Java systems
- Investigating common pitfalls in REST, gRPC, and messaging
- Untangling serialization and versioning issues between services
- Investigating cascading failures, retries, and timeout problems
“Why is the payment service down?”
“Because the email service is slow.”
“...What?”
In a system of services, failure is a team sport, and you may not even be invited to the game. One service times out, another starts retrying furiously, and suddenly your logs are full of errors from a completely unrelated module. The challenge is that problems rarely stay local; they echo through the system, bouncing off APIs, queues, and unsuspecting services that were just minding their own business. By the time you join the debugging party, half the system is on fire, and no one remembers who lit the match.
Let me tell you about the time the user profile service refused to start. After a good deal of digging, we discovered that it was waiting on a downstream dependency that had nothing to do with user profiles. That dependency was, in turn, waiting on a message from a service that had problems deploying. This was the software equivalent of a group of friends refusing to order pizza until someone who wasn’t even at the party showed up.