6 Resilience: solving application-networking challenges

This chapter covers:

The importance of resiliency
Client-side load balancing
Retries / Budgets / Timeouts
Circuit breaking and bulkheads
Advice for migration from application libraries used for resilience

Once traffic is coming into our cluster through the Istio ingress gateway (covered in Chapter 4), we can manipulate it at the request level and control exactly which versions, or "subsets", of a service certain requests should go to. In the previous chapter, we used this traffic control for weighted routing, request-match based routing, and the release patterns those two techniques enable. We can also use it to route around problems in the event of application errors, network partitions, and other major issues.

The problem with distributed systems is that they often fail in unpredictable ways, and we cannot rely on shifting traffic manually each time they do. What we need is a way to build sensible behaviors into our applications so that they can respond on their own when they encounter problems. With Istio we can add behaviors such as timeouts, retries, and circuit breaking without having to alter application code. In this chapter we’ll take a look at how to do this and at the implications for the rest of the system.
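To make "sensible behaviors" concrete, here is a minimal sketch of the kind of retry-and-timeout logic an application would otherwise have to carry itself. It is illustrative only and is not Istio's implementation: the function name call_with_retries and all of its parameters are hypothetical, and it uses exponential backoff as an assumed delay strategy. With Istio, comparable policies are applied by the service proxy outside the application process.

```python
import time


def call_with_retries(operation, attempts=3, timeout_seconds=1.0,
                      backoff_seconds=0.1, clock=time.monotonic,
                      sleep=time.sleep):
    """Call operation() until it succeeds, attempts run out, or the deadline passes.

    Hypothetical helper for illustration. The deadline is only checked
    between attempts; it does not interrupt an attempt already in flight.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    deadline = clock() + timeout_seconds
    last_error = None

    for attempt in range(attempts):
        # Stop retrying once the overall time allowance is used up.
        if clock() >= deadline:
            raise TimeoutError("overall deadline exceeded") from last_error
        try:
            return operation()
        except Exception as error:  # a real client would retry only selected errors
            last_error = error
            # Back off before the next attempt, doubling the delay each time.
            if attempt < attempts - 1:
                sleep(backoff_seconds * (2 ** attempt))

    # Every attempt failed: surface the most recent error to the caller.
    raise last_error
```

Every service that calls another service needs logic like this, written and tuned consistently in each language and framework in use. Moving that burden out of the application is the motivation for the rest of the chapter.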

6.1 Building resilience into the application

6.1.1 Building resilience into application libraries

6.1.2 Using Istio to solve these problems

6.1.3 Decentralized implementation of resilience

6.1.4 Client-side load balancing

6.2 Locality aware load balancing

6.2.1 Hands on with locality load balancing

6.2.2 More control over locality load balancing

6.3 Transparent timeouts and retries

6.3.1 Timeouts

6.3.2 Retries

6.3.3 Advanced retries

6.4 Circuit breaking with Istio

6.4.1 Guarding against slow services with connection pool control

6.4.2 Guarding against unhealthy services with outlier detection

6.5 Summary