Istio multi-cluster traffic: Deeper look and debugging
I’ve gotten a lot of great feedback on my previous article, in which I covered how to use istioctl to quickly debug the Envoy configuration that Istio distributes to the sidecars. In this follow-up article, I cover how to debug multi-cluster routing, and how Istio joins workloads across two independent clusters to enable higher-value use cases like failover.
As an example, we will deploy a frontend Deployment and Service on both cluster1 and cluster2. Then, we will use the Istio ingress gateway on cluster1 to route traffic to the frontend application on both clusters. We want traffic to stay on cluster1 but fail over to cluster2 if the local frontend fails.
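For reference, the frontend Deployment and Service could look roughly like the sketch below, applied identically to both clusters. The `web-team` namespace, the `app: frontend` label, the image name, and container port 8080 are assumptions for illustration, not the exact manifests used in this setup.

```yaml
# Sketch only: namespace, labels, and image are hypothetical placeholders.
apiVersion: v1
kind: Namespace
metadata:
  name: web-team
  labels:
    istio-injection: enabled       # assumption: sidecar injection via namespace label
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
  namespace: web-team
spec:
  replicas: 1
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend              # label the global ServiceEntry can select on
    spec:
      containers:
      - name: frontend
        image: example.com/frontend:1.0   # hypothetical image
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: frontend
  namespace: web-team
spec:
  selector:
    app: frontend
  ports:
  - name: http
    port: 80
    targetPort: 8080
```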
Let’s check the frontend pod running in each cluster and note its pod IP address.
There are a few different patterns for deploying Istio multi-cluster. For security and scale reasons that I won’t cover in this post, I recommend the multi-primary, multi-network deployment model, with east-west gateways for inter-cluster traffic. An east-west gateway is just another Istio ingress gateway, dedicated to east-west traffic.
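In Istio’s multi-network samples, the east-west gateway exposes services to other networks with a Gateway resource in AUTO_PASSTHROUGH mode on port 15443, which forwards the mTLS connection based on its SNI without terminating it. A minimal sketch adapted from that pattern follows. The `istio-gateways` namespace and the extra `*.mesh` host entry (so that the custom frontend.mycompany.mesh hostname is also exposed) are assumptions.

```yaml
# Sketch of an east-west Gateway, adapted from Istio's multi-network samples.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: cross-network-gateway
  namespace: istio-gateways      # assumption: where the east-west gateway runs
spec:
  selector:
    istio: eastwestgateway       # label on the east-west gateway pods
  servers:
  - port:
      number: 15443
      name: tls
      protocol: TLS
    tls:
      mode: AUTO_PASSTHROUGH     # route on SNI, pass the mTLS connection through
    hosts:
    - "*.local"                  # in-cluster Kubernetes service hostnames
    - "*.mesh"                   # assumption: also expose custom *.mesh hostnames
```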
Required Istio Resources for routing and failover
Next, let’s cover the Istio configuration we need to achieve our desired use case.
- ServiceEntry – Lets us define a globally addressable hostname, such as “frontend.mycompany.mesh”. Mesh applications can use this hostname to get global service routing.
- VirtualService – Routes traffic from the ingress gateway to the “frontend.mycompany.mesh” destination defined in the ServiceEntry.
- DestinationRule – Defines the failover and outlier detection conditions.
- Unified trust – Both clusters have intermediate CA certificates for Istio that are signed by a common root.
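A minimal sketch of the first two resources, as they might be applied on cluster1, is shown here. The resource names, the `web-team` namespace, the `app: frontend` selector label, the `istio-gateways/frontend-gateway` Gateway reference, and the 8080 target port are illustrative assumptions.

```yaml
# Sketch: a global hostname backed by workloads labeled app=frontend (hypothetical label).
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: frontend-global              # hypothetical name
  namespace: web-team                # hypothetical namespace
spec:
  hosts:
  - frontend.mycompany.mesh
  location: MESH_INTERNAL            # the endpoints are mesh workloads with sidecars
  resolution: STATIC                 # endpoints come from selected pods and WorkloadEntries
  ports:
  - number: 80
    name: http
    protocol: HTTP
    targetPort: 8080                 # assumption: the frontend container listens on 8080
  workloadSelector:
    labels:
      app: frontend
---
# Sketch: send requests arriving at the ingress gateway to the global hostname.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: frontend-route               # hypothetical name
  namespace: istio-gateways
spec:
  hosts:
  - "*"                              # assumption: match any Host header
  gateways:
  - istio-gateways/frontend-gateway  # hypothetical Gateway bound to the ingress gateway
  http:
  - route:
    - destination:
        host: frontend.mycompany.mesh
        port:
          number: 80
```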
What does Istio do with these resources?
As I covered in my previous blog, Istio reads these resources, converts them to Envoy configuration, and sends that configuration to the sidecars and gateways. At a high level, this Envoy configuration breaks down into listeners, routes, clusters, and endpoints. A listener tells Envoy to bind to a specific port and uses filters to tell it what to do with incoming traffic, such as handing HTTP requests to a route configuration. A route has a list of domains that, if matched, map the request to an Envoy cluster. An Envoy cluster is made up of a set of IP addresses, which are the endpoints that back that Envoy cluster.
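To make that chain concrete, here is a hand-written static Envoy configuration with the same shape. It is a sketch, not what Istio generates: Istio delivers these pieces dynamically over xDS and adds mTLS, and the listener name, port, localities, and 192.0.2.x documentation addresses are placeholders. Only the cluster name follows Istio’s `outbound|<port>||<host>` convention seen elsewhere in this post.

```yaml
# Sketch of the listener -> route -> cluster -> endpoints chain as static
# Envoy config. Istio sends the equivalent dynamically over xDS instead.
static_resources:
  listeners:
  - name: http_8080                                # placeholder listener name
    address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:                            # the "route" step
            name: http.8080
            virtual_hosts:
            - name: frontend
              domains: ["*"]                       # matched against the Host header
              routes:
              - match: { prefix: "/" }
                route: { cluster: "outbound|80||frontend.mycompany.mesh" }
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: "outbound|80||frontend.mycompany.mesh"   # the "cluster" step
    type: STATIC
    connect_timeout: 5s
    load_assignment:                               # the "endpoints" step
      cluster_name: "outbound|80||frontend.mycompany.mesh"
      endpoints:
      - locality: { region: us-east1, zone: us-east1-b }   # placeholder local locality
        priority: 0
        lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: 192.0.2.10, port_value: 8080 }
      - locality: { region: us-west1, zone: us-west1-a }   # placeholder remote locality
        priority: 1
        lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: 192.0.2.20, port_value: 8080 }
```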
Let’s take a look at the Envoy cluster configuration of the ingress gateway on cluster1.
From this output, you can see that the ingress gateway knows about the “frontend.mycompany.mesh” hostname, as well as the endpoint IP addresses that back that hostname.
There are two endpoints for frontend.mycompany.mesh, one for each cluster’s frontend.
If you’re using Gloo Mesh, these ServiceEntry and WorkloadEntry objects are generated automatically and kept up to date.
Locality Load Balancing
How does Envoy know to prioritize sending traffic to the local endpoint, and not the remote? Let’s look at the same output in more detail:
```shell
istioctl pc endpoints istio-ingressgateway-54fdcf555d-v62p8.istio-gateways --cluster "outbound|80||frontend.mycompany.mesh" -o yaml
```
Notice the priority field. The first endpoint has no priority defined, which means its priority is 0. The second endpoint’s priority is 1. Envoy load balances all traffic to the highest priority, which is the lowest number. If those endpoints become unhealthy, Envoy shifts traffic to priority 1.
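In this setup, it is outlier detection that ejects a failing endpoint, and Istio’s documentation notes that outlier detection must be configured for locality failover to take effect. That is the job of the DestinationRule. A minimal sketch follows, where the name, namespace, and thresholds are illustrative assumptions rather than recommended values.

```yaml
# Sketch: locality-aware load balancing plus ejection of failing endpoints.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: frontend-failover          # hypothetical name
  namespace: istio-gateways        # assumption: alongside the VirtualService
spec:
  host: frontend.mycompany.mesh
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true              # prefer endpoints in the caller's locality
    outlierDetection:
      consecutive5xxErrors: 1      # illustrative: eject after a single 5xx
      interval: 5s                 # how often endpoints are evaluated
      baseEjectionTime: 30s        # minimum time an ejected endpoint stays out
      maxEjectionPercent: 100      # allow ejecting every endpoint in the pool
```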
Next, note the locality field. For local endpoints, the locality is picked up from the topology labels (topology.kubernetes.io/region and topology.kubernetes.io/zone) on the Kubernetes nodes. For remote endpoints, this information should be provided in the ServiceEntry or WorkloadEntry.
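For example, a WorkloadEntry for the remote frontend on cluster2 could carry its locality and network as in the sketch below. The name, namespace, address, network name, locality, and service account are all placeholders, and the sketch assumes a ServiceEntry that selects workloads labeled `app: frontend`. When `network` names a different network than the caller’s, Istio’s multi-network support should send the traffic through that network’s east-west gateway rather than dialing the address directly.

```yaml
# Sketch: remote frontend endpoint on cluster2 (all values are placeholders).
apiVersion: networking.istio.io/v1beta1
kind: WorkloadEntry
metadata:
  name: frontend-cluster2          # hypothetical name
  namespace: web-team              # same namespace as the ServiceEntry that selects it
spec:
  address: 192.0.2.20              # placeholder for the cluster2 frontend pod IP
  network: network2                # assumption: network name assigned to cluster2
  locality: us-west1/us-west1-a    # assumption: cluster2's region/zone
  serviceAccount: frontend         # assumption: identity used for mTLS
  labels:
    app: frontend                  # matched by the ServiceEntry workloadSelector
  ports:
    http: 8080                     # ServiceEntry port name -> endpoint port
```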
Access Logging and Stats
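The gateway only writes access logs like the ones below if Envoy access logging is enabled in the mesh. If it is not, one option is a mesh-wide Telemetry resource such as this sketch, which assumes your Istio version supports the Telemetry API, that `istio-system` is the mesh root namespace, and that the built-in `envoy` provider is used; the resource name is illustrative.

```yaml
# Sketch: enable Envoy access logs for the whole mesh.
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system   # assumption: the root namespace, so this applies mesh-wide
spec:
  accessLogging:
  - providers:
    - name: envoy           # built-in provider that logs to the proxy's stdout
```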
```
[2022-11-09T00:00:50.336Z] "GET / HTTP/1.1" 200 - via_upstream - "-" 0 11107 81 80 "10.142.0.66" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/184.108.40.206 Safari/537.36" "36912637-ba06-41e4-97a2-07cddd73f849" "220.127.116.11" "10.112.0.68:8080" outbound|80||frontend.mycompany.mesh 10.112.0.50:41008 10.112.0.50:8080 10.142.0.66:16973 - frontend-frontend.web-team.mgmt-cluster
[2022-11-09T00:06:19.393Z] "GET / HTTP/1.1" 200 - via_upstream - "-" 0 11105 1937 1936 "10.142.0.66" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/18.104.22.168 Safari/537.36" "71fe6441-30c4-4e5d-8dd7-5b72ef66351d" "22.214.171.124" "126.96.36.199:15443" outbound|80||frontend.mycompany.mesh 10.112.0.50:44894 10.112.0.50:8080 10.142.0.66:28281 - frontend-frontend.web-team.mgmt-cluster
```
Both requests went through the outbound|80||frontend.mycompany.mesh Envoy cluster, and the quoted field just before the cluster name is the upstream host that served each one. The first request went to 10.112.0.68:8080, the local frontend pod. The second went to port 15443, the port the east-west gateway uses for cross-network traffic, so it was served by the frontend on cluster2.
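You can also look at the Envoy stats for each endpoint behind the frontend.mycompany.mesh cluster:
```shell
istioctl x envoy-stats istio-ingressgateway-54fdcf555d-v62p8.istio-gateways --type clusters | grep frontend.mycompany.mesh
```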
This gives you a detailed breakdown of every endpoint and its metadata, along with stats such as how many requests it has received and how many succeeded or failed. This is extremely useful when you see intermittent errors because one of your endpoints is not functioning well.
I hope you found this helpful. At Solo, we have customers running some of the largest service mesh and Istio deployments in the world. We have an amazing team of field engineers who are experts in Envoy and Istio and work closely with our customers every day.
If you have any questions, reach out to our experts!