Multi-Cluster Service Mesh Failover and Fallback Routing
In this blog series, we dig into specific challenge areas of multi-cluster Kubernetes and service mesh architecture, along with considerations and approaches for solving them. In our first post, we looked at service discovery; in this post, we'll look at failover and fallback routing across multiple clusters.
What is failover and fallback routing?
When building applications on cloud platforms like Kubernetes (i.e., ones where compute, network, and storage are ephemeral and unreliable), planning for failures isn't just nice to have: it's a prerequisite for these architectures. Instead of only working to prevent failures, implementing a strategy to gracefully handle an unplanned failure is critical to the customer experience and to avoiding potential cascading failures across other dependent services. Microservices architecture exacerbates this, as there are many layers (physical or abstracted infrastructure, applications) and locations (distributed and dynamic) where a failure can happen.
Some useful resilience practices like circuit breaking, retries, and timeouts can be used to mitigate intermittent network failures or misbehaving services, but what happens when there are partial failures of a service across multiple clusters? In other words, some portion of the service dependency is misbehaving, but there may be alternative options. That's where multi-cluster failover and fallback come into play.
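As a concrete illustration of those single-service resilience practices, here is roughly how retries and timeouts can be expressed in Istio with a VirtualService. This is a hedged sketch for a hypothetical `reviews` service and is not part of the demo that follows:

```yaml
# Illustrative sketch: retries and a timeout for a hypothetical "reviews" service.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews-retries
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
    retries:
      attempts: 3        # retry a failed request up to 3 times
      perTryTimeout: 2s  # each attempt gets at most 2 seconds
      retryOn: 5xx       # only retry on server errors
    timeout: 10s         # overall request deadline
```

These settings help with intermittent failures inside one cluster, but they don't answer the cross-cluster questions discussed next.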
Challenges in handling failures across clusters
Things get more complicated as you think about how to handle failover / fallback routing to different clusters. How does a service decide when to failover within a cluster versus outside the cluster? How does a service in one network find a service in another network? What options for automatic failover do you have?
Service Mesh Hub will make your life easier, a lot easier!
Service Mesh Hub is an open-source service mesh management plane that simplifies service-mesh operations and lets you manage multiple clusters of a service mesh from a centralized management plane. Service Mesh Hub takes care of things like shared-trust or root CA federation, workload discovery, unified multi-cluster and global traffic policy, access policy, and failover.
Leveraging the automatic service discovery and global service registry of the virtual mesh, Service Mesh Hub allows admins to apply traffic routing policies at a global level across multiple clusters. The virtual mesh is a trust zone for the clusters within it so that the application traffic between services can traverse across the cluster boundaries.
Service Mesh Hub in action
To demonstrate how Service Mesh Hub handles global failover routing, I’ll walk through an environment and simulation. Here is the environment I’ve prepared.
I’ve deployed 3 Kubernetes clusters (using Kind); one cluster has Service Mesh Hub installed and the other two have Istio installed.
I run the following commands to deploy the Bookinfo app on the first cluster:
```shell
kubectl --context kind-kind2 label namespace default istio-injection=enabled
kubectl --context kind-kind2 apply -f https://raw.githubusercontent.com/istio/istio/1.7.0/samples/bookinfo/platform/kube/bookinfo.yaml -l 'app,version notin (v3)'
kubectl --context kind-kind2 apply -f https://raw.githubusercontent.com/istio/istio/1.7.0/samples/bookinfo/platform/kube/bookinfo.yaml -l 'account'
kubectl --context kind-kind2 apply -f https://raw.githubusercontent.com/istio/istio/1.7.0/samples/bookinfo/networking/bookinfo-gateway.yaml
```
And I check that the app is running using `kubectl --context kind-kind2 get pods`:

```
NAME                              READY   STATUS    RESTARTS   AGE
details-v1-558b8b4b76-w9qp8       2/2     Running   0          2m33s
productpage-v1-6987489c74-54lvk   2/2     Running   0          2m34s
ratings-v1-7dc98c7588-pgsxv       2/2     Running   0          2m34s
reviews-v1-7f99cc4496-lwtsr       2/2     Running   0          2m34s
reviews-v2-7d79d5bd5d-mpsk2       2/2     Running   0          2m34s
```
As you can see, it didn't deploy the `v3` version of the `reviews` microservice.
Then, I run the following commands to deploy the app on the second cluster:
```shell
kubectl --context kind-kind3 label namespace default istio-injection=enabled
kubectl --context kind-kind3 apply -f https://raw.githubusercontent.com/istio/istio/1.7.0/samples/bookinfo/platform/kube/bookinfo.yaml
kubectl --context kind-kind3 apply -f https://raw.githubusercontent.com/istio/istio/1.7.0/samples/bookinfo/networking/bookinfo-gateway.yaml
```
When I refresh the web page several times, I see only the versions `v1` (no stars) and `v2` (black stars), which means that all the requests are handled by the first cluster.
Now, let's create a TrafficPolicy to define outlier detection settings to detect and evict unhealthy hosts for the `reviews` microservice.
```shell
cat << EOF | kubectl --context kind-kind1 apply -f -
apiVersion: networking.smh.solo.io/v1alpha2
kind: TrafficPolicy
metadata:
  namespace: service-mesh-hub
  name: mgmt-reviews-outlier
spec:
  destinationSelector:
  - kubeServiceRefs:
      services:
      - name: reviews
        namespace: default
        clusterName: kind2
      - name: reviews
        namespace: default
        clusterName: kind3
  outlierDetection:
    consecutiveErrors: 1
    interval: 10s
    baseEjectionTime: 2m
EOF
```
Then, I create a FailoverService to define a new hostname (`reviews-failover.default.global`) that will be backed by the `reviews` microservice running on both clusters.
```shell
cat << EOF | kubectl --context kind-kind1 apply -f -
apiVersion: networking.smh.solo.io/v1alpha2
kind: FailoverService
metadata:
  name: reviews-failover
  namespace: service-mesh-hub
spec:
  hostname: reviews-failover.default.global
  port:
    number: 9080
    protocol: http
  meshes:
  - name: istiod-istio-system-kind2
    namespace: service-mesh-hub
  backingServices:
  - kubeService:
      name: reviews
      namespace: default
      clusterName: kind2
  - kubeService:
      name: reviews
      namespace: default
      clusterName: kind3
EOF
```
Finally, I can define another TrafficPolicy to make sure all the requests for the `reviews` microservice on the local cluster will be handled by the FailoverService we've just created.
```shell
cat << EOF | kubectl --context kind-kind1 apply -f -
apiVersion: networking.smh.solo.io/v1alpha2
kind: TrafficPolicy
metadata:
  name: reviews-shift-failover
  namespace: default
spec:
  destinationSelector:
  - kubeServiceRefs:
      services:
      - clusterName: kind2
        name: reviews
        namespace: default
  trafficShift:
    destinations:
    - failoverServiceRef:
        name: reviews-failover
        namespace: service-mesh-hub
EOF
```
I'm doing a port-forward to access the Envoy API of the sidecar proxy running in the `productpage` Pod. It allows me to get some stats about the number and status of the requests sent by the `productpage` to the `reviews` microservice before the failover:
```
# Req   Resp   Source principal                                    Target principal                                Target version
5987    200    spiffe://kind2/ns/default/sa/bookinfo-productpage   spiffe://kind2/ns/default/sa/bookinfo-reviews   v1
14      503    spiffe://kind2/ns/default/sa/bookinfo-productpage   spiffe://kind2/ns/default/sa/bookinfo-reviews   v1
5943    200    spiffe://kind2/ns/default/sa/bookinfo-productpage   spiffe://kind2/ns/default/sa/bookinfo-reviews   v2
3       503    spiffe://kind2/ns/default/sa/bookinfo-productpage   spiffe://kind2/ns/default/sa/bookinfo-reviews   v2
1044    200    spiffe://kind2/ns/default/sa/bookinfo-productpage   spiffe://kind3/ns/default/sa/bookinfo-reviews   v1
606     200    spiffe://kind2/ns/default/sa/bookinfo-productpage   spiffe://kind3/ns/default/sa/bookinfo-reviews   v2
1527    200    spiffe://kind2/ns/default/sa/bookinfo-productpage   spiffe://kind3/ns/default/sa/bookinfo-reviews   v3
```
Then, I launch Apache JMeter to send some workload to the `productpage` microservice of the first cluster (through the Istio Ingress Gateway). After 30 seconds, I make the `reviews` services unavailable on the first cluster:
```shell
kubectl --context kind-kind2 patch deploy reviews-v1 --patch '{"spec": {"template": {"spec": {"containers": [{"name": "reviews","command": ["sleep", "20h"]}]}}}}'
kubectl --context kind-kind2 patch deploy reviews-v2 --patch '{"spec": {"template": {"spec": {"containers": [{"name": "reviews","command": ["sleep", "20h"]}]}}}}'
```
When I refresh the web page several times again, I now also see `v3` (red stars), which means that all the requests are handled by the second cluster.
After 2 minutes, I stop JMeter and look at the Envoy stats again:
```
# Req   Resp   Source principal                                    Target principal                                Target version
7596    200    spiffe://kind2/ns/default/sa/bookinfo-productpage   spiffe://kind2/ns/default/sa/bookinfo-reviews   v1
15      503    spiffe://kind2/ns/default/sa/bookinfo-productpage   spiffe://kind2/ns/default/sa/bookinfo-reviews   v1
7568    200    spiffe://kind2/ns/default/sa/bookinfo-productpage   spiffe://kind2/ns/default/sa/bookinfo-reviews   v2
17      503    spiffe://kind2/ns/default/sa/bookinfo-productpage   spiffe://kind2/ns/default/sa/bookinfo-reviews   v2
3517    200    spiffe://kind2/ns/default/sa/bookinfo-productpage   spiffe://kind3/ns/default/sa/bookinfo-reviews   v1
704     200    spiffe://kind2/ns/default/sa/bookinfo-productpage   spiffe://kind3/ns/default/sa/bookinfo-reviews   v2
3419    200    spiffe://kind2/ns/default/sa/bookinfo-productpage   spiffe://kind3/ns/default/sa/bookinfo-reviews   v3
```
We can see that requests were sent to the first cluster at the beginning of the test, then a few 503 errors were returned, and finally the requests were sent to the second cluster. The performance hasn't really been impacted by the failover. Obviously, the latency between the two clusters could have a greater impact in a production environment.
Now, let’s understand what happened when we created the TrafficPolicies and the FailoverService.
`mgmt-reviews-outlier` TrafficPolicy:

Service Mesh Hub updates the `reviews` DestinationRule on both clusters to add the outlier detection specified in the TrafficPolicy.

Note that the `maxEjectionPercent` default value is 10% in Istio! That's why Service Mesh Hub sets it to 100% if there's no value specified in the TrafficPolicy.
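Based on that description, the generated Istio resource would look roughly like the following sketch (field names per the Istio 1.7 DestinationRule API; the exact resource Service Mesh Hub emits may differ):

```yaml
# Illustrative sketch of the DestinationRule Service Mesh Hub would generate
# on each cluster from the mgmt-reviews-outlier TrafficPolicy.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews
  namespace: default
spec:
  host: reviews.default.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutiveErrors: 1      # eject a host after a single error
      interval: 10s             # analysis sweep interval
      baseEjectionTime: 2m      # minimum ejection duration
      maxEjectionPercent: 100   # set by Service Mesh Hub (Istio defaults to 10)
```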
`reviews-failover` FailoverService:

Service Mesh Hub creates an EnvoyFilter on the first Kubernetes cluster to replace the `outbound|9080||reviews-failover.default.global` Envoy cluster with the following ones:

- `outbound|9080||reviews.default.svc.cluster.local`
- `outbound|9080||reviews.default.svc.kind3.global`

Service Mesh Hub also creates a ServiceEntry for the `reviews-failover.default.global` host.
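That ServiceEntry registers the failover hostname with Istio so sidecars can route to it. It would look roughly like this sketch (illustrative only; the exact fields Service Mesh Hub generates may differ):

```yaml
# Illustrative sketch of a ServiceEntry for the failover hostname.
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: reviews-failover
  namespace: istio-system
spec:
  hosts:
  - reviews-failover.default.global   # the hostname defined by the FailoverService
  location: MESH_INTERNAL             # traffic stays inside the (virtual) mesh
  ports:
  - number: 9080
    name: http
    protocol: HTTP
```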
`reviews-shift-failover` TrafficPolicy:

Service Mesh Hub creates a VirtualService on the first Kubernetes cluster to tell Istio to send the requests for the `reviews` microservice to the `reviews-failover.default.global` host.
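A minimal sketch of what such a VirtualService could look like (assumed shape, not the exact resource Service Mesh Hub generates):

```yaml
# Illustrative sketch: route local "reviews" traffic to the failover hostname.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
  namespace: default
spec:
  hosts:
  - reviews.default.svc.cluster.local   # requests addressed to the local service...
  http:
  - route:
    - destination:
        host: reviews-failover.default.global   # ...are shifted to the FailoverService
        port:
          number: 9080
```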
Get Started
Service Mesh Hub was updated and open sourced in May and has recently started community meetings to expand the conversation around service mesh. We invite you to check out the project and join the community. Solo.io also offers enterprise support for Istio service mesh for those looking to operationalize service mesh environments, request a meeting to learn more here.
- Learn more about Service Mesh Hub
- Read the docs and watch the demos
- Request a personalized demo
- Questions? Join the community