Multi-Cluster Service Mesh Failover and Fallback Routing

In this blog series, we dig into specific challenge areas of multi-cluster Kubernetes and service mesh architecture, along with considerations and approaches for solving them. In our first post, we looked at service discovery; in this post, we'll look at failover and fallback routing across multiple clusters.

What is failover and fallback routing?

When building applications on cloud platforms like Kubernetes (i.e., ones where compute, network, and storage are ephemeral and unreliable), planning for failures isn't just nice to have; it's a prerequisite for these architectures. Instead of only working to prevent failures, implementing a strategy to gracefully handle an unplanned failure is critical to the customer experience and to avoiding cascading failures across dependent services. Microservices architectures exacerbate this because there are many layers (physical or abstracted infrastructure, applications) and locations (distributed and dynamic) where a failure can happen.

Resilience practices like circuit breaking, retries, and timeouts can mitigate intermittent network failures or misbehaving services, but what happens when a service fails only partially across multiple clusters? In other words, some portion of the service dependency is misbehaving, but there may be healthy alternatives elsewhere. That's where multi-cluster failover and fallback come into play.
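Resilience behaviors like these are typically configured in the mesh rather than in application code. As a minimal, illustrative sketch (not part of the walkthrough below), an Istio VirtualService can add retries and a timeout for a service; the host name and values here are examples only:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews-resilience
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
    # Fail fast after 2 seconds and retry transient errors a few times
    timeout: 2s
    retries:
      attempts: 3
      perTryTimeout: 500ms
      retryOn: 5xx,connect-failure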

Challenges in handling failures across clusters

Things get more complicated as you think about how to handle failover / fallback routing to different clusters. How does a service decide when to failover within a cluster versus outside the cluster? How does a service in one network find a service in another network? What options for automatic failover do you have?

Service Mesh Hub will make your life easier, a lot easier!

Service Mesh Hub is an open-source service mesh management plane that simplifies service-mesh operations and lets you manage multiple clusters of a service mesh from a centralized management plane. Service Mesh Hub takes care of things like shared-trust or root CA federation, workload discovery, unified multi-cluster and global traffic policy, access policy, and failover.

Leveraging the automatic service discovery and global service registry of the virtual mesh, Service Mesh Hub lets admins apply traffic routing policies at a global level across multiple clusters. The virtual mesh is a trust zone for the clusters within it, so application traffic between services can traverse cluster boundaries.
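For reference, the grouping itself is expressed with a VirtualMesh resource that lists the meshes to federate. The sketch below shows the general shape, using mesh names that match the ones referenced later in this post; the mTLS and federation fields are based on the v1alpha2 API of the time and may differ in newer releases:

apiVersion: networking.smh.solo.io/v1alpha2
kind: VirtualMesh
metadata:
  name: virtual-mesh
  namespace: service-mesh-hub
spec:
  # Share a generated root CA so mTLS traffic can cross cluster boundaries
  mtlsConfig:
    autoRestartPods: true
    shared:
      rootCertificateAuthority:
        generated: {}
  federation: {}
  # The Istio meshes discovered by Service Mesh Hub on the two workload clusters
  meshes:
  - name: istiod-istio-system-kind2
    namespace: service-mesh-hub
  - name: istiod-istio-system-kind3
    namespace: service-mesh-hub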

Service Mesh Hub in action

To demonstrate how Service Mesh Hub handles global failover routing, I’ll walk through an environment and simulation. Here is the environment I’ve prepared.

I've deployed three Kubernetes clusters using Kind: one (kind1) has Service Mesh Hub installed as the management cluster, and the other two (kind2 and kind3) have Istio installed.
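For reference, the clusters themselves can be created with Kind along these lines (the names match the kubectl contexts used throughout this post; the Service Mesh Hub and Istio installation steps are omitted here):

kind create cluster --name kind1   # management cluster: Service Mesh Hub
kind create cluster --name kind2   # workload cluster 1: Istio + Bookinfo
kind create cluster --name kind3   # workload cluster 2: Istio + Bookinfo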

I run the following commands to deploy the Bookinfo app on the first Istio cluster (kind2):

kubectl --context kind-kind2 label namespace default istio-injection=enabled 
kubectl --context kind-kind2 apply -f https://raw.githubusercontent.com/istio/istio/1.7.0/samples/bookinfo/platform/kube/bookinfo.yaml -l 'app,version notin (v3)' 
kubectl --context kind-kind2 apply -f https://raw.githubusercontent.com/istio/istio/1.7.0/samples/bookinfo/platform/kube/bookinfo.yaml -l 'account' 
kubectl --context kind-kind2 apply -f https://raw.githubusercontent.com/istio/istio/1.7.0/samples/bookinfo/networking/bookinfo-gateway.yaml

And I check that the app is running using kubectl --context kind-kind2 get pods:

NAME                              READY   STATUS    RESTARTS   AGE
details-v1-558b8b4b76-w9qp8       2/2     Running   0          2m33s
productpage-v1-6987489c74-54lvk   2/2     Running   0          2m34s
ratings-v1-7dc98c7588-pgsxv       2/2     Running   0          2m34s
reviews-v1-7f99cc4496-lwtsr       2/2     Running   0          2m34s
reviews-v2-7d79d5bd5d-mpsk2       2/2     Running   0          2m34s

As you can see, the label selectors excluded the v3 version of the reviews microservice.

Then, I run the following commands to deploy the app (including reviews v3 this time) on the second Istio cluster (kind3):

kubectl --context kind-kind3 label namespace default istio-injection=enabled
kubectl --context kind-kind3 apply -f https://raw.githubusercontent.com/istio/istio/1.7.0/samples/bookinfo/platform/kube/bookinfo.yaml
kubectl --context kind-kind3 apply -f https://raw.githubusercontent.com/istio/istio/1.7.0/samples/bookinfo/networking/bookinfo-gateway.yaml
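To check the second cluster and reach the application locally, a couple of commands are enough on a Kind setup; this assumes the default istio-ingressgateway service that ships with Istio:

# All three reviews versions (v1, v2 and v3) should be running on kind3
kubectl --context kind-kind3 get pods

# Expose the productpage of the first Istio cluster locally,
# then browse to http://localhost:8080/productpage
kubectl --context kind-kind2 -n istio-system port-forward svc/istio-ingressgateway 8080:80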

When I refresh the Bookinfo product page several times, I only see versions v1 (no stars) and v2 (black stars) of the reviews service, which means all the requests are being handled by the first Istio cluster (kind2).

Now, let's create a TrafficPolicy that defines outlier detection settings to detect and evict unhealthy hosts for the reviews microservice.

cat << EOF | kubectl --context kind-kind1 apply -f -

apiVersion: networking.smh.solo.io/v1alpha2
kind: TrafficPolicy
metadata:
  namespace: service-mesh-hub
  name: mgmt-reviews-outlier
spec:
  destinationSelector:
  - kubeServiceRefs:
      services:
      - name: reviews
        namespace: default
        clusterName: kind2
      - name: reviews
        namespace: default
        clusterName: kind3
  outlierDetection:
    consecutiveErrors: 1
    interval: 10s
    baseEjectionTime: 2m
EOF
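Once applied, the policy's status can be checked on the management cluster to confirm it has been accepted and translated (the exact status fields vary by release):

kubectl --context kind-kind1 -n service-mesh-hub get trafficpolicies.networking.smh.solo.io mgmt-reviews-outlier -o yaml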

Then, I create a FailoverService to define a new hostname (reviews-failover.default.global) that will be backed by the reviews microservice running on both clusters.

cat << EOF | kubectl --context kind-kind1 apply -f -

apiVersion: networking.smh.solo.io/v1alpha2
kind: FailoverService
metadata:
  name: reviews-failover
  namespace: service-mesh-hub
spec:
  hostname: reviews-failover.default.global
  port:
    number: 9080
    protocol: http
  meshes:
    - name: istiod-istio-system-kind2
      namespace: service-mesh-hub
  backingServices:
  - kubeService:
      name: reviews
      namespace: default
      clusterName: kind2
  - kubeService:
      name: reviews
      namespace: default
      clusterName: kind3
EOF

Finally, I can define another TrafficPolicy to make sure all the requests for the reviews microservice on the local cluster will be handled by the FailoverService we've just created.

cat << EOF | kubectl --context kind-kind1 apply -f -
apiVersion: networking.smh.solo.io/v1alpha2
kind: TrafficPolicy
metadata:
  name: reviews-shift-failover
  namespace: default
spec:
  destinationSelector:
  - kubeServiceRefs:
      services:
      - clusterName: kind2
        name: reviews
        namespace: default
  trafficShift:
    destinations:
    - failoverServiceRef:
        name: reviews-failover
        namespace: service-mesh-hub
EOF

I set up a port-forward to access the Envoy admin API of the sidecar proxy running in the productpage Pod.
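For reference, this is roughly what that looks like; the sidecar's Envoy admin interface listens on port 15000, and its raw counters can be filtered for the reviews clusters (the table below is a summarized view of those counters):

# Forward the Envoy admin port of the productpage sidecar
kubectl --context kind-kind2 port-forward deploy/productpage-v1 15000:15000

# In another terminal, dump the upstream request counters for the reviews clusters
curl -s http://localhost:15000/stats | grep reviews | grep upstream_rq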

This gives me stats about the number and status of the requests sent by productpage to the reviews microservice before the failover:

 # Req  Resp    Source principal                                        Target principal                                Target version
 5987   200     spiffe://kind2/ns/default/sa/bookinfo-productpage       spiffe://kind2/ns/default/sa/bookinfo-reviews   v1
 14     503     spiffe://kind2/ns/default/sa/bookinfo-productpage       spiffe://kind2/ns/default/sa/bookinfo-reviews   v1
 5943   200     spiffe://kind2/ns/default/sa/bookinfo-productpage       spiffe://kind2/ns/default/sa/bookinfo-reviews   v2
 3      503     spiffe://kind2/ns/default/sa/bookinfo-productpage       spiffe://kind2/ns/default/sa/bookinfo-reviews   v2
 1044   200     spiffe://kind2/ns/default/sa/bookinfo-productpage       spiffe://kind3/ns/default/sa/bookinfo-reviews   v1
 606    200     spiffe://kind2/ns/default/sa/bookinfo-productpage       spiffe://kind3/ns/default/sa/bookinfo-reviews   v2
 1527   200     spiffe://kind2/ns/default/sa/bookinfo-productpage       spiffe://kind3/ns/default/sa/bookinfo-reviews   v3

Then, I launch Apache JMeter to send some load to the productpage microservice of the first cluster (through the Istio ingress gateway).
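Any load generator works here; if JMeter isn't at hand, a simple shell loop against the port-forwarded ingress gateway produces a comparable steady stream of requests (the URL assumes the port-forward shown earlier):

# Send a continuous trickle of requests and print the HTTP status codes
while true; do
  curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/productpage
  sleep 0.1
done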

After 30 seconds, I make the reviews services unavailable on the first cluster by overriding their container command:

kubectl --context kind-kind2 patch deploy reviews-v1 --patch '{"spec": {"template": {"spec": {"containers": [{"name": "reviews","command": ["sleep", "20h"]}]}}}}'
kubectl --context kind-kind2 patch deploy reviews-v2 --patch '{"spec": {"template": {"spec": {"containers": [{"name": "reviews","command": ["sleep", "20h"]}]}}}}'
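These patches replace the container entrypoint with a long sleep, so the reviews pods on the first cluster restart but no longer answer on port 9080, which is exactly the kind of partial failure the outlier detection is watching for:

# The reviews containers now only run sleep and stop serving traffic
kubectl --context kind-kind2 get pods -l app=reviews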

When I refresh the web page several times again, I now also see v3 (red stars), which means all the requests are now handled by the second cluster (kind3), the only one running v3.

After 2 minutes, I stop JMeter and look at the Envoy stats again:

# Req  Resp    Source principal                                        Target principal                                Target version
 7596   200     spiffe://kind2/ns/default/sa/bookinfo-productpage       spiffe://kind2/ns/default/sa/bookinfo-reviews   v1
 15     503     spiffe://kind2/ns/default/sa/bookinfo-productpage       spiffe://kind2/ns/default/sa/bookinfo-reviews   v1
 7568   200     spiffe://kind2/ns/default/sa/bookinfo-productpage       spiffe://kind2/ns/default/sa/bookinfo-reviews   v2
 17     503     spiffe://kind2/ns/default/sa/bookinfo-productpage       spiffe://kind2/ns/default/sa/bookinfo-reviews   v2
 3517   200     spiffe://kind2/ns/default/sa/bookinfo-productpage       spiffe://kind3/ns/default/sa/bookinfo-reviews   v1
 704    200     spiffe://kind2/ns/default/sa/bookinfo-productpage       spiffe://kind3/ns/default/sa/bookinfo-reviews   v2
 3419   200     spiffe://kind2/ns/default/sa/bookinfo-productpage       spiffe://kind3/ns/default/sa/bookinfo-reviews   v3

We can see that requests were sent to the first cluster at the beginning of the test, then a few 503 errors were returned, and finally the requests were sent to the second cluster.

And performance wasn't really impacted by the failover.

Obviously, the latency between the 2 clusters could have a greater impact in a production environment.

Now, let’s understand what happened when we created the TrafficPolicies and the FailoverService.

  • mgmt-reviews-outlier TrafficPolicy:

Service Mesh Hub updates the reviews DestinationRule on both clusters to add the outlier detection specified in the TrafficPolicy.

Note that the default value of maxEjectionPercent is 10% in Istio! That's why Service Mesh Hub sets it to 100% if no value is specified in the TrafficPolicy.
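As a sketch, the translated DestinationRule looks roughly like this on each cluster (the resource name is illustrative, and depending on the Istio version the consecutive-error field may appear as consecutiveErrors or consecutive5xxErrors):

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews
  namespace: default
spec:
  host: reviews.default.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      # Values taken from the mgmt-reviews-outlier TrafficPolicy above
      consecutive5xxErrors: 1
      interval: 10s
      baseEjectionTime: 2m
      # Overridden to 100% by Service Mesh Hub (Istio default is 10%)
      maxEjectionPercent: 100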

  • reviews-failover FailoverService:

Service Mesh Hub creates an EnvoyFilter on the first Kubernetes cluster to replace the outbound|9080||reviews-failover.default.global Envoy cluster with the following ones:

  • outbound|9080||reviews.default.svc.cluster.local
  • outbound|9080||reviews.default.svc.kind3.global


Service Mesh Hub also creates a ServiceEntry for the reviews-failover.default.global host.

  • reviews-shift-failover TrafficPolicy:

Service Mesh Hub creates a VirtualService on the first Kubernetes cluster to tell Istio to send the requests for the reviews microservice to the reviews-failover.default.global host.
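All of these translated Istio resources can be inspected directly on the workload cluster to see exactly what Service Mesh Hub generated (names and namespaces vary by version):

kubectl --context kind-kind2 get destinationrules,virtualservices,serviceentries,envoyfilters --all-namespaces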

Get Started

Service Mesh Hub was updated and open sourced in May, and the project has recently started community meetings to expand the conversation around service mesh. We invite you to check out the project and join the community. Solo.io also offers enterprise support for Istio service mesh for those looking to operationalize service mesh environments; request a meeting to learn more here.