Multi-Cluster Service Mesh Failover and Fallback Routing

Denis Jannot | September 8, 2020

In this blog series, we dig into specific challenge areas for multi-cluster Kubernetes and service mesh architectures, along with considerations and approaches for solving them. In our first post, we looked at service discovery; in this post we'll look at failover and fallback routing across multiple clusters.

What is failover and fallback routing?

When building applications on cloud platforms like Kubernetes (i.e., ones where compute, network, and storage are ephemeral and unreliable), planning for failures isn't just nice to have; it's a prerequisite for these architectures. Instead of only working to prevent failures, implementing a strategy to gracefully handle an unplanned failure is critical to the customer experience and to avoiding cascading failures across other dependent services. Microservices architectures exacerbate this because there are many layers (physical or abstracted infrastructure, applications) and locations (distributed and dynamic) where a failure can happen.

Useful resilience practices like circuit breaking, retries, and timeouts can mitigate intermittent network failures or misbehaving services, but what happens when there are partial failures of a service across multiple clusters? In other words, some portion of the service dependency is misbehaving, but there may be alternative options. That's where multi-cluster failover and fallback come into play.

Challenges in handling failures across clusters

Things get more complicated as you think about how to handle failover / fallback routing to different clusters. How does a service decide when to failover within a cluster versus outside the cluster? How does a service in one network find a service in another network? What options for automatic failover do you have?

Service Mesh Hub will make your life easier, a lot easier!

Service Mesh Hub is an open-source service mesh management plane that simplifies service-mesh operations and lets you manage multiple clusters of a service mesh from a centralized management plane. Service Mesh Hub takes care of things like shared-trust or root CA federation, workload discovery, unified multi-cluster and global traffic policy, access policy, and failover.

Leveraging the automatic service discovery and global service registry of the virtual mesh, Service Mesh Hub allows admins to apply traffic routing policies at a global level across multiple clusters. The virtual mesh is a trust zone for the clusters within it so that the application traffic between services can traverse across the cluster boundaries.

Service Mesh Hub in action

To demonstrate how Service Mesh Hub handles global failover routing, I’ll walk through an environment and simulation. Here is the environment I’ve prepared.

I’ve deployed 3 Kubernetes clusters (using Kind); one cluster has Service Mesh Hub installed and the other two have Istio installed.

I run the following commands to deploy the Bookinfo app on the first cluster:
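The commands below are a sketch based on Istio's standard Bookinfo sample; the release URL, label selectors, and the kind-kind2 context name are assumptions about my setup, so adjust them to your environment:

```shell
# Enable Istio sidecar injection in the default namespace of the first cluster
kubectl --context kind-kind2 label namespace default istio-injection=enabled

# Deploy every Bookinfo workload except reviews-v3
kubectl --context kind-kind2 apply \
  -f https://raw.githubusercontent.com/istio/istio/release-1.7/samples/bookinfo/platform/kube/bookinfo.yaml \
  -l 'app,version notin (v3)'
kubectl --context kind-kind2 apply \
  -f https://raw.githubusercontent.com/istio/istio/release-1.7/samples/bookinfo/platform/kube/bookinfo.yaml \
  -l 'account'

# Expose the app through the Istio ingress gateway
kubectl --context kind-kind2 apply \
  -f https://raw.githubusercontent.com/istio/istio/release-1.7/samples/bookinfo/networking/bookinfo-gateway.yaml
```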

And I check that the app is running using kubectl --context kind-kind2 get pods:

As you can see, it didn't deploy the v3 version of the reviews microservice.

Then, I run the following commands to deploy the app on the second cluster:
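This time I deploy the full sample, including reviews-v3; the kind-kind3 context name and release URL are again assumptions about my setup:

```shell
# Deploy the complete Bookinfo app (including reviews-v3) on the second cluster
kubectl --context kind-kind3 label namespace default istio-injection=enabled
kubectl --context kind-kind3 apply \
  -f https://raw.githubusercontent.com/istio/istio/release-1.7/samples/bookinfo/platform/kube/bookinfo.yaml
kubectl --context kind-kind3 apply \
  -f https://raw.githubusercontent.com/istio/istio/release-1.7/samples/bookinfo/networking/bookinfo-gateway.yaml
```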

When I refresh the web page several times, I see only the versions v1 (no stars) and v2 (black stars), which means that all the requests are handled by the first cluster.

Now, let's create a TrafficPolicy to define outlier detection settings to detect and evict unhealthy hosts for the reviews microservice.
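A TrafficPolicy along these lines does the job. Treat the apiVersion, cluster names, and exact field names as assumptions based on the Service Mesh Hub v1alpha2 API of this era; check the current CRD reference before applying:

```yaml
apiVersion: networking.smh.solo.io/v1alpha2
kind: TrafficPolicy
metadata:
  name: mgmt-reviews-outlier
  namespace: service-mesh-hub
spec:
  destinationSelector:
  - kubeServiceRefs:
      services:
      - name: reviews
        namespace: default
        clusterName: kind2
      - name: reviews
        namespace: default
        clusterName: kind3
  outlierDetection:
    consecutiveErrors: 1      # evict a host after a single error
    interval: 10s             # analysis sweep interval
    baseEjectionTime: 120s    # minimum ejection duration
```

Note that no maxEjectionPercent is specified here, which matters for what Service Mesh Hub generates, as we'll see below.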

Then, I create a FailoverService to define a new hostname (reviews-failover.default.global) that will be backed by the reviews microservices running on both clusters.
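Here's a sketch of the FailoverService; the mesh reference name and exact field layout are assumptions based on the v1alpha2 API:

```yaml
apiVersion: networking.smh.solo.io/v1alpha2
kind: FailoverService
metadata:
  name: reviews-failover
  namespace: service-mesh-hub
spec:
  hostname: reviews-failover.default.global
  port:
    number: 9080
    protocol: http
  meshes:
  - name: istiod-istio-system-kind2   # mesh name as discovered by Service Mesh Hub (assumed)
    namespace: service-mesh-hub
  backingServices:
  # Order matters: traffic prefers the first healthy backing service
  - kubeService:
      name: reviews
      namespace: default
      clusterName: kind2
  - kubeService:
      name: reviews
      namespace: default
      clusterName: kind3
```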

Finally, I can define another TrafficPolicy to make sure all the requests for the reviews microservice on the local cluster will be handled by the FailoverService we've just created.
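A traffic-shift policy roughly like the following accomplishes that; again, the exact field names are assumptions against the v1alpha2 API:

```yaml
apiVersion: networking.smh.solo.io/v1alpha2
kind: TrafficPolicy
metadata:
  name: reviews-shift-failover
  namespace: default
spec:
  destinationSelector:
  - kubeServiceRefs:
      services:
      - name: reviews
        namespace: default
        clusterName: kind2
  trafficShift:
    destinations:
    # Shift all traffic for the local reviews service to the FailoverService
    - failoverService:
        name: reviews-failover
        namespace: service-mesh-hub
```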

I’m doing a port-forward to access the Envoy API of the sidecar proxy running in the productpage Pod.

It allows me to get some stats about the number and status of the requests sent by the productpage to the reviews micro service before the failover:
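Something like this works; the deployment name and the exact stat names are assumptions, and 15000 is Envoy's standard admin port:

```shell
# Forward the Envoy admin port of the productpage sidecar to localhost
kubectl --context kind-kind2 port-forward deploy/productpage-v1 15000 &

# Show per-upstream-cluster request counters for the reviews services
curl -s http://localhost:15000/stats | grep reviews | grep upstream_rq
```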

Then, I launch Apache JMeter to send some workload to the productpage microservice of the first cluster (through the Istio Ingress Gateway).

After 30 seconds, I’m making the reviews services unavailable on the first cluster.
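One simple way to simulate the failure (the deployment names are assumptions based on the Bookinfo sample):

```shell
# Remove all reviews replicas from the first cluster so every request to them fails
kubectl --context kind-kind2 scale deploy/reviews-v1 deploy/reviews-v2 --replicas=0
```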

When I refresh the web page several times again, I see only v3 (red stars) as well, which means that all the requests are handled by the second cluster.

After 2 minutes, I stop JMeter and look at the Envoy stats again:

We can see that requests were sent to the first cluster at the beginning of the test, then a few 503 errors were returned, and finally the requests were sent to the second cluster.

And performance hasn't really been impacted by the failover.

Obviously, the latency between the 2 clusters could have a greater impact in a production environment.

Now, let’s understand what happened when we created the TrafficPolicies and the FailoverService.

  • mgmt-reviews-outlier TrafficPolicy:

Service Mesh Hub updates the reviews DestinationRule on both clusters to add the outlier detection specified in the TrafficPolicy.

Note that the maxEjectionPercent default value is 10% in Istio! That's why Service Mesh Hub sets it to 100% if there's no value specified in the TrafficPolicy.
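The generated DestinationRule looks roughly like this sketch (the outlier detection field names follow Istio's API; the apiVersion and exact layout are assumptions):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
  namespace: default
spec:
  host: reviews.default.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 1
      interval: 10s
      baseEjectionTime: 120s
      maxEjectionPercent: 100   # set explicitly by Service Mesh Hub (Istio's default is 10%)
```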

  • reviews-failover FailoverService:

Service Mesh Hub creates an EnvoyFilter on the first Kubernetes cluster to replace the outbound|9080||reviews-failover.default.global Envoy cluster with an aggregate cluster composed of the following ones:

  • outbound|9080||reviews.default.svc.cluster.local
  • outbound|9080||reviews.default.svc.kind3.global


Service Mesh Hub creates a ServiceEntry for the reviews-failover.default.global host.
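That ServiceEntry is roughly shaped like the sketch below; I've omitted the generated addresses, and the apiVersion and namespace are assumptions:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: reviews-failover
  namespace: istio-system
spec:
  hosts:
  - reviews-failover.default.global
  location: MESH_INTERNAL
  ports:
  - number: 9080
    name: http
    protocol: HTTP
  resolution: DNS
```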

  • reviews-shift-failover TrafficPolicy:

Service Mesh Hub creates a VirtualService on the first Kubernetes cluster to tell Istio to send the requests for the reviews microservice to the reviews-failover.default.global host.
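The generated VirtualService is roughly the following (a sketch; the apiVersion and exact layout are assumptions):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
  namespace: default
spec:
  hosts:
  - reviews.default.svc.cluster.local
  http:
  - route:
    # All requests for the local reviews service go to the failover hostname
    - destination:
        host: reviews-failover.default.global
        port:
          number: 9080
```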

Get Started

Service Mesh Hub was updated and open sourced in May, and we have recently started community meetings to expand the conversation around service mesh. We invite you to check out the project and join the community. Solo.io also offers enterprise support for Istio service mesh for those looking to operationalize service mesh environments; request a meeting to learn more.
