Gloo Platform: Stay Operational During Regional Outages with a Highly Available Management Plane
Regional High Availability (HA) is often a key requirement for large enterprises, financial institutions, and telcos that have strict SLAs to protect their application availability during regional outages. While significant time and effort is spent on data plane availability, regional HA for the management and configuration plane is often overlooked, which can delay the propagation of key policies that govern the failover behavior of those applications.
At Solo.io, we work with many customers who have strict requirements around regional HA for all aspects of their infrastructure and the applications running on it. Gloo Platform has always supported a multi-workload-cluster, multi-region data plane, managed by a single management server that is resilient and scalable.
In the 2.4.0 release, we are excited to announce official support for multiple management servers that can span multiple clusters located in multiple regions for additional resiliency. This means that in addition to the data plane surviving a regional outage, the management plane can also continue to receive and push configuration changes.
Management vs. Data Plane
Before we dive into multiple management servers, let’s focus on the key functions performed by the Gloo Platform management plane. The management plane processes and reconciles configurations across all workload clusters in the data plane. The data plane then serves workloads using those configurations and will continue operating in the absence of a management server. Any configurations in place at the time the management server becomes unavailable will remain in effect and workloads will continue to operate in a multi-region configuration without data plane downtime.
Management Plane: Multi-Zone and Multi-Region
The Gloo Platform management plane supports two forms of resiliency — horizontal replica scaling and multiple management server clusters.
Since the release of Gloo Platform 2.1, the horizontal replica scaling feature provides multi-zone resilience within the same region. This is accomplished by replicating the management server deployment within a single Kubernetes cluster. These replicas provide distributed scaling where each replica handles a unique subset of workload agents.
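As a minimal sketch, replica scaling is an ordinary Kubernetes deployment scale-out. The deployment name, namespace, and label below are assumptions based on a default install and may differ in your environment.

```shell
# Scale the management server deployment to three replicas in one cluster.
# Spreading replicas across zones (via topology spread constraints or
# anti-affinity) is what yields multi-zone resilience.
kubectl scale deployment gloo-mesh-mgmt-server \
  --namespace gloo-mesh \
  --replicas 3

# Confirm the replicas are running; connected agents are sharded so that
# each replica serves a unique subset of them.
kubectl get pods --namespace gloo-mesh -l app=gloo-mesh-mgmt-server
```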
Implementing multiple management server Kubernetes clusters provides multi-region resilience and may also be used to recover from entire Kubernetes cluster failures. Each management server Kubernetes cluster may be deployed to different regions or the same region, with the only requirement being that workload clusters have a network path to connect to the management servers. Agents connect to only one management cluster at a time and the remaining clusters operate as standbys.
What is Considered a Failure?
On the surface, determining failure seems simple: if one management cluster is unavailable, another should take over. Across the range of possible hosting topologies and architectures, however, a failure could be any of a number of things:
- Kubernetes control plane failure
- Zonal or regional network failure
- Load balancer failure between two components, or simply a misconfiguration of some part of the Kubernetes cluster that hosts the Gloo management server
Given the variety of potential failures, Gloo Platform shouldn’t be responsible for determining which management server is the primary server. To solve this problem, Gloo Platform requires a DNS record that resolves to the currently active management server. This provides the most flexibility for different clouds and hosting topologies.
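To illustrate, a failover with this model amounts to repointing that DNS record. The sketch below uses AWS Route 53; the hosted zone ID, record name, and target load balancer address are all placeholders, and any DNS provider with a programmable API would work the same way.

```shell
# Hypothetical failover step: repoint the shared management-server
# hostname at the standby cluster's load balancer. Agents reconnect to
# whatever the record resolves to; a low TTL keeps failover fast.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0000000EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "gloo-mgmt.example.com",
        "Type": "CNAME",
        "TTL": 30,
        "ResourceRecords": [
          {"Value": "standby-lb.us-west-2.elb.amazonaws.com"}
        ]
      }
    }]
  }'
```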
To support configuration consistency during or after a failover, all Kubernetes resources must be written to all Kubernetes clusters that host a management server. This could be done with simple CI pipelines that write to passive and active management clusters in serial, or using GitOps tooling such as ArgoCD or Flux that implement a pull model to keep Kubernetes resources in sync across cluster fleets.
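The simple CI approach mentioned above can be sketched in a few lines. The kubeconfig context names and manifest directory are placeholders for your environment.

```shell
# Apply the same Gloo resources to the active and passive management
# clusters in serial, failing fast if either write does not succeed.
set -euo pipefail

for ctx in mgmt-active mgmt-passive; do
  kubectl --context "${ctx}" apply -f gloo-policies/
done
```

A GitOps controller such as ArgoCD or Flux achieves the same end state by having each management cluster pull from the same Git repository, which removes the CI job as a single point of failure.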
The Gloo Platform management server uses Redis to store processed configurations for all connected agents from workload clusters. We recommend using a shared Redis datastore for all management servers to prevent configuration drift and reduce the time it takes for a newly active management server to assume control and be able to accept new configuration changes.
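A hedged sketch of the Helm values involved: both management servers disable their in-cluster Redis and point at one external endpoint. The exact value names vary by Gloo Platform version, so treat these keys as illustrative and consult the Helm reference for your release.

```yaml
# Illustrative Helm values for a shared external Redis (key names are
# assumptions; the endpoint is a placeholder).
redis:
  deployment:
    enabled: false                        # disable the bundled Redis
  address: redis.shared.example.com:6379  # shared datastore for all
                                          # management servers
```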
When shared Redis is used, all management servers have a complete and identical view of the configuration state of all workload clusters in the data plane. At the time of failover, the newly active management server will honor configurations in place even if all agents have not yet reconnected — this means no configuration drift throughout the recovery process. The new management server will also have a greatly reduced warmup time, since it does not need to re-process all configurations.
Multi-Region AWS Demo
Check out Solo.io’s AWS multi-region demo on YouTube! In the demo we deploy multiple workload and management servers to two regions and show how Gloo Platform handles workload and whole-region failures in AWS.