Gloo Platform: Stay Operational During Regional Outages with a Highly Available Management Plane
Regional High Availability (HA) is often a key requirement for large enterprises, financial institutions, and telcos that have strict SLAs to protect their application availability during regional outages. While significant time and effort is spent on data plane availability, regional HA for the management and configuration plane is often overlooked, which can delay the propagation of key policies that govern the failover behavior of those applications.
At Solo.io, we work with many customers who have strict requirements around regional HA for all aspects of their infrastructure and the applications running on it. Gloo Platform has always supported a multi-workload-cluster, multi-region data plane, managed by a single management server that is resilient and scalable.
In the 2.4.0 release, we are excited to announce official support for multiple management servers that can span multiple clusters located in multiple regions for additional resiliency. This means that in addition to the data plane surviving a regional outage, the management plane can also continue to receive and push configuration changes.
Management vs. Data Plane
Before we dive into multiple management servers, let’s focus on the key functions performed by the Gloo Platform management plane. The management plane processes and reconciles configurations across all workload clusters in the data plane. The data plane then serves workloads using those configurations and will continue operating in the absence of a management server. Any configurations in place at the time the management server becomes unavailable will remain in effect and workloads will continue to operate in a multi-region configuration without data plane downtime.
Management Plane: Multi-Zone and Multi-Region
The Gloo Platform management plane supports two forms of resiliency — horizontal replica scaling and multiple management server clusters.
Since the release of Gloo Platform 2.1, the horizontal replica scaling feature provides multi-zone resilience within the same region. This is accomplished by replicating the management server deployment within a single Kubernetes cluster. These replicas provide distributed scaling where each replica handles a unique subset of workload agents.
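As a minimal sketch, replica scaling is an ordinary Kubernetes deployment scale-out. The deployment name, namespace, and label below are assumptions based on a default install and may differ in your environment.

```shell
# Scale the management server deployment to three replicas in one cluster.
# Spreading replicas across zones (via topology spread constraints or
# anti-affinity) is what yields multi-zone resilience.
kubectl scale deployment gloo-mesh-mgmt-server \
  --namespace gloo-mesh \
  --replicas 3

# Confirm the replicas are running; connected agents are sharded so that
# each replica serves a unique subset of them.
kubectl get pods --namespace gloo-mesh -l app=gloo-mesh-mgmt-server
```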
Implementing multiple management server Kubernetes clusters provides multi-region resilience and may also be used to recover from entire Kubernetes cluster failures. Each management server Kubernetes cluster may be deployed to different regions or the same region, with the only requirement being that workload clusters have a network path to connect to the management servers. Agents connect to only one management cluster at a time and the remaining clusters operate as standbys.
What is Considered a Failure?
On the surface, determining failure seems simple: if one management cluster is unavailable, another should take over. Across the range of possible hosting topologies and architectures, however, a failure could be any of a number of things:
- Kubernetes control plane failure
- Zonal or regional network failure
- Load balancer failure between two components, or simply a misconfiguration of some part of the Kubernetes cluster that hosts the Gloo management server
Given the variety of potential failures, Gloo Platform shouldn’t be responsible for determining which management server is the primary server. To solve this problem, Gloo Platform requires a DNS record that resolves to the currently active management server. This provides the most flexibility for different clouds and hosting topologies.
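To illustrate, a failover with this model amounts to repointing that DNS record. The sketch below uses AWS Route 53; the hosted zone ID, record name, and target load balancer address are all placeholders, and any DNS provider with a programmable API would work the same way.

```shell
# Hypothetical failover step: repoint the shared management-server
# hostname at the standby cluster's load balancer. Agents reconnect to
# whatever the record resolves to; a low TTL keeps failover fast.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0000000EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "gloo-mgmt.example.com",
        "Type": "CNAME",
        "TTL": 30,
        "ResourceRecords": [
          {"Value": "standby-lb.us-west-2.elb.amazonaws.com"}
        ]
      }
    }]
  }'
```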
To support configuration consistency during or after a failover, all Kubernetes resources must be written to all Kubernetes clusters that host a management server. This could be done with simple CI pipelines that write to passive and active management clusters in serial, or using GitOps tooling such as ArgoCD or Flux that implement a pull model to keep Kubernetes resources in sync across cluster fleets.
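The simple CI approach mentioned above can be sketched in a few lines. The kubeconfig context names and manifest directory are placeholders for your environment.

```shell
# Apply the same Gloo resources to the active and passive management
# clusters in serial, failing fast if either write does not succeed.
set -euo pipefail

for ctx in mgmt-active mgmt-passive; do
  kubectl --context "${ctx}" apply -f gloo-policies/
done
```

A GitOps controller such as ArgoCD or Flux achieves the same end state by having each management cluster pull from the same Git repository, which removes the CI job as a single point of failure.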
The Gloo Platform management server uses Redis to store processed configurations for all connected agents from workload clusters. We recommend using a shared Redis datastore for all management servers to prevent configuration drift and reduce the time it takes for a newly active management server to assume control and be able to accept new configuration changes.
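A hedged sketch of the Helm values involved: both management servers disable their in-cluster Redis and point at one external endpoint. The exact value names vary by Gloo Platform version, so treat these keys as illustrative and consult the Helm reference for your release.

```yaml
# Illustrative Helm values for a shared external Redis (key names are
# assumptions; the endpoint is a placeholder).
redis:
  deployment:
    enabled: false                        # disable the bundled Redis
  address: redis.shared.example.com:6379  # shared datastore for all
                                          # management servers
```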
When shared Redis is used, all management servers have a complete and identical view of the configuration state of all workload clusters in the data plane. At the time of failover, the newly active management server will honor configurations in place even if all agents have not yet reconnected — this means no configuration drift throughout the recovery process. The new management server will also have a greatly reduced warmup time, since it does not need to re-process all configurations.
Multi-Region AWS Demo
Check out Solo.io’s AWS multi-region demo on YouTube! In the demo we deploy multiple workload and management servers to two regions and show how Gloo Platform handles workload and whole-region failures in AWS.