Multicluster Networking in Azure with Gloo Mesh: A Reference Architecture

Among the advantages of running in a public cloud is the ready access to multiple geographic compute zones. By deploying your applications closer to where your data and customers reside, you can reduce latency, traffic, and costs while improving the user experience.

However, these advantages come with increased complexity and the responsibility of ensuring secure communication between clusters, effectively creating and maintaining a multicluster network.

Large cloud providers like Microsoft Azure make it easy to deploy large, complex, multi-region Kubernetes clusters. This is extensively described in the AKS baseline for multi-region clusters reference architecture, which leverages important design patterns like the hub-and-spoke networking model as well as a repeatable, modular infrastructure for regional deployments. Currently, Azure is present in 59 regions spanning a total of 113 availability zones. Regional clusters can thus achieve further high availability by spreading worker nodes across multiple availability zones, providing multiple layers of resilience in case of regional and/or zonal failure.

This blog post gives a high-level architecture review of how a service mesh across a multicluster network can be accomplished using Gloo Mesh’s management plane, Gloo Gateway, and Microsoft Azure. We will explore how to augment the original reference architecture Microsoft recommends into a more advanced design that supports cross-cluster service communication, instant failover, and unified monitoring and management, using an Istio service mesh managed by Gloo, consisting of Gloo Gateway and Gloo Mesh.

Adopting A Service Mesh

Adopting a service mesh has multiple advantages, chief among them:

Security

  • Automatic traffic encryption with mTLS and strong workload identity
  • SPIRE/SPIFFE integration
  • Configuring a defensive network of least-privilege connections
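
As a sketch of what that least-privilege posture can look like in practice, the following Istio resources (namespace, service, and service-account names here are hypothetical) enforce strict mTLS for a namespace and allow only one named client identity to reach its workloads:

```yaml
# Require mTLS for every workload in the "orders" namespace (hypothetical name)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: strict-mtls
  namespace: orders
spec:
  mtls:
    mode: STRICT
---
# Allow only the "checkout" service account to call workloads in "orders";
# all other mesh identities are denied by the implicit default
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: orders-least-privilege
  namespace: orders
spec:
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/checkout/sa/checkout"]
```

Because the identity in `principals` is the workload’s SPIFFE identity derived from its service account, the policy holds even as pods and IPs churn.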

Traffic Control

  • Progressive traffic shifting and canary deployments
  • Integration with progressive delivery tools like Argo Rollouts
  • Ability to mirror/split/shift traffic
  • Retry/redirect logic built into the infrastructure instead of into each application
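
For example, a canary rollout using Istio’s native traffic-shifting primitives might look like the following (the `reviews` service and its subsets are illustrative placeholders):

```yaml
# Define the two versions of the service as routable subsets
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
---
# Send 90% of traffic to the stable version and 10% to the canary
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 90
    - destination:
        host: reviews
        subset: v2
      weight: 10
```

Tools like Argo Rollouts adjust these weights automatically as the canary proves healthy, which is exactly the kind of logic that would otherwise live inside application code.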

Observability

  • OpenTelemetry integration
  • Envoy metrics out of the box
  • Easy access to tracing requests across multiple services
  • Observability controlled and generated at the configuration level, not from within applications

Additionally, having an integrated API gateway as the point of ingress for north/south traffic in the cluster(s) allows full, programmatic control over the cluster and traffic configuration through native Kubernetes Custom Resource Definitions (CRDs).
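
As an illustration of that programmatic control, a north/south route can be declared with standard Kubernetes Gateway API resources, which a gateway implementation such as Gloo Gateway can act on (the hostname, service, and namespace names below are placeholders):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: store-route
  namespace: store
spec:
  parentRefs:
  - name: ingress-gateway        # the Gateway resource managed by the API gateway
    namespace: gateway-system
  hostnames:
  - "store.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api
    backendRefs:
    - name: store-backend        # in-cluster Service receiving the traffic
      port: 8080
```

Because routes like this are just Kubernetes resources, they can be versioned, reviewed, and rolled out through the same GitOps pipeline as the applications themselves.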

Cluster Deployment

In this architecture we won’t consider the geographical availability of container registries; we refer readers to the AKS baseline architecture for further explanation. In general, multi-region cluster deployment follows a simple pattern that is an evolution of a single-cluster adoption of Kubernetes.

The simple way of looking at it is this:

  • Each desired region has a load balancer and an AKS cluster with a gateway deployed
  • Azure Front Door handles edge traffic and routes to an appropriate region
  • Azure Load Balancers route traffic within each region to the gateway on the appropriate AKS cluster

The above is what constitutes a “spoke” in the proposed hub and spoke model (see below).

The “hub,” where centralized observability, management, and control are executed, is an AKS cluster that has network routes to the rest of the “spokes.” On this cluster live:

  • The Gloo Mesh management server
  • The OpenTelemetry Gateway, which collects data from the rest of the OTel collectors
  • Optionally, any other centralized systems like ArgoCD/FluxCD
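
As a sketch, the hub’s OpenTelemetry Gateway can be an ordinary OpenTelemetry Collector whose pipeline receives OTLP from the spoke collectors and forwards to Azure Monitor (the `azuremonitor` exporter ships in the Collector’s contrib distribution; the connection string below is a placeholder):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # spoke collectors ship OTLP here
processors:
  batch: {}                      # batch telemetry before export
exporters:
  azuremonitor:
    connection_string: "${APPLICATIONINSIGHTS_CONNECTION_STRING}"
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [azuremonitor]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [azuremonitor]
```

The spoke collectors only need the hub gateway’s OTLP endpoint; credentials for the monitoring backend stay on the hub.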

Think of the hub cluster as the brains of the operation. No production traffic is routed through it, so it can be deployed in any region as long as it is connected to the spokes. It is not a single point of failure for production traffic, as it sits outside the traffic path for workloads. It is critical in the sense that it needs to be operating, but not in the sense that an interruption here means an outage for your company.

Learn how OpenTelemetry Drives Gloo Platform’s Graph.

Azure Components

Azure offers a number of powerful components that we will leverage to complete this architecture.

Gloo Components

  • Gloo Mesh – This is the management plane that sits above all other clusters and controls their configuration, collects information, and executes policy.
  • Gloo Gateway – This is a modern API gateway that enables our infrastructure to handle more demanding tasks than a standard ingress can. Among them: OAuth, rate limiting, API portals with usage plans and account limits, WAF, and more.

Network Architecture

We deviate from the baseline by adopting the Istio service mesh for ingress/egress gateways (using Gloo Gateway) and for inter-cluster communication (using Gloo Mesh, which leverages the multicluster capabilities of Istio and its east/west gateways). This has multiple advantages over the architecture based on application gateways and Azure Firewalls:

  • The whole architecture can be expressed declaratively as a series of native Kubernetes resources. Helm is expected and welcomed.
  • Hub/Spoke architecture works well for cloud native applications and especially if combined with the multicluster networking capabilities of Istio Service Mesh. This model also allows for easy growth and failover (add and remove spokes as needed).
  • Consistency can be enforced across multiple clusters by leveraging the Gloo Mesh agent/server architecture.
  • True end-to-end mTLS encryption can be achieved, with certificates automatically managed, renewed, and distributed by the Gloo Mesh server, which in turn can use Azure Keyvault to store the Certificate Authority (synchronized to a Kubernetes secret using the external secret operator). There’s no need to re-encrypt traffic at the edge and manage two separate sources of certificates.
  • Failover can occur even in the presence of a partial region outage. If only a few services are unavailable in any given cluster, Istio will make sure that the remaining clients can talk to remote versions of those services, thanks to multi-cluster networking via east/west gateways.
  • Progressive delivery of services is possible via native Istio traffic-shifting features, giving you the choice of upgrading services and clusters in place or treating entire clusters as cattle: spinning up “immutable” clusters and their services every time a service undergoes an upgrade. The latter is a safer path to upgrades and allows for A/B and blue/green application deployments across clusters.
  • OpenTelemetry across clusters and regions is handled by the OTel collectors on each cluster, which collect and ship data to the centralized OpenTelemetry Gateway. The Gateway then exports to monitoring systems (in this case, Azure Monitor).
  • The Gloo Management Plane also improves visibility of the metrics provided with a single pane of glass to observe all clusters, overall cluster health / versions, configuration, etc. This becomes a central place to get the overall status of the entire service mesh and the clusters it’s running on.
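
As a sketch of the Key Vault integration mentioned above, an ExternalSecret from the External Secrets Operator can sync the root CA material into the Kubernetes secret the management plane reads (store, key, and secret names here are hypothetical):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: istio-root-ca
  namespace: gloo-mesh
spec:
  refreshInterval: 1h            # re-sync so rotations in Key Vault propagate
  secretStoreRef:
    kind: ClusterSecretStore
    name: azure-keyvault         # a store configured against Azure Key Vault
  target:
    name: root-ca-secret         # Kubernetes secret consumed by the management plane
  data:
  - secretKey: tls.crt
    remoteRef:
      key: root-ca-cert          # CA certificate entry in Key Vault
  - secretKey: tls.key
    remoteRef:
      key: root-ca-key           # CA private key entry in Key Vault
```

With this in place, Key Vault remains the single source of truth for the CA while the mesh consumes it as an ordinary secret.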

Read more about reference architecture for cross-cluster routing using API gateways and service mesh.

Where Gloo Platform Fits in vs. OSS Istio Multicluster vs. Azure Istio

The more experienced users of Istio may be wondering how Gloo Mesh differs from running a multicluster OSS Istio implementation, or even from trying to string together multiple clusters running Azure’s Istio offering. Solo is a huge contributor to OSS Istio: Louis Ryan (a cofounder of Istio), John Howard (the largest single contributor), and Lin Sun (a long-time Technical Oversight Committee member) all work for Solo. It should be understood that Istio’s clear focus is on being a great service mesh first. Topologies like multi-primary or primary-remote clusters are very complex to operate and are ultimately a merging of control planes, not a management plane that exists to handle multiple control planes. When control planes are multiplied across numerous regions (using OSS Istio), keeping them synced can become a costly, crippling bottleneck.

Gloo Mesh’s management plane sidesteps this issue by controlling policy across all clusters, applying policy and allowing traffic only where needed. In addition, Gloo Mesh extends Istio custom resources with multicluster policies that are applied to many clusters at once instead of requiring a 1:1 configuration-to-cluster ratio. This saves Gloo Mesh users from applying policies per cluster, which quickly becomes a significant source of toil.
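
As a rough illustration of that one-to-many model, a single Gloo Mesh resource such as a VirtualDestination can expose a service across every registered cluster at once (API versions and field names vary between Gloo Mesh releases, and the names below are illustrative):

```yaml
apiVersion: networking.gloo.solo.io/v2
kind: VirtualDestination
metadata:
  name: reviews-global
  namespace: bookinfo
spec:
  hosts:
  - reviews.global             # mesh-internal hostname resolvable from all clusters
  services:
  - labels:
      app: reviews             # matches the service in every cluster where it runs
  ports:
  - number: 9080
    protocol: HTTP
```

Clients simply call `reviews.global`; if the local instance disappears, traffic is routed over the east/west gateways to a healthy instance in another cluster, with no per-cluster configuration to keep in sync.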

These are just a few benefits of Gloo Mesh that are specific to multicluster setups. Beyond this, there are features that benefit any Istio user (FIPS compliance, Istio Lifecycle Manager, Insights, etc.), along with support from a company that is driving the underlying OSS Istio technology forward.

Azure Istio is a fantastic way to get started with a simple single-cluster service mesh. It simplifies some of the complex aspects of OSS Istio and helps with integration into Azure’s components. However, it is restrictive and adds no benefits for multicluster/multi-region architectures; in fact, some of its restrictions make them more challenging. If you find OSS Istio too daunting, Azure Istio is worth a look. However, the larger arc of OSS Istio has been toward simplifying operations, and Istio Ambient Mode negates nearly all of Azure Istio’s advantages.

Integrating Azure and Gloo Mesh

Azure has become a mainstay among enterprises as one of the principal hyperscale cloud providers since its inception in 2008. It provides every piece of infrastructure needed to build resilient, cloud native applications, and we think that with the addition of a service mesh and API gateways (as demonstrated by the AKS team’s recent pivot toward Istio), it can prove to be an even better platform for building dynamic, secure, and scalable applications. Azure’s native components complement and integrate extremely well with Gloo Mesh to make a very powerful solution.

Solo already has customers running Gloo Mesh on Azure at scale in production, and we are looking forward to further integration with Azure’s ecosystem. If you have any questions, please reach out to Solo or join our Slack.