Building a Control Plane for Envoy

In this series of blog posts, we will share our experience of building Gloo, a multi-purpose control plane for the Envoy proxy. This first post focuses on Envoy’s design and on the technical architecture decisions we needed to make while building the first layer of the control plane.

The Envoy proxy was originally designed by the Lyft engineering team as a universal data plane. The strength of Envoy stems from a combination of performance, extensibility and dynamic configuration. This power, however, comes at the price of increased configuration complexity. In fact, Envoy configuration is meant to be machine-generated by a management layer, often called a ‘control plane’.

Writing a control plane for Envoy is not an easy task, as described in a recent post from the creators of Ambassador, which details their repeated attempts to build one. In the posts that follow, we will share our own experience, our design choices, and our implementation considerations, which allowed us to harness the power of Envoy.

How does Envoy work?

Before we talk about building a control plane, let us briefly recall how Envoy works as an edge proxy.

When a request arrives at the cluster, Envoy can do many things with it, including routing, checking for permissions, load balancing and much more. To know what to do, Envoy needs to be configured, and providing that configuration is the role of the control plane.

The first thing Envoy does with a request is to determine its destination. To do this, it uses the virtual host (host or :authority header) and a route table. The request is then sent through a sequence of filters, known as the filter chain. Each filter in the chain can do different things, including modifying the request, holding the request until some event happens, rejecting the request, and more. In some cases, a filter needs to get information from an external server in order to perform its task. The last filter in the chain is always the router filter, which sends the request upstream. An example of a path taken by a request as it goes through Envoy is depicted in the figure below.
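The request path described above can be modeled in a few lines of Python. This is a minimal sketch for illustration only, not Envoy code: the route table, the host name, the cluster names and the two example filters (require_token, add_request_id) are all made up, and real filters are far richer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Request:
    authority: str  # value of the host / :authority header
    path: str
    headers: Dict[str, str] = field(default_factory=dict)
    cluster: Optional[str] = None  # filled in by route matching


@dataclass
class Response:
    status: int
    body: str = ""


# A filter returns a Response to stop the chain, or None to continue.
Filter = Callable[[Request], Optional[Response]]

# Hypothetical route table: virtual host -> ordered (path prefix, cluster).
ROUTE_TABLE: Dict[str, List[Tuple[str, str]]] = {
    "api.example.com": [("/users", "users-cluster"), ("/", "default-cluster")],
}


def match_route(req: Request) -> None:
    """Pick the upstream cluster from the virtual host and the path."""
    for prefix, cluster in ROUTE_TABLE.get(req.authority, []):
        if req.path.startswith(prefix):
            req.cluster = cluster
            return


def require_token(req: Request) -> Optional[Response]:
    """Reject requests that carry no authorization header."""
    if "authorization" not in req.headers:
        return Response(403, "forbidden")
    return None


def add_request_id(req: Request) -> Optional[Response]:
    """Modify the request, then let it continue down the chain."""
    req.headers.setdefault("x-request-id", "req-1")
    return None


def router(req: Request) -> Optional[Response]:
    """Last filter: stands in for sending the request upstream."""
    if req.cluster is None:
        return Response(404, "no route")
    return Response(200, f"forwarded to {req.cluster}")


FILTER_CHAIN: List[Filter] = [require_token, add_request_id, router]


def handle(req: Request) -> Response:
    match_route(req)
    for http_filter in FILTER_CHAIN:
        resp = http_filter(req)
        if resp is not None:
            return resp
    raise RuntimeError("the router filter must end the chain")
```

Any filter can end processing early by returning a response, which is why the order of the chain matters and why the router always comes last.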

On the way back, a response can also go through a filter chain before it is sent back from the cluster. Once again, these filters may or may not interact with external servers, as needed.

Setting up the filter chain and configuring the different filters and their external servers are among the jobs of the control plane. This configuration information is sent to Envoy using the xDS API. It is the responsibility of the control plane to keep Envoy up to date and to inform it whenever its configuration needs to change. Since in distributed systems “there is no now”, Envoy follows an eventually consistent model to handle configuration changes.

All the configuration information is sent from the control plane to Envoy asynchronously. This means that when a request arrives, it finds Envoy already configured and does not need to wait for the control plane to intervene. Thus, the control plane adds no latency to the request path.
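One way to picture this separation is a proxy that serves every request from its current configuration snapshot, while updates replace that snapshot on the side. The sketch below is only an illustration of the idea under that assumption. The ProxyConfig class and its methods are hypothetical and do not reflect how Envoy is implemented internally.

```python
from typing import Dict, Optional


class ProxyConfig:
    """Holds the routing snapshot that request handling reads from."""

    def __init__(self, routes: Dict[str, str]) -> None:
        self._routes = dict(routes)
        self.version = 0

    def apply_update(self, routes: Dict[str, str]) -> None:
        # Control plane side: build the new snapshot fully, then swap the
        # reference in one step so readers never see a half-applied update.
        new_routes = dict(routes)
        self._routes = new_routes
        self.version += 1

    def cluster_for(self, authority: str) -> Optional[str]:
        # Request side: a plain lookup in whichever snapshot is current.
        # It never calls out to the control plane.
        return self._routes.get(authority)
```

A request that arrives just before an update is served with the old snapshot and one that arrives just after is served with the new one, which is the eventually consistent behavior described above.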

On the other hand, the filters and the external servers they use do sit on the data path and may add latency. Caching can be implemented to mitigate such potential issues, and the control plane can set timeouts for these services. Tracing can then be used to diagnose problems caused by filters and servers.
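To make the caching and timeout idea concrete, here is a small sketch of a time-limited cache in front of an external check. Everything in it is an assumption made for illustration: the CachedCheck class, the choice to deny a request when the external server times out, and the decision not to cache failures. It is not how any particular Envoy filter behaves.

```python
import time
from typing import Callable, Dict, Tuple


class CachedCheck:
    """Caches the verdict of an external check for a limited time."""

    def __init__(
        self,
        check: Callable[[str], bool],  # may raise TimeoutError
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._check = check
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, bool]] = {}  # key -> (expiry, verdict)

    def allowed(self, key: str) -> bool:
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # fresh entry: no round trip to the external server
        try:
            verdict = self._check(key)
        except TimeoutError:
            return False  # assumption: deny on timeout, and do not cache it
        self._cache[key] = (now + self._ttl, verdict)
        return verdict


def always_times_out(key: str) -> bool:
    """Stand-in for an external server that does not answer in time."""
    raise TimeoutError("external server too slow")
```

With a cache like this, only the first request for a given key within the time-to-live pays for the round trip to the external server.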

Why Envoy?

The many strong qualities of Envoy have been described in detail before (for example, here and here). Of these, the following were the most relevant for us:

Envoy is very extensible, making it easy for us to extend it for new use cases (more on that later).
Despite the fast pace at which Envoy is moving, it is a very stable piece of software, with unit and integration tests covering more than 98% of the code (enforced by CI).
Last but not least, Envoy has strong backing from the community. It is the driving force behind several successful service mesh projects, including Istio, and enjoys contributions from many key players in the ecosystem, such as Google, Pivotal, Red Hat and IBM. The Envoy community enjoys a truly collaborative spirit, which makes contributing to the project a rewarding and enjoyable experience.

Install and Configuration

Now that we understand how requests and responses are handled by Envoy and have identified the roles of the control plane, we are ready to look at the design of a control plane. Below we describe the choices involved in this process, and explain the choices we’ve made in designing Gloo.

The first choice to make is how to deploy Envoy. Envoy can either be deployed in a VM or as a container. Both options have their advantages, and Gloo supports both, as well as multiple container management platforms. For clarity, we focus here only on running Envoy as a Kubernetes-managed container.

In Kubernetes, Envoy typically runs as either a Deployment (which runs a specified number of pods and allows scaling them up and down) or a DaemonSet (which runs exactly one pod per cluster node). The official Envoy container can be found here.

For Gloo, we chose to use a Kubernetes Deployment for running Envoy in the cluster. To expose Envoy to the outside world, we use a Kubernetes LoadBalancer service. This service provisions a cloud-specific external load balancer that forwards traffic to the Envoy instance.
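As a rough illustration of this setup, the sketch below builds the two Kubernetes objects as plain Python dictionaries and prints them as JSON, which kubectl accepts as well as YAML. The object name gateway-proxy, the labels, the ports and the image reference are placeholders invented for this example. They are not Gloo’s actual manifests.

```python
import json
from typing import Dict

# Placeholders for illustration only; substitute a real image reference.
ENVOY_IMAGE = "registry.example.com/envoy:placeholder"
LABELS: Dict[str, str] = {"app": "gateway-proxy"}


def envoy_deployment(replicas: int) -> dict:
    """A Deployment that runs `replicas` Envoy pods."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "gateway-proxy", "labels": dict(LABELS)},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": dict(LABELS)},
            "template": {
                "metadata": {"labels": dict(LABELS)},
                "spec": {
                    "containers": [
                        {
                            "name": "envoy",
                            "image": ENVOY_IMAGE,
                            "ports": [{"containerPort": 8080}],
                        }
                    ]
                },
            },
        },
    }


def envoy_service(port: int, target_port: int) -> dict:
    """A LoadBalancer Service that exposes the Envoy pods externally."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": "gateway-proxy"},
        "spec": {
            "type": "LoadBalancer",
            "selector": dict(LABELS),
            "ports": [{"port": port, "targetPort": target_port}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(envoy_deployment(replicas=2), indent=2))
    print(json.dumps(envoy_service(port=80, target_port=8080), indent=2))
```

The Service selector has to match the pod labels in the Deployment template. That match is what lets the load balancer find the Envoy pods.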

While Envoy is usually configured via the control plane, it does need some initial configuration, called bootstrap configuration in the Envoy lingo. This configuration contains information on how to connect to the control plane (management server), the identity of the current Envoy instance, tracing, admin interface configuration, and static resources (see more on Envoy resources in the management section below).

To provide Envoy with its bootstrap configuration, we use a Kubernetes ConfigMap object, the common way to distribute configuration information in Kubernetes. We mount our ConfigMap into the Envoy pod as a file, which decouples the bootstrap configuration file from the container. In a Kubernetes environment, we use a template for the configuration and generate a unique configuration file for each Envoy instance, which contains its specific identity. Having an instance-specific ID is helpful for observability.
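The templating step might look roughly like the sketch below, which fills a per-instance node ID and the management server address into a JSON bootstrap. The field names follow the general shape of Envoy’s bootstrap configuration as we understand it, but they differ between API versions, so treat them as an approximation to check against the Envoy documentation. The cluster name xds_cluster, the admin port, the host name and the node IDs are invented for this example.

```python
import json
from string import Template

# Approximate bootstrap shape; verify field names for your Envoy version.
BOOTSTRAP_TEMPLATE = Template("""
{
  "node": {"id": "$node_id", "cluster": "$node_cluster"},
  "admin": {
    "address": {"socket_address": {"address": "127.0.0.1", "port_value": 19000}}
  },
  "dynamic_resources": {
    "ads_config": {
      "api_type": "GRPC",
      "grpc_services": [{"envoy_grpc": {"cluster_name": "xds_cluster"}}]
    },
    "lds_config": {"ads": {}},
    "cds_config": {"ads": {}}
  },
  "static_resources": {
    "clusters": [{
      "name": "xds_cluster",
      "type": "STRICT_DNS",
      "connect_timeout": "5s",
      "http2_protocol_options": {},
      "load_assignment": {
        "cluster_name": "xds_cluster",
        "endpoints": [{"lb_endpoints": [{"endpoint": {"address": {
          "socket_address": {"address": "$xds_host", "port_value": $xds_port}
        }}}]}]
      }
    }]
  }
}
""")


def render_bootstrap(node_id: str, xds_host: str, xds_port: int,
                     node_cluster: str = "gateway") -> dict:
    """Fill in the template and parse it to make sure it is valid JSON."""
    text = BOOTSTRAP_TEMPLATE.substitute(
        node_id=node_id,
        node_cluster=node_cluster,
        xds_host=xds_host,
        xds_port=xds_port,
    )
    return json.loads(text)


if __name__ == "__main__":
    # In Kubernetes, the pod name is a natural source for a unique node ID.
    config = render_bootstrap("gateway-proxy-7d9f", "xds.example.svc", 9977)
    print(json.dumps(config, indent=2))
```

The only static resource here is the cluster that points at the management server. Everything else arrives later over xDS.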

Once Envoy starts up, it connects to the management server, as specified in the bootstrap config, to complete its configuration and update it continuously.

Management

Envoy can be configured dynamically in real time without any downtime. To achieve this, Envoy defines a set of APIs commonly known as the xDS protocol. Starting with v2 of the Envoy API, xDS runs over a streaming gRPC channel on which Envoy watches for configuration updates from the control plane. Most aspects of Envoy can be configured this way. These dynamic resources include:

Listener discovery service: configures the ports Envoy listens on, and the action to take on incoming connections.
Cluster discovery service: configures the upstream clusters. Envoy will route incoming connections and requests to these clusters.
Route discovery service: configures L7 routes for incoming requests.
Endpoint discovery service: allows Envoy to dynamically discover cluster membership and health information.
Secret discovery service: allows Envoy to discover SSL secrets. This makes it possible to configure SSL secrets independently of the listener, and to provide them from the local node instead of a centralized control plane.

Gloo implements the aggregated discovery service, known as ADS. ADS aggregates all the xDS resources into one streaming channel. We chose this option because it is simple to use and requires only a single connection from each Envoy instance to the management server.
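The idea of a single aggregated channel can be sketched with an in-memory stand-in for the gRPC stream. This is a simplification made for illustration: in the real protocol Envoy subscribes by sending discovery requests and the server answers them, whereas here the server simply pushes. The class names, the short resource type labels and the integer versions are all hypothetical.

```python
from typing import Dict, List, Tuple

RESOURCE_TYPES = ("listeners", "clusters", "routes", "endpoints", "secrets")


class AggregatedStream:
    """Stand-in for the one gRPC stream between an Envoy and the server."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, int, List[str]]] = []

    def send(self, resource_type: str, version: int, resources: List[str]) -> None:
        self.sent.append((resource_type, version, resources))


class SnapshotServer:
    """Keeps a versioned snapshot per resource type and fans out updates."""

    def __init__(self) -> None:
        self._snapshot: Dict[str, Tuple[int, List[str]]] = {
            t: (0, []) for t in RESOURCE_TYPES
        }
        self._streams: List[AggregatedStream] = []

    def connect(self, stream: AggregatedStream) -> None:
        # A newly connected proxy receives the current state of every
        # resource type over the same stream.
        self._streams.append(stream)
        for resource_type in RESOURCE_TYPES:
            version, resources = self._snapshot[resource_type]
            stream.send(resource_type, version, list(resources))

    def update(self, resource_type: str, resources: List[str]) -> None:
        # Each change bumps that type's version and is pushed to all streams.
        version = self._snapshot[resource_type][0] + 1
        self._snapshot[resource_type] = (version, list(resources))
        for stream in self._streams:
            stream.send(resource_type, version, list(resources))
```

Because every resource type travels over the same stream, the management server needs to track only one connection per Envoy instance.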

Observability & troubleshooting

Envoy offers many stats, including one (connected_state) that indicates whether it is connected to its management server. These stats can be scraped by Prometheus, or sent to statsd. Envoy can also emit Zipkin traces (and others, like Datadog and OpenTracing). In the Enterprise version of Gloo we take advantage of these capabilities and include Prometheus and Grafana deployments with pre-configured dashboards.
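As a small illustration, the following sketch parses plain-text stats of the "name: value" form, as served by Envoy’s admin interface, and checks the connection gauge. It assumes the full stat name is control_plane.connected_state and that a value of 1 means connected, which should be verified for the Envoy version in use. The sample input is made up.

```python
from typing import Dict


def parse_stats(text: str) -> Dict[str, int]:
    """Parse 'name: value' lines, keeping only integer-valued stats."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        # Split on the last ': ' since stat names may contain colons.
        name, sep, value = line.rpartition(": ")
        if not sep:
            continue
        try:
            stats[name.strip()] = int(value.strip())
        except ValueError:
            continue  # skip histograms and other non-integer lines
    return stats


def is_connected(stats: Dict[str, int]) -> bool:
    """True when the proxy reports a live management server connection."""
    return stats.get("control_plane.connected_state") == 1


if __name__ == "__main__":
    sample = "control_plane.connected_state: 1\nserver.uptime: 120\n"
    print(is_connected(parse_stats(sample)))
```

A check like this could feed an alert that fires when an Envoy instance stays disconnected from its management server for too long.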

The use of a gRPC API (as compared with the REST API, available in previous versions of Envoy) has many advantages but one shortcoming: it is harder to manually query for debugging purposes. To address this issue, we developed a utility that queries the management server and prints out the xDS configuration as it is presented to Envoy.

When dealing with complex configuration, mistakes can happen. When Envoy is provided with an invalid configuration, it notifies the management server by sending a NACK in the xDS protocol, informing the management server of an error state (which may be temporary, as Envoy follows an eventually consistent configuration model). In our Enterprise version of Gloo, we detect these NACKs and expose them as a metric.
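A minimal sketch of NACK detection on the management server side is shown below. It relies on our understanding of the xDS protocol, in which a discovery request that carries a response nonce together with an error detail is a NACK, while one with a nonce and no error is an ACK. The dataclass is a simplified stand-in for the protobuf message, and the NackCounter class is hypothetical rather than Gloo’s implementation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiscoveryRequest:
    """Simplified stand-in for the xDS DiscoveryRequest message."""

    type_url: str
    version_info: str  # last version the proxy accepted
    response_nonce: str  # empty on the first request of a stream
    error_detail: Optional[str] = None


def classify(req: DiscoveryRequest) -> str:
    if req.response_nonce == "":
        return "initial"  # not a reply to any response we sent
    if req.error_detail:
        return "nack"  # the proxy rejected the last response
    return "ack"


class NackCounter:
    """Counts NACKs per resource type so they can be exported as a metric."""

    def __init__(self) -> None:
        self.nacks: Counter = Counter()

    def observe(self, req: DiscoveryRequest) -> str:
        kind = classify(req)
        if kind == "nack":
            self.nacks[req.type_url] += 1
        return kind
```

After a NACK, the proxy keeps running with the last configuration it accepted, so the counter signals a rejected update rather than an outage.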

Extensibility

Many aspects of Envoy’s behavior can be extended by adding new filters to the filter chain, as described above. These filters can modify the request, impact routing, and emit metrics. Some interesting Envoy filters include:

Ratelimit: rate limits requests. Once the rate is over the limit, Envoy does not pass requests upstream and instead returns a 429 response to the caller.
Gzip: supports gzip encoding, compressing responses on the fly.
Buffer: buffers complete requests before forwarding them upstream.
ExtAuthz: allows configurable authentication and authorization for requests.

The ability to extend Envoy on the data path is very important: it allows processing to be done very fast, with no need to send the request to an additional proxy, thereby reducing latency and increasing overall performance. The control plane can enable or disable these extensions at runtime, ensuring that the data path only includes what’s needed.

Envoy’s extensible design allows us to use upstream Envoy, and extend it by providing a range of filters we developed in-house. Some of the filters we developed include:

AWS: allows calling AWS Lambda functions, using AWS v4 signatures for authentication.
NATS Streaming: converts HTTP requests to NATS Streaming publish operations.
Transformation: performs advanced request body and header transformations.

When we designed Gloo we sought to make it highly extensible, sharing the same design principle as Envoy. To this end, we based the architecture of Gloo on plugins. Typically, Gloo provides a plugin for each Envoy filter, tasked with configuring that filter. This allows us to rapidly support new Envoy extensions as soon as they become available.
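The plugin idea can be sketched as a registry in which each plugin knows how to configure exactly one filter and is skipped when its filter is not needed. This is a simplified illustration, not Gloo’s plugin interface: the register decorator, the settings keys and the short filter names are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

# A plugin turns user-facing settings into the config for one filter, or
# returns None when that filter is not needed.
Plugin = Callable[[dict], Optional[dict]]

PLUGINS: Dict[str, Plugin] = {}


def register(filter_name: str) -> Callable[[Plugin], Plugin]:
    """Register a plugin as responsible for the named filter."""

    def decorator(plugin: Plugin) -> Plugin:
        PLUGINS[filter_name] = plugin
        return plugin

    return decorator


@register("ratelimit")
def ratelimit_plugin(settings: dict) -> Optional[dict]:
    limit = settings.get("requests_per_second")
    if limit is None:
        return None
    return {"requests_per_second": limit}


@register("transformation")
def transformation_plugin(settings: dict) -> Optional[dict]:
    headers = settings.get("add_headers")
    if not headers:
        return None
    return {"add_headers": dict(headers)}


def build_filter_chain(settings: dict) -> List[dict]:
    """Ask every plugin for its filter config; the router always comes last."""
    chain: List[dict] = []
    for name, plugin in PLUGINS.items():
        config = plugin(settings)
        if config is not None:
            chain.append({"name": name, "config": config})
    chain.append({"name": "router", "config": {}})
    return chain
```

Supporting a new filter then means adding one more registered plugin, and a filter whose plugin returns None never appears on the data path.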

Gloo Enterprise also includes a caching filter, as well as external auth and rate-limit servers (covered in a future blog post).

Upgrades & Rollback

As we ship an extended version of Envoy, it is very important for us to stay close to upstream Envoy. To achieve this, our repository structure and CI closely resemble those of Envoy.

Envoy’s master branch is always considered release-candidate (RC) quality. We therefore frequently verify that Gloo can be built with the latest master code. This ensures that we always have the latest Envoy features and optimizations, and can respond quickly if a security update is needed. Building Envoy on the 32-core CI we use takes about 10 minutes.

Even the most carefully designed and well-tested application may face issues when deployed to a complex production environment. To mitigate this risk, it is often advisable to take a canary deployment approach, whereby a new feature is deployed first to only a subset of the users or systems and is deployed broadly only after careful monitoring. With Gloo, we take this approach at several levels. First, we allow new versions of the configuration to be initially deployed to a small fraction of Envoy instances, where we can test the configuration. Second, when a new version of Envoy becomes available, we roll out this new version gradually. Finally, when we’re ready to release a new version of our control plane, we first connect the new version to only a small number of Envoy instances, where we can carefully validate the expected behaviors.
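One simple way to pick the small fraction of Envoy instances is to hash each instance ID into a bucket, so that the same instances stay in the canary group between updates. The sketch below shows that idea only. The function names and the hash-based selection are our assumptions for illustration and are not a description of how Gloo selects canary instances.

```python
import hashlib


def in_canary(node_id: str, percent: int) -> bool:
    """Deterministically place roughly `percent`% of instances in the canary."""
    digest = hashlib.sha256(node_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < percent


def config_version_for(node_id: str, stable: str, canary: str, percent: int) -> str:
    """Return the configuration version this instance should receive."""
    return canary if in_canary(node_id, percent) else stable


if __name__ == "__main__":
    for node in ("envoy-a", "envoy-b", "envoy-c"):
        print(node, config_version_for(node, "v41", "v42", percent=10))
```

Raising the percentage step by step widens the rollout without moving instances that already received the new version back to the old one.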

In the next post of this series, we will look at our design of Gloo as a control plane for Envoy, the architecture choices we made, and the tradeoffs they represent.
