Building a Control Plane for Envoy
In this series of blogs, we will share our experience of building Gloo, a multi-purpose control plane for the Envoy proxy. The first blog in the series will focus on Envoy design, and the technical architecture decisions we needed to make while building the first layer of the control plane.
The Envoy proxy was originally designed by the Lyft engineering team as a universal data plane. The strength of Envoy stems from a combination of performance, extensibility and dynamic configuration. This power, however, comes at the price of increased configuration complexity. In fact, Envoy configuration is meant to be machine-generated by a management layer, often called a ‘control plane’.
Writing a control plane for Envoy is not an easy task, as described in a recent post from Ambassador creators, which details their repeated attempts to build it. In the series of blogs that follows, we will share our own experience, our design choices, and our implementation considerations, which allowed us to harness the power of Envoy.
How does Envoy work?
Before we talk about building a control plane, let us briefly recall how Envoy works as an edge proxy.
When a request arrives at the cluster, Envoy can do many things with it, including routing, checking for permissions, load balancing and much more. To know what to do, Envoy needs to be configured, which is the role of the control plane.
The first thing Envoy does with a request is determine its destination. To do this, it uses the virtual host (host or :authority header) and a route table. The request is then sent through a sequence of filters, known as the filter chain. Each filter in the chain does different things, including modifying the request, holding the request until some event happens, rejecting the request, and more. In some cases, a filter needs to get information from an external server in order to perform its task. The last filter in the chain is always the router filter, which sends the request upstream. An example of a path taken by a request as it goes through Envoy is depicted in the figure below.
On the way back, a response can also go through a filter chain before it is sent back from the cluster. Once again, these filters may or may not interact with external servers, as needed.
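To make the request path concrete, here is a minimal sketch of it in Python. It is a toy model, not Envoy code: the `Request` and `Response` types, the two example filters, the host names and the upstream names are all assumptions made up for illustration. It shows route matching on the virtual host, a chain of filters that can either modify or reject the request, and a final routing step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Request:
    authority: str  # value of the host / :authority header
    path: str
    headers: Dict[str, str] = field(default_factory=dict)


@dataclass
class Response:
    status: int
    body: str = ""


# A filter returns None to let the request continue, or a Response to stop it.
Filter = Callable[[Request], Optional[Response]]


def require_api_key(req: Request) -> Optional[Response]:
    """Rejecting filter (hypothetical): stops the chain if a header is missing."""
    if "x-api-key" not in req.headers:
        return Response(403, "missing api key")
    return None


def add_request_id(req: Request) -> Optional[Response]:
    """Modifying filter (hypothetical): decorates the request and continues."""
    req.headers.setdefault("x-request-id", "req-1")
    return None


def match_route(req: Request, routes: Dict[str, Dict[str, str]]) -> Optional[str]:
    """Pick an upstream from the virtual host, then the first matching path prefix."""
    for prefix, upstream in routes.get(req.authority, {}).items():
        if req.path.startswith(prefix):
            return upstream
    return None


def handle(req: Request, routes: Dict[str, Dict[str, str]], filters: List[Filter]) -> Response:
    upstream = match_route(req, routes)
    if upstream is None:
        return Response(404, "no route")
    for http_filter in filters:
        rejected = http_filter(req)
        if rejected is not None:
            return rejected
    # Stand-in for the router filter, which is always the last step.
    return Response(200, f"sent to {upstream}")


ROUTES = {"api.example.com": {"/users": "users-service", "/": "default-service"}}
FILTERS = [require_api_key, add_request_id]
```

Calling `handle(Request("api.example.com", "/users/1", {"x-api-key": "k"}), ROUTES, FILTERS)` reaches the routing step, while the same request without the header is rejected by the first filter and never goes upstream.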
Setting up the filter chain and configuring the different filters and their external servers are among the jobs of the control plane. This configuration information is sent to Envoy using the xDS API. It is the responsibility of the control plane to make sure that Envoy is always updated, and inform Envoy of any need to alter its configuration. Since in distributed systems “there is no now”, Envoy follows an eventually consistent configuration model to handle configuration changes.
All the configuration information is sent from the control plane to Envoy asynchronously. This means that when a request arrives, it finds Envoy fully configured, and does not need to wait for the control plane to intervene. Thus, the control plane adds no latency to request handling.
On the other hand, the filters and the external servers they use do sit on the data path and may result in latencies. Caching can be implemented to mitigate such potential issues, and the control plane can set timeouts for these services. Tracing can then be used to diagnose problems caused by filters and servers.
- Envoy is very extensible, making it easy for us to extend it for new use cases (more on that later).
- Despite its fast pace of development, Envoy is a very stable piece of software, with unit and integration tests covering more than 98% of the code (enforced by CI).
- Last but not least, Envoy has strong backing from the community. It is the driving force behind several successful service mesh projects, including Istio, and enjoys contributions from many key players in the ecosystem, such as Google, Pivotal, Red Hat and IBM. The Envoy community enjoys a truly collaborative spirit, which makes contributing to the project a rewarding and enjoyable experience.
Install and Configuration
Now that we understand how requests and responses are handled by Envoy and have identified the roles of the control plane, we are ready to look at the design of a control plane. Below we describe the choices involved in this process, and explain the choices we’ve made in designing Gloo.
The first choice to make is how to deploy Envoy. Envoy can either be deployed in a VM or as a container. Both options have their advantages, and Gloo supports both, as well as multiple container management platforms. For clarity, we focus here only on running Envoy as a Kubernetes-managed container.
In Kubernetes, Envoy typically runs as either a Deployment (which allows running a specified number of pods and scaling them up and down) or a DaemonSet (which runs exactly one pod per cluster node). The official Envoy container can be found here.
For Gloo, we chose to use a Kubernetes Deployment for running Envoy in the cluster. To expose Envoy to the outside world, we use a Kubernetes LoadBalancer service. This service provisions a cloud-specific external load balancer that forwards traffic to the Envoy instance.
While Envoy is usually configured via the control plane, it does need some initial configuration, called bootstrap configuration in the Envoy lingo. This configuration contains information on how to connect to the control plane (management server), the identity of the current Envoy instance, tracing, admin interface configuration, and static resources (see more on Envoy resources in the management section below).
To provide Envoy with its bootstrap configuration, we use a Kubernetes ConfigMap object, the common way to distribute configuration information in Kubernetes. We mount our ConfigMap to the Envoy pod as a file, which decouples the bootstrap configuration file from the container. In a Kubernetes environment, we use a template for the configuration and generate a unique configuration file for each Envoy instance, which contains its specific identity. Having an instance-specific ID is helpful for observability.
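As an illustration of the template idea, the sketch below renders a per-instance bootstrap document in Python and serializes it as JSON, which Envoy accepts alongside YAML. This is not Gloo's actual template. The field names follow the Envoy bootstrap schema as we understand it and should be checked against the Envoy version you run; the node ID, cluster names, management server host, and ports are placeholder assumptions.

```python
import json


def render_bootstrap(node_id: str, xds_host: str, xds_port: int) -> dict:
    """Build a bootstrap config whose only per-instance part is the node identity."""
    return {
        # Identity of this Envoy instance, reported to the management server.
        "node": {"id": node_id, "cluster": "gateway"},
        # Local admin interface (placeholder address and port).
        "admin": {
            "address": {"socket_address": {"address": "127.0.0.1", "port_value": 19000}}
        },
        # Fetch listeners and clusters dynamically over one ADS gRPC stream.
        "dynamic_resources": {
            "ads_config": {
                "api_type": "GRPC",
                "grpc_services": [{"envoy_grpc": {"cluster_name": "xds_cluster"}}],
            },
            "cds_config": {"ads": {}},
            "lds_config": {"ads": {}},
        },
        # The management server itself must be a static resource.
        "static_resources": {
            "clusters": [
                {
                    "name": "xds_cluster",
                    "connect_timeout": "5s",
                    "type": "STRICT_DNS",
                    "http2_protocol_options": {},  # gRPC needs HTTP/2
                    "load_assignment": {
                        "cluster_name": "xds_cluster",
                        "endpoints": [
                            {
                                "lb_endpoints": [
                                    {
                                        "endpoint": {
                                            "address": {
                                                "socket_address": {
                                                    "address": xds_host,
                                                    "port_value": xds_port,
                                                }
                                            }
                                        }
                                    }
                                ]
                            }
                        ],
                    },
                }
            ]
        },
    }


# Hypothetical values: in Kubernetes the ID could come from the pod name.
bootstrap_json = json.dumps(render_bootstrap("envoy-pod-abc", "xds.example.svc", 18000), indent=2)
```

Everything except `node.id` is identical across instances, which is what makes a shared template plus a small per-pod substitution practical.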
Once Envoy starts up, it connects to the management server, as specified in the bootstrap config, to complete its configuration and update it continuously.
Envoy can be configured dynamically in real time without any downtime. To achieve this, Envoy defines a set of APIs commonly known as the xDS protocol. Starting with v2 of the Envoy API, this is a streaming gRPC channel on which Envoy watches for configuration updates from the control plane. Most aspects of Envoy can be configured this way. These dynamic resources include:
- Listener discovery service — configures the ports Envoy listens on, and the action to take on incoming connections.
- Cluster discovery service — configures the upstream clusters. Envoy will route incoming connections/requests to these clusters.
- Route discovery service — configures L7 routes for incoming requests.
- Endpoint discovery service — allows Envoy to dynamically discover cluster membership and health information.
- Secret discovery service — allows Envoy to discover SSL secrets. This is used to configure SSL secrets independently of the listener, and allows SSL secrets to be provided by a local node instead of a centralized control plane.
Gloo implements an aggregated discovery service (ADS) server. ADS aggregates all the xDS resources into one streaming channel. We chose this option as it is simple to use and requires a single connection from the Envoy instance to the management server.
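The following Python sketch models what "one streaming channel" means in practice. It is a toy model rather than Gloo's server: resources are plain strings instead of protobuf messages, the `Snapshot` type and the nonce format are our own assumptions, and the type URLs are the v3 ones (the v2 API used `envoy.api.v2.*` names). The push order shown, clusters and endpoints before listeners and routes, follows the ordering the xDS documentation suggests, as we understand it, so that nothing references a resource that has not arrived yet.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List

# Type URLs distinguish resource kinds that share the single ADS stream.
CLUSTER = "type.googleapis.com/envoy.config.cluster.v3.Cluster"
ENDPOINT = "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment"
LISTENER = "type.googleapis.com/envoy.config.listener.v3.Listener"
ROUTE = "type.googleapis.com/envoy.config.route.v3.RouteConfiguration"

# One stream lets the server sequence updates across resource types.
PUSH_ORDER = [CLUSTER, ENDPOINT, LISTENER, ROUTE]


@dataclass
class DiscoveryResponse:
    type_url: str
    version_info: str
    resources: List[str]  # real responses carry serialized protobufs
    nonce: str


@dataclass
class Snapshot:
    """A consistent, versioned view of all resources for one Envoy (hypothetical type)."""
    version: str
    resources: Dict[str, List[str]]


def ads_responses(snapshot: Snapshot) -> Iterator[DiscoveryResponse]:
    """Yield one response per resource type, all multiplexed on the same stream."""
    for seq, type_url in enumerate(PUSH_ORDER):
        yield DiscoveryResponse(
            type_url=type_url,
            version_info=snapshot.version,
            resources=snapshot.resources.get(type_url, []),
            nonce=f"{snapshot.version}-{seq}",
        )


snapshot = Snapshot("v7", {CLUSTER: ["users-service"], LISTENER: ["http-8080"]})
stream = list(ads_responses(snapshot))
```

With separate per-type streams, possibly served by different management servers, this kind of cross-type ordering is much harder to guarantee.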
Observability & troubleshooting
Envoy offers many stats, including one (connected_state) that indicates if it is connected or not to its management server. These stats can be scraped by Prometheus, or sent to statsd. Envoy can also emit Zipkin traces (and others, like Datadog and OpenTracing). In the Enterprise version of Gloo we take advantage of these capabilities and include Prometheus and Grafana deployments with pre-configured dashboards.
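As a small example of using these stats without a full monitoring stack, the sketch below parses the plain-text `name: value` output of Envoy's admin stats endpoint and checks the connection gauge. We assume the full stat name is `control_plane.connected_state` and that a value of 1 means connected; the sample text is made up for illustration.

```python
from typing import Dict


def parse_stats(text: str) -> Dict[str, int]:
    """Parse 'name: value' lines, keeping only counters and gauges with integer values."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def is_connected(stats: Dict[str, int]) -> bool:
    """True when Envoy reports a live connection to its management server."""
    return stats.get("control_plane.connected_state") == 1


# Made-up sample; histogram lines are not integers and are skipped.
SAMPLE = (
    "control_plane.connected_state: 1\n"
    "server.uptime: 42\n"
    "http.admin.downstream_rq_time: No recorded values\n"
)
```

A check like this is handy in a readiness probe or a quick debugging script, while Prometheus remains the better fit for alerting over time.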
The use of a gRPC API (as compared with the REST API, available in previous versions of Envoy) has many advantages but one shortcoming: it is harder to manually query for debugging purposes. To address this issue, we developed a utility that queries the management server and prints out the xDS configuration as it is presented to Envoy.
When dealing with complex configuration, mistakes can happen. When Envoy is provided with invalid configuration, it notifies the management server by sending a NACK in the xDS protocol, informing the management server of an error state (which may be temporary, as Envoy follows an eventually consistent configuration model). In our Enterprise version of Gloo, we detect these NACKs and expose them as a metric.
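In the xDS protocol, a NACK is not a separate message type. It is an ordinary discovery request that answers a response (by echoing its nonce) and carries an error detail. The Python sketch below shows how a management server might tell the cases apart and count NACKs per node. The field names mirror the xDS `DiscoveryRequest`, but the types are simplified (the real error detail is a structured status, not a string), and the `NackTracker` class is a hypothetical helper, not Gloo's implementation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiscoveryRequest:
    type_url: str
    version_info: str      # last version Envoy successfully applied
    response_nonce: str    # nonce of the response being answered; empty on first request
    error_detail: Optional[str] = None  # set only when Envoy rejected the update


def classify(req: DiscoveryRequest) -> str:
    """Return 'initial', 'ack' or 'nack' for an incoming request."""
    if not req.response_nonce:
        return "initial"
    if req.error_detail:
        return "nack"
    return "ack"


class NackTracker:
    """Counts rejected updates per (node, resource type), ready to export as a metric."""

    def __init__(self) -> None:
        self.nacks: Counter = Counter()

    def observe(self, node_id: str, req: DiscoveryRequest) -> str:
        kind = classify(req)
        if kind == "nack":
            self.nacks[(node_id, req.type_url)] += 1
        return kind


tracker = NackTracker()
tracker.observe("envoy-1", DiscoveryRequest("listeners", "", ""))
tracker.observe("envoy-1", DiscoveryRequest("listeners", "v1", "n1"))
tracker.observe("envoy-1", DiscoveryRequest("listeners", "v1", "n2", "invalid filter config"))
```

Note that on a NACK the `version_info` stays at the last good version ("v1" above), which is how the server knows Envoy is still running the previous configuration.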
Many aspects of Envoy’s behavior can be extended by adding new filters to the filter chain, as described above. These filters can modify the request, impact routing, and emit metrics. Some interesting Envoy filters include:
- Ratelimit — rate limits requests. Once the rate is over the limit, Envoy does not pass requests upstream and instead returns 429 to the caller.
- Gzip — supports gzip encoding, compressing responses on the fly.
- Buffer — buffers complete requests before forwarding them upstream.
- ExtAuthz — allows configurable authentication / authorization for requests.
The ability to extend Envoy on the data path is very important: processing is done very fast, with no need to send the request to an additional proxy, which reduces latency and increases overall performance. The control plane can enable or disable these extensions at runtime, ensuring that the data path only includes what’s needed.
Envoy’s extensible design allows us to use upstream Envoy, and extend it by providing a range of filters we developed in-house. Some of the filters we developed include:
- AWS — allows calling AWS Lambda functions, using AWS Signature Version 4 for authentication.
- NATS Streaming — converts HTTP requests into NATS Streaming publish operations.
- Transformation — advanced request body and header transformations.
When we designed Gloo, we sought to make it highly extensible, sharing the same design principle as Envoy. To this end, we based the architecture of Gloo on plugins. Typically, Gloo provides a plugin for each Envoy filter, tasked with configuring that filter. This allows us to rapidly support new Envoy extensions as soon as they become available.
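The one-plugin-per-filter idea can be sketched with a small registry. The Python below is only an illustration of the pattern: Gloo's real plugin interface is different, and the registry, the decorator, the plugin names and every settings key here are hypothetical.

```python
from typing import Callable, Dict, List

# A plugin turns user-facing settings into the config for one Envoy filter.
PluginFn = Callable[[dict], dict]
REGISTRY: Dict[str, PluginFn] = {}


def plugin(filter_name: str) -> Callable[[PluginFn], PluginFn]:
    """Decorator that registers a plugin under the filter it configures."""
    def register(fn: PluginFn) -> PluginFn:
        REGISTRY[filter_name] = fn
        return fn
    return register


@plugin("ratelimit")
def ratelimit_plugin(settings: dict) -> dict:
    # Hypothetical setting with a default.
    return {"requests_per_unit": settings.get("requests_per_unit", 100)}


@plugin("transformation")
def transformation_plugin(settings: dict) -> dict:
    return {"add_headers": dict(settings.get("add_headers", {}))}


def build_filter_chain(user_config: Dict[str, dict]) -> List[dict]:
    """Ask each enabled plugin for its filter config; the router always goes last."""
    chain = []
    for name, settings in user_config.items():
        if name not in REGISTRY:
            raise KeyError(f"no plugin registered for filter {name!r}")
        chain.append({"name": name, "config": REGISTRY[name](settings)})
    chain.append({"name": "router", "config": {}})
    return chain


chain = build_filter_chain({"ratelimit": {"requests_per_unit": 10}, "transformation": {}})
```

Supporting a new Envoy filter then means adding one more registered function, with no change to the code that assembles the chain.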
Gloo Enterprise also includes a caching filter, as well as external auth and rate-limit servers (covered in a future blog post).
Upgrades & Rollback
As we ship an extended version of Envoy, it is very important for us to stay close to upstream Envoy. To achieve this, our repository structure and CI closely resemble those of Envoy.
Envoy’s master branch is always considered RC quality. We therefore frequently verify that Gloo can be built with the latest master code. This ensures that we always have the latest Envoy features and optimizations, and can respond quickly if a security update is needed. Building Envoy on the 32-core CI we use takes about 10 minutes.
Even the most carefully designed and well-tested application may face issues when deployed to the complex environment of production. To mitigate this risk, it is often advisable to take a canary deployment approach, whereby a new feature is deployed first to only a subset of the users or systems and is only deployed broadly after careful monitoring. With Gloo, we take this approach at several levels. First, we allow new versions of the configuration to be initially deployed to a small fraction of Envoy instances, where we can test the configuration. Second, when a new version of Envoy becomes available, we roll out this new version gradually. Finally, when we’re ready to release a new version of our control plane, we first connect the new version to only a small number of Envoy instances, where we can carefully validate the expected behaviors.
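One simple way to pick "a small fraction of Envoy instances" is to hash each instance's node ID into a bucket and compare it against a rollout percentage. The Python sketch below shows that technique; it is an assumption about how such a selection could work, not a description of how Gloo does it, and the function names are hypothetical.

```python
import hashlib


def in_canary(node_id: str, percent: int, salt: str = "") -> bool:
    """Deterministically place a node in one of 100 buckets and compare to the rollout size."""
    digest = hashlib.sha256(f"{salt}:{node_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def config_version_for(node_id: str, stable: str, candidate: str, percent: int) -> str:
    """Serve the candidate config to canary nodes and the stable config to everyone else."""
    return candidate if in_canary(node_id, percent) else stable


nodes = [f"envoy-{i}" for i in range(200)]
canaries = [n for n in nodes if in_canary(n, 10)]
```

Because the bucket depends only on the node ID and the salt, a node that is in the canary at 10% stays in it when the rollout grows to 50%, so the set of affected instances only expands. Changing the salt reshuffles which instances go first.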
In the next blog of this series, we will look at our design of Gloo as a control plane for Envoy, the architecture choices we made, and the tradeoffs they represent.