Gloo or Ambassador: why the control plane matters
Here at Solo.io, we work on Envoy-based technology like API/Edge Gateways, service mesh, and WebAssembly (Wasm). We’ve had a lot of success deploying these technologies with our open-source community and paying customers. For example, our Gloo API Gateway is deployed at leading financial, telco, and retail companies. The big concerns for those customers are performance, scale, and security. Their use cases differ as much as their environmental constraints do, but the Gloo control plane is built to be flexible enough to accommodate these variations.
Part of building a distributed system that scales and is secure is appropriate separation of concerns. For example, in Gloo we separate the concerns of abstracting the environment-specific details and generation of Envoy configuration from the running Envoy proxy itself. Using common parlance for these systems, we’ve separated out the “control plane” from the “data plane”. This separation allows us to architect for performance, security, and system scalability. But not all API Gateways built on Envoy do this. Let’s go into more detail.
Control planes are different from data planes
The control plane integrates with backend registries (like the Kubernetes API, Consul, Vault, AWS, etc.), translates end-user configuration, and processes all of this information to build valid Envoy configuration. For large, dynamic systems with even moderate rates of change, this can consume a lot of memory and CPU cycles computing and recomputing the state of the proxy. Matt Klein, in a recent article about building these types of control planes, says:
“A high rate of change in the compute/network topology puts a substantial amount of pressure on the control plane; if not careful, a naive implementation is easily susceptible to runaway topology recomputations.”
On the other hand, the data plane (Envoy) is doing things like load balancing, routing, authZ/N, TLS encryption/decryption, and, in some transformation use cases, actually altering the requests. The computation and associated resources needed for the data plane are different from those needed for the control plane. They should be separate, and in the Gloo API Gateway they are. In fact, Envoy was built with this remote xDS/configuration API in mind: a separate control plane can flexibly serve configuration to Envoy.
Separation of the control plane and data plane is crucial for our customers and users to achieve performance, scale, and security goals. With the Gloo API Gateway, the Envoy proxy runs in its own pod, separate from the control plane; the data plane is locked down and scales separately.
Not all API Gateways built on Envoy follow this architecture. A proxy like Datawire Ambassador, for example, does not separate the deployment of the control plane and data plane. It deploys Envoy co-located with the control plane in the same container, presumably to follow the model of earlier API Gateways and to keep it simpler to operate. This has drawbacks that lead to the outcomes documented in a recent blog post on “Stress Testing API Gateways built on Envoy Proxy.”
Drawbacks of co-located control plane and data plane
Ambassador runs wholly contained in a single Linux container (and pod). The single container runs the data plane (Envoy) as well as several other processes that make up the control plane. Those control plane processes play roles like serving the UI, querying the Kubernetes API/watching endpoints, and writing intermediate-language configuration to disk. Other processes pick up these files, do some more processing, and then write to the file system again. Eventually, a process called ambex picks up these intermediate files and serves configuration to Envoy over localhost xDS.
The internal details notwithstanding, with Ambassador the control plane contends for the same resources needed to actually process requests. For example, in a moderately loaded system, Ambassador’s control plane can consume a lot of memory (as can be expected for a control plane) and potentially starve the data plane of resources. Scaling up the number of pods won’t help, as each pod has its own copy of the control plane; you’ll end up seeing the same problem across all the replicas. For every pod replica you add, you get a 1:1 mapping of control plane to data plane. This exact problem is what we see in the write-up about stress testing API Gateways built on Envoy.
Another issue with co-locating the data plane and control plane is misaligned load scaling variables. As you scale your edge (by adding more pods), you linearly scale the load placed on your Kubernetes API, since each pod runs its own control plane. Depending on how far you try to scale or how many other calls are made to your Kubernetes API, you could overwhelm the Kubernetes API and degrade your cluster.
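To make the scaling argument in the last two paragraphs concrete, here is a toy model in Python. It is only a sketch: the EdgeDeployment class, the number of watches per control plane, and the memory figure are made-up assumptions for illustration, not measurements of Ambassador or Gloo.

```python
from dataclasses import dataclass


@dataclass
class EdgeDeployment:
    """Toy model of an Envoy-based edge deployment (hypothetical, for illustration)."""
    proxy_replicas: int          # Envoy (data plane) pods serving traffic
    control_plane_replicas: int  # control plane pods; only used when not co-located
    colocated: bool              # True: every proxy pod embeds its own control plane

    def control_plane_instances(self) -> int:
        # Co-located: one control plane per proxy pod (the 1:1 mapping).
        # Separated: the control plane count is independent of the proxy count.
        return self.proxy_replicas if self.colocated else self.control_plane_replicas

    def kube_api_watches(self, watches_per_control_plane: int) -> int:
        # Only control plane instances open watches against the Kubernetes API.
        return self.control_plane_instances() * watches_per_control_plane

    def control_plane_memory_mib(self, mib_per_control_plane: int) -> int:
        # Total memory spent on control plane work across the deployment.
        return self.control_plane_instances() * mib_per_control_plane


WATCHES = 5   # assumed: resource types each control plane watches
CP_MEM = 500  # assumed: MiB used by one control plane instance

for replicas in (2, 10, 50):
    co = EdgeDeployment(replicas, 0, colocated=True)
    sep = EdgeDeployment(replicas, 1, colocated=False)
    print(
        f"{replicas:>3} proxies | "
        f"co-located: {co.kube_api_watches(WATCHES)} watches, "
        f"{co.control_plane_memory_mib(CP_MEM)} MiB | "
        f"separated: {sep.kube_api_watches(WATCHES)} watches, "
        f"{sep.control_plane_memory_mib(CP_MEM)} MiB"
    )
```

In this model the co-located numbers grow linearly with the proxy count, while the separated numbers stay flat no matter how far the edge scales out.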
The final issue with this co-located data plane and control plane is security. The control plane components need to interact with the rest of the data center components, as described earlier, and should be secured differently from the edge. For example, with Ambassador in Kubernetes, the control plane needs permission to access the Kubernetes API (i.e., Roles, ClusterRoles, etc.). The security issues arise from the pods at your edge (which handle potentially malicious requests) having privileges to read from and write to your Kubernetes cluster. If your edge gets compromised, your cluster is at risk.
Separating the control plane from data plane
In Gloo, we use a separate control plane and data plane to alleviate these concerns. The data plane scales independently from the control plane and can be highly locked down (no ClusterRoles, no root user, no Kube API access tokens, read-only filesystem, etc.). With this approach, the data plane scales to meet ingress/edge load without putting more pressure on the Kubernetes API. The control plane components can be scaled independently by, for example, giving more memory to the xDS components and restricting Kubernetes API access to only those components that need it (i.e., discovery components).
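As a rough illustration of what “locked down” can mean for a data plane pod, the Python sketch below builds a Kubernetes Pod manifest as a plain dict and checks it against the properties listed above. This is not Gloo’s actual manifest: the function names, the pod name, and the image string are hypothetical. The manifest fields themselves (automountServiceAccountToken, runAsNonRoot, readOnlyRootFilesystem, allowPrivilegeEscalation, capabilities) are standard Kubernetes Pod fields.

```python
from typing import List


def locked_down_proxy_pod(image: str) -> dict:
    """Build a Pod manifest for a proxy that needs no Kubernetes API access."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "edge-proxy"},  # hypothetical name
        "spec": {
            # No Kube API access token is mounted into the pod.
            "automountServiceAccountToken": False,
            "containers": [{
                "name": "envoy",
                "image": image,
                "securityContext": {
                    "runAsNonRoot": True,            # no root user
                    "readOnlyRootFilesystem": True,  # read-only filesystem
                    "allowPrivilegeEscalation": False,
                    "capabilities": {"drop": ["ALL"]},
                },
            }],
        },
    }


def lockdown_violations(pod: dict) -> List[str]:
    """Return human-readable reasons why a pod is not locked down."""
    problems = []
    spec = pod.get("spec", {})
    # Kubernetes mounts the service account token unless told otherwise.
    if spec.get("automountServiceAccountToken", True):
        problems.append("service account token is mounted")
    for container in spec.get("containers", []):
        ctx = container.get("securityContext", {})
        if not ctx.get("runAsNonRoot", False):
            problems.append(f"{container['name']}: may run as root")
        if not ctx.get("readOnlyRootFilesystem", False):
            problems.append(f"{container['name']}: root filesystem is writable")
    return problems


# "example.invalid/envoy" is a placeholder image, not a real registry path.
pod = locked_down_proxy_pod("example.invalid/envoy:placeholder")
print(lockdown_violations(pod))
```

The “no ClusterRoles” part is not visible in the pod spec itself; it depends on the pod’s service account having no RBAC bindings, which is only practical when the proxy does not have to talk to the Kubernetes API at all.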
Understand the implementation of your API Gateway
If you are considering Envoy-based infrastructure (you should!), especially at the edge, you should consider the underlying implementation and architecture of the project you choose. Projects are not equal just because they use Envoy as the data plane: the control plane architecture matters.