Scaling Cilium to New Heights With xDS

At Solo.io, we are excited about the Cilium project, which introduced innovations to container and cloud-native networking by leveraging eBPF in its data plane for programmability, performance, and observability. The Cilium control plane, which configures the data plane, has room for improvement in large, scaled-out environments. To help push Cilium to new heights, we have been working with others in the Cilium community, including Isovalent, Google, and Microsoft, to use xDS to drive control-plane improvements in Cilium.

In this blog, we explore the current Cilium control-plane design, where and why limitations may arise for large deployments, and how the community could advance this architecture using the CNCF universal data plane (xDS) APIs.

Understanding Cilium’s Control Plane Architecture

Cilium follows a common networking architecture based on a “data plane” and a “control plane”. In Cilium, the data plane is deployed on each host (or Kubernetes node) and consists of eBPF programs to handle L3/L4 connectivity & policy. For completeness, Cilium also uses an Envoy proxy in its data plane for L7 policy, but we will omit this for simplicity.

The Cilium control plane is implemented as a cilium-agent daemon that’s deployed on each Kubernetes node. Each cilium-agent is a separate, independent instance of the control plane.

The cilium-agent connects to the Kubernetes API server and watches for configuration changes which it uses to configure the data plane. The cilium-agent also writes configuration to the Kubernetes API representing endpoints or identities being created on its respective node. 

For example, when a Pod is started on a Kubernetes node, the cilium-agent is responsible for writing a CiliumEndpoint custom resource (CR) and potentially a CiliumIdentity CR representing the Pod’s network identity. The cilium-agent will also update any eBPF maps on the node related to identity and endpoint mapping. Other cilium-agents on other Kubernetes nodes also watch when these new CiliumEndpoint and CiliumIdentity CRs are created and update their local eBPF data plane for policy enforcement. This mechanism is able to reconcile global policy enforcement configuration on each node, so all nodes see the same enforcement behavior.
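To make this concrete, below is a trimmed sketch of roughly what a CiliumEndpoint written by the cilium-agent looks like; the names, IDs, and addresses are invented, and the exact schema varies between Cilium versions.

apiVersion: cilium.io/v2
kind: CiliumEndpoint
metadata:
  name: sleep-7656cf8794-9v2p6   # follows the Pod name
  namespace: default
status:
  id: 1234                       # endpoint ID, local to the node
  identity:
    id: 50568                    # numeric security identity, shared cluster-wide
    labels:
      - k8s:app=sleep
      - k8s:io.kubernetes.pod.namespace=default
  networking:
    addressing:
      - ipv4: 10.0.1.23
    node: 10.128.0.5             # IP of the node hosting the Pod
  state: ready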

Best Practices for Building a Control Plane

We’ve blogged quite a bit in the past about best practices for building scalable, secure, efficient control planes. Before we dig deeper into scaling Cilium’s control plane, we should review some of those best practices. 

The data plane in a networking architecture should focus on being as simple as possible, highly performant, and efficient at doing what it needs to do: in this case, shuttle bytes back and forth, implement policy, and enforce security. The role of the control plane is to shield the data plane from complexity and from anything that distracts from that core role.

On the other hand, users need to be able to specify configuration and policy in a form that best fits their user experience, often through some kind of domain-specific configuration format. Something then needs to combine this higher-level user configuration with the infrastructure state and translate it to the lower-level data-plane formats. Translation is only half the battle: something also needs to distribute the lower-level configuration to the data planes, and do so efficiently. That's where a control plane comes into the picture.

The control plane allows configuration decoupling and integration with other parts of the platform that can then inform the data plane. In many ways, this layered design is similar to the three-tier architecture we use when building applications: a presentation layer, a decoupling business-logic layer, and a data store.

In the case of a networking architecture, the control plane layer will be dealing with sensitive data like reading/writing to the Kubernetes API and creating network identities. As it’s a separate layer, we can secure and harden it and eliminate the need for the data plane to have privileges to perform these tasks. 

In many cases, although simpler to start, co-locating some of these layers causes inefficiencies, security issues, and scaling/coupling issues.

Scaling Considerations in Cilium’s Control Plane Architecture

Each and every cilium-agent in the cluster is responsible for updating its local data plane configuration with the global cluster configuration. Each cilium-agent may watch upwards of 15 resource types (footnote 1).

As a cluster grows in terms of nodes, pods, namespaces, and network policies, so does the amount of work each cilium-agent must perform, and serving and updating all of this state puts significant pressure on the Kubernetes API server.

In large deployments, this pressure on the Kubernetes API server could end up slowing or even bringing down all operations across the cluster. 

The impact of computing global configuration state on each node really starts to magnify when you look at a common action in a cluster and how cilium-agent deals with it: workloads and namespaces being labeled, re-labeled, or unlabeled. 

The cilium-agent is responsible for writing the CiliumEndpoint and CiliumIdentity resources for pods scheduled onto its node. Since these resources depend on the combination of pod and namespace labels, a change to a label causes every dependent resource to be updated. This creates a high volume of writes and resultant reads, as the new state is propagated to all cilium-agents, which must react and reconfigure their local data plane.

apiVersion: cilium.io/v2
kind: CiliumIdentity
metadata:
  name: "50568"
  labels:
    app: sleep
    io.cilium.k8s.policy.cluster: default
    io.cilium.k8s.policy.serviceaccount: sleep-v1
    io.kubernetes.pod.namespace: default
    version: v1
security-labels:
  k8s:app: sleep
  k8s:io.cilium.k8s.namespace.labels.team: loyalty
  k8s:io.cilium.k8s.namespace.labels.version: v10.45
  k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name: default
  k8s:io.cilium.k8s.policy.cluster: default
  k8s:io.cilium.k8s.policy.serviceaccount: sleep-v1
  k8s:io.kubernetes.pod.namespace: default
  k8s:version: v1

Code Listing 1: A CiliumIdentity resource combines pod and namespace labels. Changes to either force a recomputation and generation of a new CiliumIdentity.
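The namespace labels embedded in the identity above (team: loyalty, version: v10.45) come from an ordinary Namespace object along the lines of the following sketch:

apiVersion: v1
kind: Namespace
metadata:
  name: default
  labels:
    team: loyalty      # changing or removing either of these values invalidates
    version: v10.45    # every CiliumIdentity in the namespace that embeds them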

Operations as simple as labeling a namespace in your cluster can be extremely expensive for Cilium and have the potential to grind API server operations to a crawl (see the Cilium recommendations for including/excluding labels for identity purposes). For example, changing a label on namespaces in a moderately sized cluster can trigger enough of a stampede from the cilium-agents that Kubernetes API server response latency climbs to roughly 4 minutes, effectively stalling all operations on the cluster.

Consider this environment:

  • 200 node Kubernetes cluster 
  • 5 namespaces
  • 50 deployments per namespace
  • 80 replicas per deployment (20k total pods)

During a test where we update labels across the namespaces, we see CPU utilization spike to around 150% and memory climb to around 1 GB across all of the nodes in the cluster.

Graph 1: CPU and memory spike across all nodes

Spiking CPU and memory across all nodes in the cluster at the same time is undesirable behavior; even more serious, however, is how the cilium-agents' event reading and writing impacts latency on the Kubernetes API server. In the following graph we see latency grow to between 3 and 4 minutes. This could most certainly lead to outages of various kinds! Unfortunately, because of this control-plane architecture, the typical way to deal with scaling issues, adding capacity, does not work; in fact, adding more nodes and/or workloads amplifies this behavior.

Graph 2: Kubernetes API Server latency spikes to 3–4 minutes

Taking Pressure Off of the Kubernetes API Server

For larger Cilium clusters, you can offload the pressure exerted on the Kubernetes API server by using a dedicated key-value store instead. The key-value store holds things like workload identities, endpoints, and IP-to-identity mappings. Instead of storing this information in Kubernetes custom resources (CRDs), Cilium watches, manipulates, and writes these objects directly in its own database. The Cilium Helm charts support installing etcd as a dedicated key-value store for this purpose.

As your cluster grows, it may be a good idea to use the kv-store to offload the reads/writes of the Cilium objects instead of pressuring the Kubernetes API server.
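As a rough sketch, enabling a dedicated etcd kv-store through the Cilium Helm chart looks something like the values below; the exact keys and defaults vary between chart versions, so treat this as illustrative rather than copy-paste ready.

# values.yaml (illustrative)
identityAllocationMode: kvstore    # keep identities in the kv-store instead of CRDs
etcd:
  enabled: true
  ssl: true
  endpoints:
    - https://cilium-etcd.kube-system.svc:2379   # example etcd endpoint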

NOTE: There are other optimizations that Cilium makes to take pressure off the Kubernetes API server for things like policy status updates. See the docs for k8s-events-handover for more details.

If we re-run the previous tests with a kv-store in place, we see reduced pressure on the Kubernetes API server, though not necessarily on CPU.

Graph 3: CPU spike across all nodes, memory stays within 300-400 MB range

Instead of the 150% CPU consumption seen in the previous scenario, CPU spikes to around 100% and memory stays fairly constant at around 300 to 400 MB. This is because recomputing and regenerating the CiliumEndpoint and CiliumIdentity objects for each pod and each identity (Cilium creates all-new identities when labels change, and every eBPF map that refers to an old identity must be updated) takes CPU horsepower regardless of which backing store (CRD, kv-store, etc.) is used.

If we observe the kv-store, we see that event ops/second and latency will spike during this namespace label/unlabel event:

Graph 4: Event ops and latency on the kv-store spike during this namespace label event

In this particular case, we see the kv-store take on quite a bit of load via event IO, and latency for kv-store calls climbs to about 15 seconds. At the end of the day, this is better for overall cluster operations than saturating the Kubernetes API server with requests. In fact, in Graph 5 we can see that Kubernetes API server latency stays in an acceptable range of 10 to 40ms, versus the previous example where latency went up to 4 minutes.

Graph 5: Kubernetes API Server latency in the 10 to 40ms range

Using the kv-store backend for Cilium objects is a good way to scale and take pressure off the Kubernetes API server; however, it comes with its own drawbacks. There are now two persistent stores to maintain and scale, increasing operational burden. Running databases or persistent stores to support production scale is no trivial feat, and recovery procedures are needed if the two "sources of truth" lose consistency. The Kubernetes API server and its storage are fully managed on many platforms, but the kv-store is not. And since the overwhelming majority of the load is reads, a cache can yield the same improvement for a fraction of the operational complexity.

Improving Cilium’s Control Plane Scaling, Security, and Efficiency with xDS

What if we could get the best of both worlds? Reduce pressure on the Kubernetes API server and eliminate the need for expensive production operations maintaining a separate data store? Maybe even bonus points for solving some other lingering Cilium scaling and security issues?

At Solo.io, we are excited to contribute to the broader xDS efforts in the Cilium community and help drive the project forward toward a scalable, secure, and efficient control plane. Using the xDS protocol allows us to scale to thousands or tens of thousands of nodes in a cluster. This approach solves many of the problems discussed above, as well as others, such as the blast radius of a single-node compromise and the duplication of CiliumIdentities at scale. Let's see how it works.

The xDS protocol started as a dynamic way to configure the Envoy proxy, but was built with the idea that it could be used to support a “universal data plane”. The protocol has become an efficient way to synchronize state across multiple nodes for any data plane, and is now managed by a CNCF working group.

Ultimately, we want to eliminate all of the redundant work each cilium-agent is doing, centralize security-sensitive operations such as identity creation, and serve state to agents efficiently, all without compromising the reliability of Kubernetes as a whole. 

To do this, instead of each cilium-agent acting as a separate control plane, we treat the cilium-agent as a simple read-only client of an intelligent, centralized control plane. The control plane shields complex and permission-sensitive operations from the agent.

The communication between the cilium-agent (now purely part of the data plane) and the control plane happens over the xDS protocol. No write operations are permitted from the data plane to the control plane, so the cilium-agent no longer needs write access to the backing store (CRD/kv-store). The xDS control-plane service can be secured and hardened, and it becomes the only component that needs read/write access to the Kubernetes API server.
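Conceptually, the exchange looks like the sketch below. The field names follow the xDS DiscoveryRequest and DiscoveryResponse protos (rendered here as YAML for readability); the Cilium-specific resource type URL is hypothetical.

# cilium-agent -> control plane: subscribe to a resource type
DiscoveryRequest:
  node:
    id: worker-node-17                            # identifies the requesting agent
  type_url: type.googleapis.com/cilium.Identity   # hypothetical Cilium resource type
  resource_names: []                              # empty = subscribe to everything of this type
  version_info: ""                                # last version this agent acknowledged

# control plane -> cilium-agent: current state, then incremental updates as things change
DiscoveryResponse:
  version_info: "42"
  type_url: type.googleapis.com/cilium.Identity
  resources: []                                   # opaque payloads the agent applies to its eBPF maps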

This architecture reduces load on the Kubernetes API server, and does not require any externally managed data stores. It also looks a lot closer to the three-tier control-plane architecture discussed previously.

Graph 6: Initial tests of an xDS implementation show expected behavior in CPU and memory usage

In this architecture we do see CPU and memory overhead in the xDS control-plane pods, as expected, and we still get some CPU/memory processing overhead on each node as it still needs to manipulate the eBPF data plane. 

Other benefits this model brings are a smaller blast radius on node compromise and the elimination of duplicate identities created by the cilium-agent. In the original architecture (CRD or kv-store), each node runs a full control plane that needs special privileges to read and write CiliumEndpoints and CiliumIdentities. Compromising the cilium-agent on a single node therefore compromises the control plane and gives an attacker the ability to affect other nodes, which can end up as a cluster-wide compromise. In the xDS model, the cilium-agent is only permitted to read data from the control plane (no writes), so compromising a single cilium-agent does not give access to the entire control plane or the cluster. As mentioned earlier, the xDS control plane is a privileged component and can be locked down and secured, including by running it completely outside of the cluster.
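To illustrate the difference, here is an abridged sketch of the kind of RBAC the per-node agent needs today in order to write identities and endpoints (the real ClusterRole shipped with Cilium grants more than this, and the name below is illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cilium-agent-writes        # illustrative name, not the actual ClusterRole
rules:
  - apiGroups: ["cilium.io"]
    resources: ["ciliumendpoints", "ciliumidentities"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]

In the xDS model the agent only talks to the control plane over gRPC, so the create/update/delete verbs, and the Kubernetes credentials behind them, can be dropped from every node.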

Another benefit of the xDS approach is that it eliminates the generation of duplicate identities in Cilium by centralizing identity creation. When each cilium-agent acts as an independent control plane in isolation from other nodes, as in the existing model, it can make duplicative decisions. For example, when Pods get assigned to a node, the CNI plugin sets up the networking endpoint, and when the cilium-agent recognizes a new endpoint that doesn't have an existing CiliumIdentity, it tries to create one. The same thing happens when namespace labels change and all identities need to be re-computed. Since identity creation happens independently across multiple nodes, there is a good chance that multiple CiliumIdentities get created for the same logical identity (in an extreme case, this is easily reproducible as described here). In the xDS approach, CiliumIdentities are created centrally and this scenario is eliminated.

Conclusion

Cilium has a powerful data plane built on eBPF, but to get Cilium to operate effectively at scale, we can leverage the xDS protocol to improve the control-plane architecture. xDS is an efficient protocol and allows us to leverage the best practices for building a control plane that we’ve learned over the years. In fact, if we build the xDS control plane directly into the cilium-operator, there would be no net-new complexity added from this implementation. 

At Solo.io, we've been building cloud networking and service mesh technology for the past 8 years (with considerable prior experience), and we think leveraging xDS to build a more scalable control plane is a great path forward for scaling Cilium. To that end, we have recently launched Gloo Network for Cilium. We are looking forward to working with the Cilium community to design and implement this new architecture, and if you're interested in following along or getting involved, please join us in the Cilium community!

Footnotes:

1: Resources watched include: Pod, Node, Namespace, Endpoint, EndpointSlice, Secrets, NetworkPolicy, CiliumNetworkPolicy, CiliumClusterWideNetworkPolicy, CiliumCidrGroup, CiliumExternalWorkload, CiliumL2AnnouncementPolicies, CiliumLoadBalancerIpPools, CiliumNodeConfigs, CiliumPodIpPools, CiliumNode, CiliumIdentity, CiliumEndpoint