The operational overhead of Istio’s External Control Plane
At Solo.io, we work with customers with the largest deployments of Istio in the world where cross-cluster or multi-cluster mesh solves a lot of problems for them. As Istio continues to mature and expand into multi-cluster architectures, many have looked to how they can more effectively manage their Istio deployments. In this blog, we explore a bit more of the external control plane architecture, when to use it, the complexities it introduces, and what other options you may have for cross-cluster/multi-cluster Istio management.
With every release of Istio, more deployment models are added to the upstream documentation, and users are left wondering, “Which deployment model works for me and my architecture?” In this article, we will explore a newer deployment model known as the “external” control plane and help you decide whether this model is appropriate for your multi-cluster deployment.
There are two very distinct parts within Istio, and they are often managed by different roles within a company. The control plane is the brains of the operation and is responsible for configuring every istio-proxy sidecar and gateway. In many organizations the control plane is managed by an operations team or platform role because it needs to be able to support the developer workloads. The second component, known as the data plane, is made up of all the sidecar proxies and gateways through which client requests flow. Since the data plane is in the critical path of client requests, developers often share some ownership of the data plane with operations. Because of this separation of concerns, both between roles and within the Istio architecture, many wonder whether physically separating the two might make sense. This is the external control plane architecture, which we will explore below.
The Istio External Control Plane
Simply put, the external control plane architecture moves the Istio control plane outside of the cluster and mesh it is meant to administer. From an operator’s perspective, it’s a desirable architecture because you can manage the Istio control plane independently of the target mesh cluster and more cleanly separate responsibilities. This architecture, however, comes with some complexities that are important to understand before jumping in, as we will explore below. Additionally, this is not a solution for cross-cluster/multi-cluster Istio management. Again, it’s intended to simplify operational control of individual meshes.
A majority of users consume Kubernetes as a managed solution (EKS, AKS, GKE) or a standalone deployment (Kops). For Istio external control plane you will need to create a second cluster to host the Istio control plane as shown below.
Already you can start to see some of the complexities involved with externalizing istiod. First, we need to make the istiod application available to the mesh cluster. To do this we have to set up an ingress gateway on the control cluster. Since the job of the control cluster ingress gateway is to proxy traffic into that cluster, you need to create a separate Istio control plane to manage it. Second, you have to securely connect the external istiod that serves the mesh cluster to the Kubernetes API server running in the mesh cluster. Typically this involves copying a Kubernetes service account token secret from the mesh cluster into the control cluster, which is neither easy nor safe to automate.
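The moving parts described above can be summarized as a small checklist. The following Python sketch is purely illustrative: the `ExternalControlPlaneSetup` type and its field names are hypothetical, not part of Istio or any other tool, and simply encode the prerequisites named in this section.

```python
from dataclasses import dataclass


@dataclass
class ExternalControlPlaneSetup:
    """Hypothetical checklist of the pieces this section describes."""
    ingress_gateway_exposed: bool      # istiod is reachable from the mesh cluster
    gateway_control_plane_ready: bool  # separate control plane manages that gateway
    mesh_api_credentials_copied: bool  # service account token secret from the mesh cluster


def missing_prerequisites(setup: ExternalControlPlaneSetup) -> list[str]:
    """Return descriptions of the prerequisites that are not yet in place."""
    checks = [
        (setup.ingress_gateway_exposed,
         "ingress gateway exposing istiod on the control cluster"),
        (setup.gateway_control_plane_ready,
         "separate Istio control plane managing the ingress gateway"),
        (setup.mesh_api_credentials_copied,
         "mesh cluster service account token secret in the control cluster"),
    ]
    return [description for ok, description in checks if not ok]


if __name__ == "__main__":
    setup = ExternalControlPlaneSetup(True, False, False)
    for item in missing_prerequisites(setup):
        print("missing:", item)
```

Each entry is an extra component that an in-cluster istiod would not require, which is where the additional operational overhead comes from.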
Failure Modes of the Istio External Control Plane
It is important to understand the ways you may be impacted by an outage of the control plane. Regardless of where istiod lives, a loss of communication with the control plane will cause issues for all mesh workloads and will impact the Kubernetes API server. First, all workloads will be “frozen in time”, meaning they will not receive new configurations, endpoints, or certificates. Depending on how long the connectivity issues persist, this can affect client traffic in many ways. The Kubernetes API server will also no longer be able to accept Istio configurations (in the form of custom resources) because the validation webhook cannot be reached.
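To make the “frozen in time” behavior concrete, here is a minimal Python sketch of how the impact grows with the length of an outage. The function name and the two inputs are assumptions made for illustration; in particular, the remaining certificate lifetime depends on how your mesh is configured and is not something this article specifies.

```python
from datetime import timedelta


def outage_effects(outage: timedelta, cert_time_left: timedelta) -> list[str]:
    """List the effects of losing contact with istiod for `outage`.

    `cert_time_left` is an assumed input: how long the workload
    certificates remain valid at the moment the outage starts.
    """
    effects = [
        "proxies keep their last known configuration and endpoints",
        "new Istio configuration cannot be validated by the webhook",
    ]
    if outage >= cert_time_left:
        # Certificates cannot be renewed while istiod is unreachable.
        effects.append("workload certificates expire before they can be renewed")
    return effects


if __name__ == "__main__":
    for hours in (1, 30):
        print(f"{hours}h outage:")
        for effect in outage_effects(timedelta(hours=hours), timedelta(hours=24)):
            print("  -", effect)
```

A short outage mostly means stale configuration, while an outage that outlasts the certificates can start to break traffic between workloads.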
Due to the above issues it is extremely important that the control cluster and mesh cluster share failure domains as much as possible, for example, being deployed to the same datacenter. By moving istiod outside of the mesh cluster, you introduce many more ways in which connectivity issues can arise. An alternative approach is to deploy the istiod control plane into each of the respective workload clusters and keep a tighter failure mode. This also has some complexities, which we address in the conclusion.
So when would it make more sense to use the external control plane architecture?
Kubernetes Clusters as a Service
Some larger organizations with their own data centers and public cloud environments want to offer similar technology stacks throughout. They have the resources and means to even offer Kubernetes clusters as a service. These organizations often have more control over the architecture of how their Kubernetes clusters are built. For example, SAP uses the open source tool Gardener to provision clusters for its development teams. Gardener uses Seed clusters that run the Kubernetes API servers as individual pods, and Shoot clusters use those API servers to run their workloads.
The big advantage for the Istio external control plane in this environment is that the Kubernetes API already lives in a separate accessible cluster. That means we can deploy istiod right next to the Kubernetes API server which will address some of the previous concerns about connectivity loss.
Below we can see a much simpler architecture using the Istio external control plane.
However, Gardener does not just run one Kubernetes API server per Seed cluster but instead runs many. It would then be desirable to also spin up an external Istio control plane per Kubernetes API server. This would give the end user the greatest flexibility, because they will need some level of control over which Istio versions are deployed to each cluster and when. But an issue arises: not all Istio versions are compatible with each other, due to breaking changes in the CustomResourceDefinitions between minor versions. We would be limited to allowing only compatible istiod versions to be deployed within a Seed cluster, but this may conflict with the other responsibilities of said cluster.
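As a rough illustration of that constraint, the Python sketch below checks whether the istiod versions requested for one Seed cluster can coexist. The compatibility rule used here (all versions must fall within `max_minor_skew` minor releases of each other) is an assumption made for the example, not Istio’s actual compatibility policy, and the cluster names are hypothetical.

```python
def minor_release(version: str) -> tuple[int, int]:
    """Parse '1.10.3' into (1, 10), ignoring the patch level."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def versions_conflict(requested: dict[str, str], max_minor_skew: int = 0) -> bool:
    """Return True if the requested istiod versions cannot share a Seed cluster.

    `requested` maps a (hypothetical) Shoot cluster name to the istiod
    version its owner asked for. The skew rule is an assumed stand-in for
    real CRD compatibility between minor versions.
    """
    if not requested:
        return False
    releases = {minor_release(v) for v in requested.values()}
    if len({major for major, _ in releases}) > 1:
        return True
    minors = [minor for _, minor in releases]
    return max(minors) - min(minors) > max_minor_skew


if __name__ == "__main__":
    wanted = {"shoot-a": "1.9.5", "shoot-b": "1.11.0"}
    print("conflict:", versions_conflict(wanted))
```

Under a rule like this, a single team asking for a newer minor release would force a decision for every other control plane hosted in the same Seed cluster.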
To mitigate Istio version compatibility issues, it makes more sense to run the external control planes in their own managed clusters, where versions can be tightly controlled independently of the Seed clusters. Obviously these clusters will need to run close to the Seed clusters they service. Depending on your Seed cluster architecture, you may be running more than one istiod cluster per Seed cluster.
Conclusions about Istio External Control Plane
Hopefully by now you will see that the external control plane architecture is a powerful way to separate the Istio deployment architecture to optimize for operational separation. It is intended to manage Istio control planes and abstract that from the end-user’s Kubernetes clusters. This architecture is ideally suited for an “as a Service” deployment but even so introduces complexity and blurs failure domains.
Istio’s external control plane does not focus on multi-cluster/cross-cluster management of a service mesh that many large organizations are looking for. Here at Solo.io we developed Gloo Mesh to fill this gap. Gloo Mesh becomes your external control plane and gives you much more multi-cluster functionality that does not exist in Istio today. With the addition of our newest product Gloo Mesh Gateway we are enabling users to do more with Istio than ever before.
Contact us for a free demo.