Istio Ambient Waypoint Proxy Deployment Model Explained
Ambient mode is the new sidecar-less data plane introduced in Istio in 2022; it reached Beta status in May 2024. The Istio community is working hard to drive ambient to GA in the upcoming Istio 1.24 release. Ambient splits Istio's functionality into two distinct layers: a secure overlay layer and a Layer 7 processing layer. As I work with many users kicking the tires on ambient, I wanted to clarify a few common questions and points of confusion around the waypoint proxy, an optional Envoy-based component that handles L7 processing for the workloads it manages.
What are the common deployment models for waypoint proxies?
Conceptualizing the waypoint proxy
You can think of a waypoint proxy simply as a gateway for a group of destinations, where the destinations could be one or more services and workloads from one or many namespaces. It is not that different from your ingress or egress gateway, other than that it handles L7 processing for traffic inside your cluster. This is also why you deploy a waypoint proxy using a Kubernetes Gateway resource.
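For illustration, here is a minimal sketch of a waypoint Gateway, along the lines of what istioctl waypoint generate produces in recent Istio releases (the name and namespace are examples):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: waypoint
  namespace: ns-1                    # example namespace
  labels:
    istio.io/waypoint-for: service   # this waypoint serves services (the default)
spec:
  gatewayClassName: istio-waypoint   # tells Istio to deploy a waypoint proxy for this Gateway
  listeners:
  - name: mesh
    port: 15008                      # HBONE port ztunnel uses to reach the waypoint
    protocol: HBONE
```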
What I really love about the waypoint architecture is its flexibility: you can choose an architecture that suits your needs. Below are a few common patterns:
A: Waypoint proxy for a namespace
The most common pattern we observe is that each team owns a namespace in a cluster and manages the services and deployments within it. In this pattern, we expect each team to have its own waypoint proxy used by the services within the namespace. Following this pattern, we designed the default waypoint deployment to be namespace-scoped, to be used by all the services within the namespace. In the diagram below, services A, B, and C are in the same namespace. All client traffic in the ambient mesh to service A, B, or C goes through the namespace waypoint proxy, as programmed by the client's ztunnel.
Note: Ztunnel is omitted on the destination side for simplicity
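Adopting this default pattern is just a matter of two namespace labels; a minimal sketch, assuming a waypoint Gateway named waypoint already exists in the namespace:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ns-1
  labels:
    istio.io/dataplane-mode: ambient  # enroll the namespace's workloads in ambient mesh
    istio.io/use-waypoint: waypoint   # send their L7 traffic through the 'waypoint' Gateway
```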
B: Waypoint proxy for some services
What if you don't want traffic to service B to go through the waypoint proxy? One common reason is that you only need L4 enforcement on traffic to service B, which eliminates the need for a hop through the waypoint proxy when calling it. For example, you have a web service that calls a MySQL database service for which you only need L4 enforcement. Another example is 'internal' services, ones that are only called by other services within the namespace, where you only need authorization policies across namespace boundaries. For example, in the well-known Bookinfo application, you can enable zero trust for the productpage service by denying all traffic except GET requests to port 9080 of the productpage service, while continuing to allow all traffic to internal services such as reviews or details within the namespace.
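A hedged sketch of such a zero-trust policy for productpage, attached to the Service so the waypoint enforces it (this assumes the Bookinfo sample runs in ns-1; with an ALLOW policy in place, anything not matched is denied):

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: productpage-viewer
  namespace: ns-1
spec:
  targetRefs:                 # attach to the productpage service; its waypoint enforces this
  - kind: Service
    group: ""
    name: productpage
  action: ALLOW
  rules:
  - to:
    - operation:
        ports: ["9080"]
        methods: ["GET"]      # only GET requests to port 9080 are allowed
```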
Another use case is that service A has very heavy traffic, so you want a dedicated waypoint proxy to handle its L7 processing, allowing you to adjust its resource configuration for the heavy load. You can configure service B or C to use a different waypoint so they are not impacted by their noisy neighbor (service A), or to skip the waypoint entirely, depending on your needs. I'll touch more on this in the section below.
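Istio lets you override the namespace default at service granularity: an istio.io/use-waypoint label on a Service takes precedence over the namespace label. A sketch, where service-a-waypoint is a hypothetical dedicated waypoint deployed alongside the namespace one:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: service-a
  namespace: ns-1
  labels:
    istio.io/use-waypoint: service-a-waypoint  # hypothetical dedicated waypoint, just for service A
spec:
  selector:
    app: service-a
  ports:
  - port: 80
    targetPort: 8080
```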
C: Multiple namespaces sharing a waypoint proxy
You can also configure multiple namespaces to share a waypoint proxy, by labeling each namespace with the istio.io/use-waypoint label and configuring allowedRoutes in the Gateway resource for the waypoint proxy. If you are running a small-footprint cluster, where you want to optimize for resource efficiency and are less concerned about noisy-neighbor issues, you may want one waypoint proxy for the whole cluster. A common example of this is what is known as K8S-at-the-edge.
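A sketch of both sides of this arrangement: the waypoint's Gateway opens its listener to routes from other namespaces, and each consuming namespace points at it via the istio.io/use-waypoint and istio.io/use-waypoint-namespace labels (names are examples):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: waypoint
  namespace: ns-1
  labels:
    istio.io/waypoint-for: service
spec:
  gatewayClassName: istio-waypoint
  listeners:
  - name: mesh
    port: 15008
    protocol: HBONE
    allowedRoutes:
      namespaces:
        from: All                               # allow other namespaces (e.g. ns-2) to use this waypoint
---
apiVersion: v1
kind: Namespace
metadata:
  name: ns-2
  labels:
    istio.io/dataplane-mode: ambient
    istio.io/use-waypoint: waypoint             # name of the shared waypoint
    istio.io/use-waypoint-namespace: ns-1       # namespace where the shared waypoint lives
```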
“Using a waypoint for multiple namespaces is more resource efficient when the application spans many namespaces, without compromising performance or application availability, by elevating the concept of the sidecar-less service mesh and centralizing L7 capabilities in a different namespace,” said Ahmad Al-Masry, DevSecOps Engineering Manager.
Another example is a team that owns a few namespaces forming a single trust domain, where you want to keep resource costs down by sharing one waypoint proxy among them. In the diagram below, all services from the ns-1 and ns-2 namespaces use the waypoint proxy deployed in the ns-1 namespace:
Because ns-2 shares the same waypoint proxy as ns-1, any policy deployed in ns-1 that is attached to the entire waypoint will impact both namespaces. For example, if you deploy an AuthorizationPolicy that is attached to the entire waypoint, the policy will be enforced by the waypoint for all of the services in both of the namespaces.
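For example, a sketch of a policy attached to the whole waypoint; the Gateway targetRef attaches it to every service the waypoint serves, across both namespaces:

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: waypoint-wide-viewer
  namespace: ns-1              # must live in the waypoint's namespace
spec:
  targetRefs:
  - kind: Gateway
    group: gateway.networking.k8s.io
    name: waypoint             # the shared waypoint; applies to services in ns-1 and ns-2
  action: ALLOW
  rules:
  - to:
    - operation:
        methods: ["GET"]       # example rule: only GETs are allowed behind the waypoint
```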
Why not simply deploy one waypoint proxy per node?
While it is convenient to configure a waypoint proxy as a gateway for the different destination scopes mentioned above, you may be wondering: why not keep it simple and deploy the waypoint proxy per node as a Kubernetes DaemonSet to handle all the L7 processing for that node?
Waypoint proxies need to be highly configurable, because applications may have radically different performance requirements and runtime customizations; for example, you can plug in your external authorization policy or your Wasm extensions that target only specific services. A bad Envoy config from one tenant could bring down a per-node proxy, affecting all workloads on that node.
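For instance, recent Istio releases let you attach a WasmPlugin to a single waypoint via targetRefs, so an extension only affects that tenant's traffic. A sketch, assuming a hypothetical OCI image for the extension:

```yaml
apiVersion: extensions.istio.io/v1alpha1
kind: WasmPlugin
metadata:
  name: custom-authz
  namespace: ns-1
spec:
  targetRefs:                  # attach only to this tenant's waypoint
  - kind: Gateway
    group: gateway.networking.k8s.io
    name: waypoint
  url: oci://registry.example.com/filters/custom-authz:v1  # hypothetical extension image
  phase: AUTHZ                 # run during the authorization phase
```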
Further, a noisy neighbor that requires heavy processing on a waypoint proxy could cause another application to perform poorly simply because it happens to be placed on the same node. By deploying Envoy per node as the waypoint proxy, you essentially force all the apps on the node to share the same failure domain, regardless of their latency or runtime requirements. Envoy has no tenancy controls, and you can't partition its CPU or memory to prevent noisy neighbors from starving other applications. A high-traffic application with complex L7 policies requires more waypoint CPU and memory, but the resource reservation for a Kubernetes DaemonSet is fixed: you can't scale a few specific DaemonSet pods horizontally or vertically based on traffic load without increasing the CPU and memory of all of the waypoint pods. Cost attribution is also hard in this model, as you can't properly attribute waypoint resource costs to each tenant.
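This is where dedicated waypoints pay off: each waypoint is a regular Deployment (generated from its Gateway, and in my experience named after it), so you can scale just the busy tenant's waypoint with a standard HorizontalPodAutoscaler. A minimal sketch, assuming the generated Deployment is named waypoint:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: waypoint
  namespace: ns-1
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: waypoint             # the Deployment Istio generates for the waypoint Gateway
  minReplicas: 1
  maxReplicas: 5               # scale only this tenant's waypoint with its load
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
```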
Let us walk through this with a concrete example. In my environment, I have the Bookinfo application deployed in two namespaces, and each namespace has its own namespace waypoint with the default configuration. I have configured the waypoint proxy for each namespace to be co-located with the productpage pod from each Bookinfo application. I deployed a simple L7 AuthorizationPolicy in each namespace to only allow certain namespaces to perform the GET method:
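A sketch of the kind of policy I am describing, attached to the namespace's waypoint; the allowed source namespace here is a placeholder to adjust for your environment:

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: allow-get-from-clients
  namespace: ns-1
spec:
  targetRefs:
  - kind: Gateway
    group: gateway.networking.k8s.io
    name: waypoint
  action: ALLOW
  rules:
  - from:
    - source:
        namespaces: ["fortio"]   # hypothetical: only the load-client namespace may call in
    to:
    - operation:
        methods: ["GET"]         # and only with the GET method
```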
Note: Ztunnel is omitted on the source and destination sides for simplicity
When I send 6500 RPS from Fortio to the productpage service in the first namespace, and the same RPS from Fortio to the productpage service in the second namespace starting 3 seconds later, the average response time is 30.8ms for the first namespace and 30.4ms for the second, per the two diagrams below:
Namespace 1: Fortio to the productpage service in the first namespace through its waypoint: 6500 QPS 650 connections
Namespace 2: Fortio to the productpage service in the second namespace through its waypoint: 6500 QPS 650 connections
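To reproduce a comparable load (6500 QPS over 650 connections), here is a sketch of the Fortio client run as a Kubernetes Job; the target URL and duration are assumptions about my setup:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: fortio-load-ns-1
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: fortio
        image: fortio/fortio
        args:
        - load
        - -qps
        - "6500"               # target queries per second
        - -c
        - "650"                # 650 concurrent connections, as in the test
        - -t
        - "60s"                # assumed test duration
        - http://productpage.ns-1.svc.cluster.local:9080/productpage
```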
Because both Bookinfo apps have very heavy load, it makes sense for each to have its own waypoint proxy so they do not starve each other. In this case, what if the waypoint proxy were deployed per node instead?
I could deploy a waypoint proxy as a DaemonSet manually with very complex Envoy configuration. On the node where both productpage pods run, the waypoint proxy pod would be used by both of them. However, this would require deep knowledge of Envoy configuration which I lack. A simpler way to test is to configure both namespaces to use the same waypoint proxy, which is recommended for apps that don't have very heavy load or need a dedicated waypoint proxy (refer to pattern C earlier for more info). Note: I would not recommend pattern C for this case, as both Bookinfo apps have very heavy load.
After making sure the waypoint proxy pod in the ns-1 namespace was also deployed on the same node as both productpage pods, I made a simple change to configure both namespaces to use the waypoint proxy in the ns-1 namespace. This let me quickly test a single waypoint proxy pod used by both productpage pods on the same node, as if I had deployed the waypoint as a DaemonSet.
When I sent 6500 RPS from my Fortio load client to the productpage service in the first namespace, the average response time grew from 30ms to 59ms! The same load to the productpage service in the second namespace, started after 3 seconds (repeating the earlier test, except through the shared waypoint), reported 32% errors, up from 0% earlier! What a big impact when both of them share the same waypoint proxy on the node! As both applications became very noisy under heavy load, the waypoint proxy hit its default CPU limit (2 cores) and upstream connection limit (1024), so many of the requests from the second namespace, which started slightly later, failed with a high percentage of errors. The Kubernetes worker node still had plenty of headroom, at about 50% CPU and 6% memory utilization at peak.
Namespace 1: Fortio to the productpage service in the first namespace through shared waypoint: 6500 QPS 650 connections
Namespace 2: Fortio to the productpage service in the second namespace through shared waypoint: 6500 QPS 650 connections
This exercise demonstrated that sharing a single per-node waypoint proxy, and therefore a single failure domain, can easily trigger failure scenarios with very noisy neighbors, which can simply be avoided by giving very busy tenants dedicated waypoints. While very noisy neighbors were used in this test, a bad Envoy configuration or a specific Envoy extension could trigger similar issues when multiple tenants share the same failure domain.
Learn more about waypoints
If you understand gateways, you can make sense of waypoints. While I am excited about the default waypoint configuration, where a waypoint serves all services in a namespace, I am even more excited about the flexibility of waypoint deployment models based on your security, application, or resource requirements!
At Solo.io we are working with customers deploying mesh technologies to consistently manage application traffic in all directions. We’ve built Istio ambient in the community and provide Istio ambient GA support in our Gloo Mesh product to help you further simplify your mesh adoption. Explore Istio ambient today, and reach out to us for any questions.