Fewer Moving Parts: How Ambient Mesh Simplifies Istio Operations
I’ve learned a handful of truths over a professional career that now spans almost 35 years. One of them is summed up in Occam’s Razor, or the Law of Parsimony. It’s the problem-solving principle that “entities should not be multiplied beyond necessity”. In other words, all else being equal, simple is better than complex.
I’ve seen Occam’s Razor applied intelligently across a range of human pursuits:
- Science – Isaac Newton: “We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances.”
- Philosophy – Aristotle: “We may assume the superiority ceteris paribus [other things being equal] of the demonstration which derives from fewer postulates or hypotheses.”
- Law – Jeffery Atik, Loyola Law School Professor: “Law accretes novel tests and additional considerations, evolving rules that appear to be more precise, yet which burden the law with undesirable complexity.”
In my own career, I’ve seen this principle confirmed over and over in a systems engineering context.
For example, I recently chatted with an engineer who had used open-source Istio in production at an earlier stop in his career. He appreciated the ease with which mTLS security policies could be applied throughout the mesh without making code changes to the underlying services. But operationally he found it quite challenging. He talked about problems like version upgrades and the resulting race conditions leading to system outages. His Istio operational experience didn’t nullify his appreciation of the Connect, Secure, Observe pillars of Istio. But he was looking for a way to ease the ops burden without giving up the benefits he loved.
Does Istio Ambient Mesh resolve these problems?
And if so, how does it make your operational life simpler?
How did we get here?
First, I think some background is useful. Recall the pre-Istio days. Let’s assume that in that “prehistoric” era circa 2017, you had a security compliance requirement, a frequent driver of modern Istio adoption. You needed to ensure that all the services in your application network had their inter-service communication secured by mTLS. How would you have achieved that? Well, you might have created a wiki page laying out the security mandate. You might have provided pointers to a list of acceptable libraries that your application teams could use to add mTLS to their services. You would have specified a deadline by which all the secure system connections would be tested. And then you’d wait.
But then you hear about this new open-source project called Istio. Using a very different approach, you inject Envoy proxies as sidecars to all your services. No modification to underlying application code required. You specify a simple policy or two. Voila! Nearly instant mTLS security throughout your application network.
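To make "a simple policy or two" concrete, here is a minimal sketch of what such a policy might look like, built as a Python dictionary and printed as JSON. It assumes Istio's PeerAuthentication resource and an installation whose root namespace is "istio-system"; the helper name strict_mtls_policy is purely illustrative and not part of any Istio tooling.

```python
import json


def strict_mtls_policy(namespace: str = "istio-system") -> dict:
    """Build a PeerAuthentication manifest that requires mTLS.

    Assumption: "istio-system" is the mesh's root namespace, so a policy
    named "default" placed there applies mesh-wide.
    """
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "PeerAuthentication",
        "metadata": {"name": "default", "namespace": namespace},
        # STRICT means workloads accept only mutual-TLS traffic.
        "spec": {"mtls": {"mode": "STRICT"}},
    }


if __name__ == "__main__":
    # Kubernetes accepts JSON manifests as well as YAML.
    print(json.dumps(strict_mtls_policy(), indent=2))
```

The printed manifest could be applied with the usual Kubernetes tooling. The point is how little the application teams have to do: the policy lives in the mesh, not in the service code.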
Istio 1.0 with sidecars was a tremendous innovation, and enabled many organizations to scale up their service networks by reducing the difficulty in connecting, securing, and observing them.
But user experience gained since Istio’s inception teaches us that those substantial gains do not come without tradeoffs.
One of those tradeoffs is the operational overhead required to maintain an Istio network of sidecars. It’s still an acceptable tradeoff for Istio adopters, but many over the years have speculated that there should be a better way.
Both Google and Solo.io are leaders on the Istio project, and so are always looking for ways to improve the Istio user experience. They were coincidentally working on similar but separate efforts to overhaul the Istio data plane architecture. The goal was a sidecar-less alternative that did not compromise Istio’s fundamental security benefits. In other words, users could establish a service mesh that was ambient, offering the ability to declare policies for a network of services without requiring individual workloads to have Envoy sidecar proxies injected. Their collaboration culminated in the experimental release of Istio Ambient Mesh earlier this month.
Early feedback from the community has been overwhelmingly positive. Matt Klein of Lyft, the creator of Envoy proxy, quickly tweeted his approval of the new, simpler data plane architecture.
What challenges does Ambient address?
Istio Ambient Mesh substantially reduces the overhead of operating a service mesh. For example, with sidecars, let’s assume there is an Envoy CVE that requires an update to Istio’s sidecar proxies. What does that update process look like? While simple in concept, for a large enterprise with extensive service networks, it could be a heavy lift. Maybe you have thousands of microservice instances, each in its own Kubernetes pod, spread across hundreds of nodes. That translates into thousands of injected sidecars that require upgrades. There’s no way to apply those upgrades without performing a rolling update of all the attached application workloads. While it’s easy to diagram that on a whiteboard, reality teaches us that it can be an error-prone process that risks service outages.
What if race conditions between sidecars and their application workloads come into play as they shut down and spin back up again? What if you encounter batches of failures in the sidecar upgrades? We can work to minimize these risks through solid DevOps practices and testing. But for a large-scale, pervasive change like a sidecar update throughout a large network, you cannot drive these risks to zero.
That kind of complexity keeps me up at night. We strive for simplicity.
How does Ambient simplify operations?
Ambient reduces Istio operational complexity in two ways:
- By reducing the number of moving parts in your enterprise-scale service mesh; and
- By changing the nature of the moving parts from injected sidecars to independent infrastructure.
Consider these before-and-after diagrams. We initially deploy a single sidecar per pod in the original Istio model. Because it is an injected sidecar, its lifecycle is inextricably tied to its companion workload. When we update and bounce the sidecar, we must bounce the application as well. In this depiction, there are nine separate applications utilizing twelve pods across three nodes. Each of those pods requires its own sidecar, for a total of twelve sidecars.
In the Ambient model, sidecars are no longer required. You can use them where you like, and they interoperate seamlessly with Ambient-enabled services. But sidecars are not mandatory.
Let’s assume you are like many Istio users and that your primary concern is securing the internals of your mesh. In other words, you want to start by using mTLS to secure the communication channels within your mesh. But other features like L7 security policies and observability can wait until later.
In that case, all you need are ztunnels (zero-trust tunnels), specially configured L4-only proxies that mediate cross-node traffic between services and ensure proper mTLS connectivity. Ztunnels are configured by the Istiod control plane.
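As a rough illustration of what opting workloads into the ztunnel layer might involve, the sketch below builds a Namespace manifest carrying the istio.io/dataplane-mode label that Ambient uses to enroll a namespace. The namespace name "demo" and the helper name ambient_namespace are hypothetical, and the exact enrollment mechanism may differ between Istio releases, so treat this as an assumption to check against the documentation for your version.

```python
import json


def ambient_namespace(name: str) -> dict:
    """Build a Namespace manifest labeled for Ambient mode.

    Assumption: the mesh enrolls namespaces via the
    "istio.io/dataplane-mode" label; no sidecar injection label is set.
    """
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {"istio.io/dataplane-mode": "ambient"},
        },
    }


if __name__ == "__main__":
    # "demo" is a placeholder namespace name.
    print(json.dumps(ambient_namespace("demo"), indent=2))
```

Note what is absent: nothing in the manifest touches the pods themselves, which is the operational point of the ztunnel model.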
Note three significant operational benefits from this scenario. First, we have significantly reduced the number of moving parts. In the original Istio model, twelve separate L7-capable Envoy sidecars were present, one per Kubernetes pod. Now we have reduced that to three lighter-weight L4-only proxies, or one per Kubernetes node. Of course, your mileage may improve if you pack a greater number of pods into each node.
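The arithmetic behind that first benefit is simple enough to sketch. The snippet below is a toy model, not a capacity-planning tool: it assumes one sidecar per pod in the original model and one ztunnel per node in the Ambient model, and the Cluster type and function names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    nodes: int
    pods: int


def sidecar_proxies(cluster: Cluster) -> int:
    """Original model: one injected Envoy sidecar per pod."""
    return cluster.pods


def ztunnel_proxies(cluster: Cluster) -> int:
    """Ambient model, L4 only: one ztunnel per node."""
    return cluster.nodes


def app_restarts_for_proxy_upgrade(cluster: Cluster, sidecars: bool) -> int:
    """Application pods that must be bounced to upgrade the data plane.

    Sidecars share a lifecycle with their workload, so every pod restarts.
    Ztunnels are separate infrastructure, so no application pod restarts.
    """
    return cluster.pods if sidecars else 0


if __name__ == "__main__":
    example = Cluster(nodes=3, pods=12)  # the scenario described above
    print(sidecar_proxies(example), "sidecars vs", ztunnel_proxies(example), "ztunnels")
    print(app_restarts_for_proxy_upgrade(example, sidecars=True), "app restarts vs",
          app_restarts_for_proxy_upgrade(example, sidecars=False))
```

For the twelve-pod, three-node scenario this prints 12 sidecars vs 3 ztunnels, and 12 application restarts vs 0. Plugging in a few thousand pods across a few hundred nodes shows why the gap matters at enterprise scale.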
Second, the moving parts that remain are no longer injected as sidecars into application workloads. Updates to ztunnels are easier and bear less risk. We no longer worry about race conditions breaking our service network when the microservice instances and their sidecars spin up and down. Careful testing can mitigate these risks with sidecars as well, but the Ambient model is simpler and in many cases offers fewer opportunities for failure.
Even removing this functionality from your service network is simpler. With Ambient mode, Istio can disable mTLS and remove ztunnels altogether without disrupting mesh services at all. Try doing that with a sidecar! Check out this Istio Ambient demo to see that capability in action.
Third, the Ambient model enables organizations to incrementally adopt priority benefits of Istio without assuming the burden of the entire operational model. Here, we have enabled policy-driven mTLS security among services by deploying just the per-node L4 ztunnels without using L7 waypoint proxies or sidecars.
But the operational benefits don’t stop with enabling mTLS. Istio Ambient mode also introduces a slimmer deployment profile where L7 policies are required too. For example, let’s assume that we’d also like to inject higher-level L7 security policies (e.g., app 2 can only issue HTTP GET requests to app 9 but no other types like POSTs or PUTs) into our mesh. These can’t be handled by our L4 ztunnels alone.
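For readers who have not written one, the GET-only rule described above might look roughly like the following Istio AuthorizationPolicy, again built as a Python dictionary. The labels, service account names, namespace, and helper name are all hypothetical, the trust domain is assumed to be the default "cluster.local", and the selector-based attachment shown here is the familiar sidecar-era form; how a policy binds to a waypoint proxy in Ambient mode may differ by release.

```python
import json


def get_only_policy(source_sa: str, target_app: str, namespace: str = "default") -> dict:
    """Allow only HTTP GET from one service account to one app.

    Assumptions: workloads carry an "app" label, identities follow the
    "cluster.local/ns/<namespace>/sa/<service-account>" form, and both
    apps live in the same namespace.
    """
    principal = f"cluster.local/ns/{namespace}/sa/{source_sa}"
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "AuthorizationPolicy",
        "metadata": {"name": f"{target_app}-allow-get", "namespace": namespace},
        "spec": {
            "selector": {"matchLabels": {"app": target_app}},
            "action": "ALLOW",
            "rules": [
                {
                    "from": [{"source": {"principals": [principal]}}],
                    # Matching on HTTP methods is what makes this an L7 rule.
                    "to": [{"operation": {"methods": ["GET"]}}],
                }
            ],
        },
    }


if __name__ == "__main__":
    # "app-2" and "app-9" stand in for app 2 and app 9 from the text.
    print(json.dumps(get_only_policy("app-2", "app-9"), indent=2))
```

Because this is an ALLOW policy, any request to the selected workload that matches no rule is denied. That covers POSTs and PUTs from app 2, but also traffic from any other caller, so a real deployment would add rules for those callers as needed.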
To supply L7 policies, Istio Ambient introduces the notion of a Waypoint Proxy, an L7-capable Envoy proxy that is managed on a per-Service-Account basis. You control deployment of these waypoint proxies into your Kubernetes cluster and scale them as needed. Note that like ztunnels, waypoint proxies are not injected sidecars. They are separate infrastructure that is programmed by Istiod and independently scaled. And just as with ztunnels, upgrading waypoint proxies is less risky than upgrading sidecars, because it does not require bouncing application workloads.
Work through a hands-on example of waypoint proxy deployment by enrolling in the free Solo Academy course on Istio Ambient Mesh.
Conclusions and Further Reading
In this article, we explored how Istio Ambient Mesh proves the age-old engineering Law of Parsimony, that simple is better than complex. By providing the option to remove application sidecars in favor of ztunnels and waypoint proxies deployed as infrastructure, we reduce the number of moving parts in our overall system. And in doing so, we enable incremental adoption and generally slash the operational burden of deploying a service mesh.
To learn more, we suggest the following resources:
- Read the Solo announcement blog post.
- Explore the free Solo Academy course on Istio Ambient Mesh.
- Listen to this Google Kubernetes Podcast episode with interviews of some of the creators of Ambient.
- Learn more about Gloo Mesh, the first commercial service mesh platform offering Ambient as a deployment option.
- Watch Solo Field CTO Christian Posta’s 10-minute Ambient Mesh demonstration.