Reduce Cloud Cost by 90% or More with Istio Ambient

Cut Service Mesh Overhead by 90% or More with Istio Ambient Mesh

Istio Ambient Mesh is the new Istio data plane, introduced to the Istio community on September 7, 2022, with leading contributions from Solo.io and Google engineers. Read the full announcement on Istio’s website. Istio Ambient Mesh will be included in Istio 1.18, but pre-release builds are available now.

Sidecars have been a staple of Istio’s architecture since day one and are responsible for the majority of features available in Istio today. However, sidecars require the injection of an additional container to each Kubernetes pod resource, each of which needs allocated resources from the pod.

Many of you are familiar with the simplified operation brought by the Istio ambient architecture. Let’s explore how Istio Ambient Mesh cuts the service mesh costs typically associated with sidecars.

Check Out the Savings

Our test scenario deploys one Fortio client instance and three different versions of the httpbin service, each scaled to 10 replicas. The Fortio client sends requests to version 1 of httpbin for a few minutes, then does the same for version 2, and finally for version 3. We deployed Istio in several scenarios and compared resource usage and allocation across the runs. The scenarios are:

Istio with sidecar
Ambient with L4 ztunnel only
Ambient with L4 ztunnel and L7 waypoint proxies

Spoilers! There are significant savings across the board for all ambient scenarios when compared to sidecars – up to 99% savings in usage and 90% in allocation.

*Total Memory and CPU: Consumption and Allocation*

Looking at total CPU and memory consumption, we have to remember that in the ambient architecture there is no sidecar container in each application pod in the mesh. As a result, Istio’s data plane in the ztunnel-only ambient scenario uses 1% of the memory used in the sidecar scenario, and still only 10% when waypoints are added. Looking at CPU, ztunnel once again uses 1% of what the sidecar scenario requires, and 15% when waypoints are deployed.

Moving on to allocation, every sidecar resource has a default request of 100 millicores vCPU and 128Mi memory. Assuming ztunnels and waypoint proxies have similar requests and limits as sidecars, that’s a 90% reduction in allocated resources between L4 ambient and sidecar and 80% with waypoints!

Where the Savings Are Coming From

Ambient mesh was designed to minimize resource requirements for users in their Kubernetes clusters. To explain how ambient does this, we must first clarify allocation versus utilization. When deploying a Kubernetes cluster on a hosted environment, nodes determine the overall capacity of the cluster and customer deployments, and pods are an allocation of that capacity. Utilization is a measure of how well this is done. As an architecture, sidecars interfere with effective utilization as they:

Define a high minimum for allocation at any scale
Can strand capacity by reserving more than is needed
Shift operational burden to adjust allocation if more than the default is needed

The sidecar architecture forces this mode of allocation on users. Ambient solves these issues by leveraging a new architecture that separates the responsibilities of zero-trust networking and L7 policy handling. This is done with two new components to Istio: ztunnels and waypoint proxies.

Ztunnels are a brand new Istio component, written in Rust and designed to be fast, secure, and lightweight. They are deployed per node on a cluster and enable the most basic service mesh configurations for L4 features such as mTLS, telemetry, authentication, and L4 authorization.
Waypoint proxies provide L7 mesh features such as VirtualService routing, L7 telemetry, and L7 authorization policies. Waypoints are still based on Envoy and are deployed at the namespace level per ServiceAccount.

These ztunnels and waypoint proxies work in tandem to replace sidecars in the Istio service mesh. So let’s take a closer look at how the two architectures compare in the tests above.

A Closer Look

*Captures of CPU and memory usage by pod for sidecar and ambient pods for three scenarios*

Let’s start by looking at CPU usage by pod. In the sidecar scenarios, the container utilizing the most CPU resources is the Fortio client sidecar istio-proxy, which is responsible for sending traffic to all pods during the test. The httpbin server istio-proxy containers see very few changes since the requests are load balanced across N httpbin replicas and their sidecars.

In the ambient scenarios, each ztunnel instance sees small spikes as it handles cross-node traffic for the different requests. These ztunnel spikes depend on which nodes the Fortio client and httpbin server reside on as the different versions are hit. In the ambient scenario with both ztunnel and waypoint proxies, there are clear spikes in the waypoint proxies as a particular version of httpbin is called, since one waypoint captures traffic for all instances of that version.

Though the waypoints consume resources similar to those of sidecars, the Rust-based ztunnel has a much smaller CPU utilization. An individual ztunnel uses less than 20% of the CPU of a sidecar, and all three ztunnels for the entire cluster combined use less than a single sidecar!

Next is memory usage by pod. In all Istio scenarios, memory usage stays relatively constant for each pod during the test runs. In both ambient scenarios, the L4 ztunnel consumes so little memory that it is hard to display on the same scale as sidecar usage. Waypoint proxies consume a similar amount of resources as sidecars do, with only minor improvements. So what do these per-pod usage improvements mean for the totals across the cluster? Everything.

*Captures of total CPU and memory comparing total workload to Istio dataplane usage for three scenarios*

*Captures of CPU and memory by pod for sidecar and ambient pods in stacked view for three scenarios*

Looking at total CPU and memory utilization, we have to remember that the sidecar scenario requires 31 sidecar containers (one for the Fortio client and 30 for the httpbin servers), while ambient requires only three ztunnel containers and three waypoint proxies. The stacked CPU and memory by pod graphs are excellent at highlighting just how many additional containers the sidecar scenario runs. The Istio data plane in the ztunnel-only ambient scenario uses 1% of the memory used in the sidecar scenario, and still only 10% when waypoints are added. Looking at CPU, ztunnel once again uses 1% of what the sidecar scenario requires, and 15% when waypoints are deployed.

Finally, let’s consider allocation since the graphs above have only covered usage. Every sidecar resource has a default request of 100 millicores vCPU and 128Mi memory, as well as limits set for 2 vCPUs and 1Gi memory. For simplicity, we assume ztunnels and waypoint proxies have similar requests and limits as their sidecar counterparts – even though every measurement so far has suggested ztunnels will require significantly less. Breaking it down, that’s a 90% reduction in allocated resources with ztunnel, and 80% when waypoints are included.
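The allocation percentages follow directly from the proxy counts. Here is a minimal sketch of that arithmetic, using the default requests and the container counts quoted above and the same simplifying assumption that ztunnels and waypoints request as much as a sidecar. The function and variable names are illustrative and are not taken from the test scripts.

```python
# Default per-proxy requests quoted in the article.
CPU_REQUEST_MILLICORES = 100
MEMORY_REQUEST_MI = 128

# Proxy counts from the test: 1 Fortio client + 30 httpbin pods get sidecars;
# ambient runs 3 ztunnels, plus 3 waypoints when L7 is enabled.
SCENARIOS = {
    "sidecar": 31,
    "ambient L4 (ztunnel)": 3,
    "ambient L4 + L7 (ztunnel + waypoint)": 3 + 3,
}


def allocation(proxy_count):
    """Total requested (millicores, Mi) if every proxy uses the sidecar defaults."""
    return proxy_count * CPU_REQUEST_MILLICORES, proxy_count * MEMORY_REQUEST_MI


def reduction_vs_sidecar(proxy_count, sidecar_count=31):
    """Fraction of the sidecar allocation that is no longer requested."""
    return 1 - proxy_count / sidecar_count


for name, count in SCENARIOS.items():
    cpu, mem = allocation(count)
    print(f"{name}: {cpu}m CPU, {mem}Mi memory, "
          f"{reduction_vs_sidecar(count):.0%} less than sidecars")
```

With these counts, the ztunnel-only scenario requests 300m CPU and 384Mi memory against 3100m and 3968Mi for sidecars, which is the roughly 90% reduction quoted above. Adding waypoints doubles the ambient figure and still leaves a reduction of about 80%.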

Going further, we can calculate a dollar amount for these numbers by referring to GCP monthly pricing for a monthly cost per GB memory and per CPU. Consider these two different machines and their costs:

n2-standard-4 (4CPU, 16GB) at $141/month
n2-highmem-4 (4CPU, 32GB) at $191/month

The $50 price difference buys 16GB of additional memory, so 1GB costs approximately $3.13/month. Similarly for CPU:

n2-standard-4 (4CPU, 16GB) at $141/month
n2-highmem-2 (2CPU, 16GB) at $95/month

Calculating the difference results in 1 CPU costing approximately $23/month. We can literally put a dollar amount on the savings ambient brings!
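To make the dollar figure concrete, here is a rough sketch that derives the unit prices from the machine prices listed above and applies them to the default sidecar requests (100 millicores and 128Mi per proxy). With the rounded prices quoted here, the memory division is $50 over 16GB, or about $3.13 per GB, and the CPU division is $46 over 2 CPUs, or $23 per CPU. The helper names are hypothetical, Mi is treated as interchangeable with MB for simplicity, and actual GCP prices vary by region and over time.

```python
# Monthly prices (USD) as quoted in the article; treat them as illustrative.
MACHINES = {
    "n2-standard-4": {"cpu": 4, "memory_gb": 16, "usd_per_month": 141},
    "n2-highmem-4": {"cpu": 4, "memory_gb": 32, "usd_per_month": 191},
    "n2-highmem-2": {"cpu": 2, "memory_gb": 16, "usd_per_month": 95},
}


def unit_cost(larger, smaller, resource):
    """Price difference divided by the difference in one resource.

    The two machines must match on the other resource for this to isolate
    the cost of `resource`.
    """
    a, b = MACHINES[larger], MACHINES[smaller]
    return (a["usd_per_month"] - b["usd_per_month"]) / (a[resource] - b[resource])


usd_per_gb = unit_cost("n2-highmem-4", "n2-standard-4", "memory_gb")  # 50 / 16
usd_per_cpu = unit_cost("n2-standard-4", "n2-highmem-2", "cpu")       # 46 / 2


def monthly_request_cost(proxy_count, cpu_millicores=100, memory_mi=128):
    """Monthly cost of the capacity reserved by `proxy_count` proxies."""
    cpus = proxy_count * cpu_millicores / 1000
    gb = proxy_count * memory_mi / 1024  # simplification: Mi treated as MB
    return cpus * usd_per_cpu + gb * usd_per_gb


for name, count in [("sidecar", 31), ("ztunnel only", 3), ("ztunnel + waypoint", 6)]:
    print(f"{name}: ${monthly_request_cost(count):.2f}/month reserved")
```

The absolute amounts are small for a test cluster of this size. The point of the sketch is that the reserved cost scales with the number of proxies, which is per pod for sidecars and per node (plus per waypoint) for ambient.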

Test It Out!

In comparison to what most users run Istio with in production, this test cluster is tiny. However, we expect even more savings with larger clusters and when more services are deployed. We encourage everyone to see what the savings with Istio ambient mesh look like in their environments. The test scripts have been pushed to GitHub, so feel free to check them out here. To track CPU and memory usage throughout the test scenarios, Prometheus, node-exporter, and Grafana are installed. A custom Grafana dashboard was created for observing the relevant data, which can be found and imported from GitHub here.

Conclusion

Ka-ching. These results were collected with a pre-alpha version of ambient which is now merged into the main branch.

Ambient service mesh’s goal of reducing infrastructure costs is bearing fruit, a solid step forward on its roadmap to production readiness. These early numbers suggest users could cut the mesh data plane’s resource usage by up to 99% and its allocated resources by 90%, especially if they only require an L4 mesh.

Learn More About Istio Ambient Mesh

Check out these resources to learn more:

Announcing Istio Ambient Mesh by Idit Levine – Solo.io
Introducing Ambient Mesh article from John Howard – Google, Ethan J. Jackson – Google, Yuval Kohavi – Solo.io, Idit Levine – Solo.io, Justin Pettit – Google, Lin Sun – Solo.io
Get Started with Ambient Mesh guide by Lin Sun – Solo.io, John Howard – Google
Ambient Mesh Security Deep Dive article by Ethan Jackson – Google, Yuval Kohavi – Solo.io, Justin Pettit – Google, Christian Posta – Solo.io
Introducing Rust-Based Ztunnel for Istio Ambient Mesh article by Lin Sun – Solo.io, John Howard – Google
Istio Ambient Waypoint Proxy Made Simple article by Lin Sun – Solo.io, John Howard – Google
Istio Ambient Service Mesh Merged to Istio’s Main Branch article by Lin Sun – Solo.io, John Howard – Google
On demand workshop: Get Started with Istio Ambient Mesh (with Ambient Mesh Foundation Certification)
Workshop: Ambient Mesh In-Depth Routing Analysis
The Cloudcast podcast with Louis Ryan – Solo.io, Christian Posta – Solo.io
