Understanding Istio Ambient Ztunnel and Secure Overlay
There are a variety of tunnel mechanisms that can be used today to create connectivity between disconnected or remote networks. Traditionally, these have been used to overcome limitations in the WAN, or to add an additional layer of encryption.
- IPsec, a protocol used to provide encryption and authentication between two remote networks using the internet as a medium
- VXLAN/GENEVE, an overlay protocol used to carry MAC-in-UDP, and allow for extending Layer 2 networks across Layer 3 boundaries
- NVGRE, an encapsulation and tunneling protocol that carries Ethernet frames in GRE, extending Layer 2 networks over IP networks
These tunneling techniques paved the way for the overlays used in service meshes today.
The way microservices have evolved has demonstrated a need for flexible connectivity options while maintaining zero trust, and as such, Istio has evolved to provide a sidecar-less data plane called ambient mesh. You can read more about the announcement from the Istio community over here. Careful consideration went into ensuring that the properties of zero trust, such as identity and encryption, were maintained in the development of ambient mesh and its sidecar-less approach.
Ambient mesh has taken the service-mesh functionality and broken it into two complementary layers: one that focuses on securing Layer 4 connectivity and one that implements Layer 7 policy and behaviors.
- L4 → ztunnel
- L7 → waypoint proxy
With ambient mesh, the concepts of the ztunnel and the waypoint proxy were introduced. The zero-trust tunnel, or ztunnel, is strictly responsible for L4 connectivity. Ztunnels create overlays to each other as well as to waypoint proxies, and the waypoint proxy is responsible for L7 policy enforcement.
Welcome to ztunnel
Ztunnel, or “zero-trust tunnel”, is a secure overlay layer that secures connections between services. The key features of ztunnel are:
- Security: Mutual TLS encrypted communication among your applications with cryptographic-based identity and L4 authorization policy
- Observability: TCP metrics and logs
- Connection multiplexing and balancing
Ztunnel originated from the idea of a zero-trust tunnel where identity can be maintained on both ends using the SPIFFE IDs of the workloads. Ztunnel is deployed as a DaemonSet and facilitates the creation of a secure overlay using the HTTP-Based Overlay Network Encapsulation protocol, or HBONE for short.
HBONE runs on a dedicated port (15008) between the proxies in the data plane and uses mTLS and strong identity, similar to how Istio currently works. However, HBONE is hidden from workloads: it is not used directly by applications. The reasons to support a better transport mechanism include:
- Better support for protocols, including server-first protocols (where the server sends data before the client)
- Better support for incrementally adopting Istio, especially when apps use their own TLS certificates
- Support for calling pod IPs directly, eliminating the workarounds to Istio mTLS/encapsulation that we see today
HBONE is made possible by enhancements to Envoy’s HTTP upgrade mechanism to support tunneling raw TCP over HTTP POST or HTTP CONNECT. The details of Envoy’s HTTP CONNECT upgrade can be reviewed here.
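A rough illustration may help here. The following sketch shows the logical shape of an HBONE tunnel request: an HTTP/2 CONNECT whose authority is the real destination pod, sent to the peer on port 15008. This is illustrative only, not a byte-accurate capture, and the pod IP is a placeholder.

```shell
# Illustrative only: the logical shape of an HBONE tunnel request. A real
# ztunnel sends this as an HTTP/2 CONNECT inside an mTLS session to the
# destination node's ztunnel on port 15008; applications never see it.
tee /tmp/hbone-sketch.txt <<'EOF'
:method: CONNECT
:scheme: https
:authority: 10.101.1.4:8080
(raw TCP bytes of the application connection are carried on this stream)
EOF
```

The key point is that the outer connection terminates at the peer ztunnel, while the `:authority` pseudo-header names the actual destination workload.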
Figure 1 – The traffic stream using HBONE on TCP Port 15008 between ztunnels and HBONE packet composition.
Today, a ztunnel is deployed as a pod on each node in the cluster using a DaemonSet, and it happens to be based on Envoy Proxy. Envoy was the most expedient option for the original release of ambient but, because of the need for a simpler and higher-performance Layer 4 component, the underlying implementation will change soon; more on this later. This is important because, as long as you have two endpoints that can be configured to initiate and terminate the overlay, mTLS can be established and requests can be forwarded on.
To better understand how this works in ambient mesh, let’s use the example from the workshop right here to look under the hood of ztunnel.
Because the output is too large to display here, you should run the following commands in that workshop suggested above:
- kubectl get pods -n istio-system
- kubectl -n istio-system describe pod ztunnel-wl482
The output shows all pods in the istio-system namespace, including the ztunnel pods. If you pick any one of them and run the second command against it, the container specification it describes has several key areas of interest:
- The image being used is the proxyv2 image, tuned for ambient mesh; it is present on both the init container and the main running container.
- The security context, which calls out the NET_ADMIN privilege used for the redirection of traffic.
- The specific X.509 certificate information for identity and encryption, as you would normally see in the Istio sidecar configuration.
- The token used to authorize asking Kubernetes to run through the CSR process for a given workload.
If you run kubectl -n istio-system logs -l app=ztunnel | grep '^\[' while making requests from one pod to another in ambient mesh, you will also see the output of the requests being made, flowing through the secure overlay. Below is the output of a request from a sleep pod to the web-api pod. You can see that the connection was successful from the 200 HTTP status code. Also, the SPIFFE identity is present for both the web-api and sleep pods at each end of the tunnel.
```
root@virtualmachine:~# kubectl -n istio-system logs -l app=ztunnel | grep '^\['
[2022-09-26T12:41:52.458Z] "CONNECT - HTTP/2" 200 - via_upstream - "-" 127 440 1 - "-" "-" "7271ea29-33df-46a0-b1a9-023a39efd3fb" "10.101.1.4:8080" "10.101.1.4:8080" virtual_inbound 10.101.2.3:42811 10.101.1.4:15008 10.101.2.3:41588 - - inbound hcm
[2022-09-26T12:41:52.456Z] "- - -" 0 - - - "-" 125 825 5 - "-" "-" "-" "-" "10.101.2.5:15008" outbound_tunnel_clus_spiffe://cluster.local/ns/default/sa/web-api 10.101.1.2:45910 10.101.2.5:8080 envoy://internal_client_address/ - - outbound tunnel
```
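Since the access-log fields are positional, a quick grep is enough to pull the workload identity out of an entry. As a sketch, the outbound log line above serves as sample input; against a live cluster you would pipe the kubectl logs command into the same grep.

```shell
# Extract the SPIFFE identity from a ztunnel access-log entry. On a live
# cluster, pipe `kubectl -n istio-system logs -l app=ztunnel` into this grep;
# here a fragment of the outbound log line from above is used as sample input.
line='outbound_tunnel_clus_spiffe://cluster.local/ns/default/sa/web-api 10.101.1.2:45910 10.101.2.5:8080'
echo "$line" | grep -o 'spiffe://[^" ]*'
# prints: spiffe://cluster.local/ns/default/sa/web-api
```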
Where does identity originate? Ztunnel can assume the identity of the pods currently running on the same node. As with sidecars, each service account has its own identity, and key/certificate pairs are signed for each service account via Certificate Signing Requests (CSRs) from ztunnel to the Istio control plane. By using its own service account, ztunnel can have istiod sign the CSR and return an X.509 certificate for the workload it impersonates. The other end of the secure overlay replicates that process for its own co-located workload, and this ensures end-to-end encryption. You can view the X.509 certificates managed by your ztunnel using the istioctl pc secret command, then base64-decode each of the certificates. As in the sidecar architecture, these X.509 certificates are automatically rotated well before expiration (every 12 hours by default in Istio) without you needing to do anything.
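As a sketch of that decode step, the pipeline below generates a throwaway self-signed certificate with openssl to stand in for one of the base64-encoded certificates printed by istioctl pc secret; the base64-decode-then-inspect pipeline is the same for the real thing.

```shell
# Stand-in for one base64-encoded certificate as printed by `istioctl pc secret`.
# (A throwaway self-signed cert; a real ztunnel cert carries the SPIFFE ID in a SAN.)
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo.key \
  -subj "/CN=demo-workload" -days 1 -out /tmp/demo.crt 2>/dev/null
b64=$(base64 < /tmp/demo.crt | tr -d '\n')

# The decode pipeline you would apply to each certificate from istioctl:
echo "$b64" | base64 -d | openssl x509 -noout -subject -enddate
```

For a real workload certificate, add `-ext subjectAltName` (or `-text`) to the final openssl command to see the spiffe:// identity.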
Let’s review a few architectures to better understand how the secure transport is initiated with ztunnel.
Normally, a service such as C1 would have a sidecar to establish a connection with S1. In this case, however, traffic originates from C1 and is tunneled through the ztunnel on the local node to the ztunnel on the remote node, which decapsulates the traffic and forwards it on to the destination, S1. We see this depicted below.
Figure 2. Ztunnel implementation as daemonset, receiving configuration from Istiod
More specifically, below, the HBONE overlay is established between the ztunnels on each node. mTLS is still maintained to ensure encryption, and it exists because each ztunnel has impersonated its co-located workload.
Figure 3. Ztunnel pods established two-way tunnel with mTLS enabled
For any workloads with a waypoint proxy deployed, a secure tunnel is established between the ztunnel and the waypoint proxy. We can see in figure 4, further below, that the ztunnel and waypoint proxy establish a tunnel to each other, and mTLS with identity is maintained end-to-end.
Figure 4. Ztunnel pods establish two-way tunnels to the waypoint proxy with mTLS enabled; istiod still sends configuration data to the ztunnels and the waypoint proxy
How are we enforcing L4 authorization policies?
We can enforce service-to-service network policy at the L4/secure overlay layer, from coarse rules like “deny all” to fine-grained service-to-service connectivity such as “Service A can talk to Service B but not Service C”. We cannot do anything that requires parsing the connection, like inspecting HTTP headers or JWT tokens, in the secure overlay layer.
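As a sketch of what such an L4 rule can look like (the service names, namespace, and port here are hypothetical), an Istio AuthorizationPolicy that allows only Service A’s identity to reach Service B relies purely on L4-evaluable fields such as source principals and destination ports:

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-service-a-to-b   # hypothetical example names
  namespace: default
spec:
  selector:
    matchLabels:
      app: service-b           # applies to Service B's workloads
  action: ALLOW
  rules:
  - from:
    - source:
        # SPIFFE-derived identity of Service A's service account
        principals: ["cluster.local/ns/default/sa/service-a"]
    to:
    - operation:
        ports: ["8080"]
```

Note that the source principal is exactly the SPIFFE-based identity carried in the mTLS certificates exchanged between ztunnels, which is what makes this enforceable at L4.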
Understanding ztunnel resiliency
Ztunnel is deployed as a DaemonSet, in the form of one pod per node in the cluster, including Kubernetes control-plane nodes. Because of the DaemonSet resource, a ztunnel going down implies that its pod has failed. There are a few reasons why this could occur:
- Underlying node has failed in the cluster
- The node has run out of CPU, memory, or disk resources
- The node has lost physical network connectivity
- The node is in another network and a firewall is blocking traffic between this node and other nodes in the cluster
In the case of a ztunnel failure, such as a pod failure, the impact is limited to the workloads on the node where the ztunnel failed.
In any of these conditions, Kubernetes will reconcile ztunnels to the best of its ability, provided the upstream physical issues have been resolved. This is no different from any other application or critical component going down and being reconciled by Kubernetes.
Optimizing ztunnel with eBPF and other approaches
Today, the Istio CNI uses iptables rules to direct traffic into a tunnel. The iptables rules for traffic redirection to ztunnel have an effect similar to what a sidecar does in a pod.
eBPF can optimize this by handling the packet processing that gets traffic into the tunnel, removing the need for iptables rules. Solo.io is in the process of open-sourcing this functionality to upstream Istio.
Also, as mentioned previously, ztunnel today leverages Envoy to establish the secure overlay between itself and other ztunnels, as well as waypoint proxies. Envoy’s xDS configuration relies on copying the entire config pipeline per workload, which is not expected to scale for larger Kubernetes clusters and makes xDS updates more costly. Open source Istio is also exploring ways to optimize the ztunnel architecture using a proxy that could be developed in Rust. See here for more details: https://github.com/istio/istio/issues/40956
Istio Ambient Mesh is a take on sidecar-less service mesh that balances various concerns to achieve better operations, cost, and performance. As we’ve seen in this blog, Istio ambient relies on layering and opting in to various layers as needed. The secure overlay layer implemented with the ztunnel component provides the backbone for mTLS, zero-trust properties, and overall secure communications. The ztunnel component will continue to evolve, and we encourage anyone interested in service mesh and application networking to get involved. If you’d like to be an early adopter with enterprise support (from the people who wrote the code!), please reach out to Solo.io