Load Testing Istio Ingress Guided by Grafana

One of the most common questions we get from Solo.io customers is about the number of replicas and architecture needed to use the Istio Ingress Gateway at scale.

We’ve seen many use cases where the Istio Ingress Gateway is one of the most important components in Istio, because it handles the ingestion of traffic coming from outside the mesh. Given the importance of its mission, the Istio Ingress Gateway is designed to be solid and able to sustain high amounts of traffic out of the box.

In this post, we’ll work to find the limits of this component, stressing a single instance, so you can use the information to properly size your system and avoid unnecessary overallocation of costly resources.

The relevant parameters we are going to consider are CPU and memory consumption, response times, and response codes. We will also observe the receive/transmit bandwidth, and identify which resources matter most when we put Envoy to the test.

ℹ️ NOTE: The complete set of scripts and code used to write this blogpost can be found in this repo.

Setting Up the Istio Ingress Gateway and Grafana
Load Testing the Istio Ingress Gateway With K6
Visualizing the Istio Ingress Gateway Metrics in Grafana dashboards
Analyzing the Istio Ingress Gateway Performance Under Load
Conclusion

Setting Up the Istio Ingress Gateway and Grafana

To properly test the Istio Ingress Gateway under load and gain observability into its performance, we can use Grafana tools: specifically, the open-source K6 load testing tool from Grafana Labs and the pre-built Istio dashboards in Grafana.

We are going to configure a non-trivial setup of 1,000 unique DNS domains, each of them exposed with a TLS certificate and serving 100 different paths. If you are familiar with Envoy, this is about 150MB of Envoy config dump.

Setting Up the Environment

We will use a single EKS cluster, composed of different nodes:

6x t3.2xlarge (8cpu/32Gi) for general purpose (echoenv app, Prometheus, Grafana, Gloo, K6 operator)
1x c5.4xlarge (16cpu/32Gi) for the Istio Ingress Gateway (only deployment in that node)
6x t3.medium (2cpu/4Gi) to execute the tests (K6 runners)

First, deploy an Istio Ingress Gateway in the Kubernetes cluster. To do this, we are going to use the Istio Lifecycle Manager, but any other supported method is also fine.

Then, install the K6 operator (version 0.0.9 was used to write this blogpost) and Grafana with the Istio dashboards.

Configuring the Ingress Gateway

The Istio Ingress Gateway acts as a reverse proxy to route external traffic to services in the cluster. It can actually route traffic to other external services, but let’s keep it simple.

For testing, configure the gateway to route traffic to a sample app, in this case the echoenv image we have used in other blog posts. This can be done by creating an Istio VirtualService (VS) that routes traffic from the ingress gateway to the app.

We will be using a RouteTable to generate the necessary Istio resources, but you can create the Gateway and the VS directly as well:

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: virtualgateway-wrk-1-istio-gate-cf1809a3fba74fedbc2582a5c8f27ba
  namespace: istio-gateways
spec:
  selector:
    app: istio-ingressgateway
    istio: ingressgateway
    revision: 1-18
  servers:
  - hosts:
    - workspace-1-domain-1.com
    port:
      name: https-8443-workspace-1-domain-1-com
      number: 8443
      protocol: HTTPS
    tls:
      credentialName: wrk-1
      mode: SIMPLE
  [...]
  - hosts:
    - workspace-1-domain-100.com
    port:
      name: https-8443-workspace-1-domain-100-com
      number: 8443
      protocol: HTTPS
    tls:
      credentialName: wrk-1
      mode: SIMPLE
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: routetable-wrk-1-dom-1-istio-gateways-cluster1-gateways
  namespace: istio-gateways
spec:
  exportTo:
  - .
  gateways:
  - virtualgateway-wrk-1-istio-gate-cf1809a3fba74fedbc2582a5c8f27ba
  hosts:
  - workspace-1-domain-1.com
  http:
  - match:
    - sourceLabels:
        app: istio-ingressgateway
        istio: ingressgateway
        revision: 1-18
      uri:
        exact: /get/1
    [...]
    - sourceLabels:
        app: istio-ingressgateway
        istio: ingressgateway
        revision: 1-18
      uri:
        exact: /get/100
    name: echoenv-dom-1-echoenv.workspace-1.cluster1--wrk-1-dom-1.istio-gateways.cluster1
    rewrite:
      uri: /
    route:
    - destination:
        host: echoenv.workspace-1.svc.cluster.local
        port:
          number: 8000

The Istio Ingress Gateway is now ready to serve 1,000 different domains, each one of them with 100 unique paths configured, like https://workspace-10-domain-100.com/get/100.
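
To get a feel for the size of this configuration, the hostnames and paths can be enumerated in a few lines of plain JavaScript. This is only a sketch: it assumes 10 workspaces with 100 domains each, a split inferred from the example URL above rather than stated explicitly, and the helper names listHosts and routeUrl are hypothetical.

```javascript
// Assumed layout (inferred from the example URL, not confirmed by the post):
// 10 workspaces x 100 domains = 1,000 hosts, each serving 100 paths.
const WORKSPACES = 10;
const DOMAINS_PER_WORKSPACE = 100;
const PATHS_PER_DOMAIN = 100;

// Build every hostname in the form workspace-<w>-domain-<d>.com.
function listHosts() {
  const hosts = [];
  for (let w = 1; w <= WORKSPACES; w++) {
    for (let d = 1; d <= DOMAINS_PER_WORKSPACE; d++) {
      hosts.push(`workspace-${w}-domain-${d}.com`);
    }
  }
  return hosts;
}

// Build the full URL for one host and one path number.
function routeUrl(host, n) {
  return `https://${host}/get/${n}`;
}

const hosts = listHosts();
const totalRoutes = hosts.length * PATHS_PER_DOMAIN;
console.log(hosts.length, totalRoutes); // prints: 1000 100000
```

Under these assumptions the gateway holds 100,000 host and path combinations, which helps explain the size of the Envoy config dump mentioned earlier.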

Running the K6 Load Test

To be able to send a high amount of traffic to the Istio Ingress Gateway, we can’t rely on a single machine. Our test executor must be distributed.

The Grafana K6 operator was designed to address this issue, and it is really simple to use. After the initial installation, we just create a K6 resource:

apiVersion: k6.io/v1alpha1
kind: K6
metadata:
  name: k6-runner
  namespace: k6
spec:
  parallelism: 18 # num machines * 3
  script:
    configMap:
      name: k6-test
      file: k6-test-single.js
  separate: false
  arguments: -o experimental-prometheus-rw
  runner:
    image: loadimpact/k6
    env:
    - name: K6_PROMETHEUS_RW_SERVER_URL
      value: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/api/v1/write
    - name: K6_PROMETHEUS_RW_TREND_STATS
      value: p(90),p(95),max
    - name: K6_PROMETHEUS_RW_STALE_MARKERS
      value: "true"
    securityContext:
      runAsUser: 1000
      runAsGroup: 1000
      runAsNonRoot: true
      sysctls:
      - name: net.ipv4.ip_local_port_range
        value: "1024 65535"
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: eks.amazonaws.com/nodegroup
              operator: In
              values:
              - n2-4-t3-medium
    tolerations:
    - key: k6
      operator: Exists
      effect: NoSchedule
    resources:
      limits:
        cpu: 1100m
        memory: 8000Mi
      requests:
        cpu: 500m
        memory: 350Mi

Let’s also have a look at the actual K6 test, which is stored in a ConfigMap:

kubectl create configmap k6-test --from-file k6-test-single.js

You can see that the test is using one of the well-known K6 executors, so it is really just a few lines of code. In this case, we want to send all the traffic we can from 18 K6 distributed runners to a single pod of Istio Ingress Gateway:

import http from 'k6/http';
import { sleep, check } from 'k6';

export let options = {
  insecureSkipTLSVerify: true, // skip TLS certificate verification
  discardResponseBodies: true, // do not keep response bodies in memory

  scenarios: {
    one: {
      executor: 'constant-vus',
      vus: 2000,
      duration: '15m',
      exec: 'nofilters'
    },
  },
};

export function nofilters() {
  // Group all dynamic URLs under a single metric name
  const params = {
    tags: { name: 'singleMetricDynamicURL' },
  };
  // Pick a random path between /get/1 and /get/50
  const rnd2 = 1 + Math.floor(Math.random() * 50);
  const res = http.get(`https://workspace-1-domain-1.com/get/${rnd2}`, params); // no filters
  // Bucket the response by the hundreds digit of its status code
  check(res, {
    'is status 2xx': (r) => parseInt(r.status / 100) === 2,
    'is status 4xx': (r) => parseInt(r.status / 100) === 4,
    'is status 5xx': (r) => parseInt(r.status / 100) === 5,
    'is status else': (r) => parseInt(r.status / 100) !== 2 && parseInt(r.status / 100) !== 4 && parseInt(r.status / 100) !== 5,
  });
  //sleep(1);
}
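
The four checks in the script bucket each response by the hundreds digit of its status code. The same logic can be written once as a small helper, sketched here in plain JavaScript so that it runs outside K6. Note that statusClass is a hypothetical name and not part of the K6 API.

```javascript
// Map an HTTP status code to the bucket names used by the checks above.
function statusClass(status) {
  const hundreds = Math.floor(status / 100);
  if (hundreds === 2) return '2xx';
  if (hundreds === 4) return '4xx';
  if (hundreds === 5) return '5xx';
  return 'else'; // anything that is not 2xx, 4xx, or 5xx
}

console.log([200, 404, 503, 301].map(statusClass));
```

Inside a K6 script, each check could then compare statusClass(r.status) against the bucket name instead of repeating the division.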

Viewing Metrics in Grafana

While the load test is running, Grafana dashboards provide visibility into how the ingress gateway is performing under load. Dashboards show metrics like request rates, response times, error rates, and resource saturation. This data can be used to determine the maximum load the gateway can handle, as well as potential bottlenecks.

In the custom dashboard, we can see metrics coming from different sources:

Istio Ingress Gateway
- Total of requests made during the test
- RPS from different perspectives (K6, Istio IG, Destination)
- CPU/memory usage
- Request duration and response codes
Istio Control Plane dashboard: Not relevant for this blog post

Load Testing the Istio Ingress Gateway With K6

To properly test the performance and reliability of Istio’s Ingress Gateway, it’s important to simulate traffic at high volumes. We will run a script to limit the CPU that the Ingress Gateway can use, first using 1 CPU, then 2, 4, and finally 8. The memory will be a fixed value of 4Gi.
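
The four phases differ only in the CPU limit applied to the gateway. As an illustration, the patch bodies for those phases could be generated as shown below. This is a sketch under assumptions: the container name istio-proxy and the helper name resourcePatch are not taken from the post, and the actual script may set the limits differently.

```javascript
// CPU limits for the four phases and the fixed memory value described above.
const CPU_STEPS = [1, 2, 4, 8];
const MEMORY = '4Gi';

// Build a patch body for the gateway Deployment.
// Assumption: the gateway container is named "istio-proxy".
function resourcePatch(cpu) {
  return {
    spec: {
      template: {
        spec: {
          containers: [
            {
              name: 'istio-proxy',
              resources: { limits: { cpu: String(cpu), memory: MEMORY } },
            },
          ],
        },
      },
    },
  };
}

const patches = CPU_STEPS.map(resourcePatch);
console.log(JSON.stringify(patches[0]));
```

Each object could be serialized with JSON.stringify and applied with kubectl patch before the corresponding 15-minute phase starts.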

This simple scenario sends GET requests to the /get/n endpoint of the Ingress Gateway, which rewrites each request to / before sending it to the upstream server. Once the K6 operator is in action, you will see that it creates 18 pods, each of them executing a portion of the total requests to be sent.

Using this approach, you can use many cheap spot machines for your tests, for a virtually unlimited throughput:

containers:
- name: k6
  image: loadimpact/k6
  command:
  - k6
  - run
  - '--quiet'
  - '--execution-segment=10/18:11/18'
  - >-
    --execution-segment-sequence=0,1/18,2/18,3/18,4/18,5/18,6/18,7/18,8/18,9/18,10/18,11/18,12/18,13/18,14/18,15/18,16/18,17/18,1
  - '-o'
  - experimental-prometheus-rw
  - /test/k6-test-single.js
  - '--address=0.0.0.0:6565'
  - '--paused'
  - '--tag'
  - instance_id=11
  - '--tag'
  - job_name=k6-runner-11
  ports:
  - containerPort: 6565
    protocol: TCP
  env:
  - name: K6_PROMETHEUS_RW_SERVER_URL
    value: >-
      http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/api/v1/write
  - name: K6_PROMETHEUS_RW_TREND_STATS
    value: p(90),p(95),max
  - name: K6_PROMETHEUS_RW_STALE_MARKERS
    value: 'true'
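
The execution segment flags in the pod spec above follow a simple pattern: runner 11 of 18 gets the slice from 10/18 to 11/18 of the total work. The sketch below reproduces that pattern in plain JavaScript. The function name segmentArgs is hypothetical, and the way the first and last segments are formatted (using 0 and 1 as endpoints) is an assumption based on the sequence shown above, not on the operator's source code.

```javascript
// Compute the segment flags for runner `instance` (1-based) out of `parallelism`.
function segmentArgs(instance, parallelism) {
  // Sequence boundaries: 0, 1/N, 2/N, ..., (N-1)/N, 1
  const points = ['0'];
  for (let i = 1; i < parallelism; i++) {
    points.push(`${i}/${parallelism}`);
  }
  points.push('1');
  return {
    segment: `--execution-segment=${points[instance - 1]}:${points[instance]}`,
    sequence: `--execution-segment-sequence=${points.join(',')}`,
  };
}

const args = segmentArgs(11, 18);
console.log(args.segment); // prints: --execution-segment=10/18:11/18
console.log(args.sequence);
```

With 2,000 virtual users split across 18 segments, each runner handles roughly 111 of them.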

Another benefit of running this many tester pods is that they produce metrics, which you can ship to an external Prometheus, as we are doing here.

It is also possible to ship them to an OpenTelemetry collector, where they can be manipulated before being sent to their final destination, which is usually a Prometheus instance.

Visualizing the Istio Ingress Gateway Metrics in Grafana dashboards

After some time, you can have a very nice overview of what is happening in the system:

Based on data from running Envoy at scale, Envoy can comfortably sustain a little more than 2,000 RPS per CPU, so that was our expectation.

The version used for this test is 1.18.0, curated by Solo.io. It is essentially a hardened version of upstream Istio, with some enterprise filters that we won’t use for this test.

You can find the image in the workshops repository:

us-docker.pkg.dev/gloo-mesh/istio-workshops/proxyv2:1.18.0-solo

Analyzing the Istio Ingress Gateway Performance Under Load

To analyze the performance of the Istio Ingress Gateway under load, let’s observe the final snapshot of our 4-phase test:

We were able to send a few million requests to the gateway in 4 intervals of 15 minutes. The number we see in the picture belongs to the last 15-minute phase.

Scenario	CPU usage (cores)	Memory usage (Gi)	Receive bandwidth (MB/s)	Transmit bandwidth (MB/s)	Requests per second (avg)
scn-1	1	1.33	13.3	16.4	4088
scn-2	2	1.35	25.6	31.8	7986
scn-3	3.99	1.43	46.4	57.2	14483
scn-4	7.98	1.47	83.1	101	25630
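
Dividing the average RPS by the CPU usage in each row shows how throughput scales per core. The snippet below is a small worked calculation over the values in the table; the names results and rpsPerCore are chosen only for illustration.

```javascript
// Values copied from the table above.
const results = [
  { scenario: 'scn-1', cpu: 1, rps: 4088 },
  { scenario: 'scn-2', cpu: 2, rps: 7986 },
  { scenario: 'scn-3', cpu: 3.99, rps: 14483 },
  { scenario: 'scn-4', cpu: 7.98, rps: 25630 },
];

// Average requests per second handled by each CPU core in use.
function rpsPerCore(r) {
  return Math.round(r.rps / r.cpu);
}

for (const r of results) {
  console.log(r.scenario, rpsPerCore(r));
}
// prints:
// scn-1 4088
// scn-2 3993
// scn-3 3630
// scn-4 3212
```

Throughput per core drops from about 4,100 RPS with one core to about 3,200 RPS with eight, so scaling is close to linear but not perfectly so.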

Request Duration and Bandwidth

In the two left panels, we observe the request duration at the 50th, 90th, and 99th percentiles. In this context, the client is the Istio Ingress Gateway, and the server is the application.

In the two right panels, we can see the network traffic received and sent by the gateway: 83.1 MB/s received and 101 MB/s transmitted in the yellow block.

This is really useful information. Using the gateway at scale is only possible if we can provide the network capacity needed for all the services exposed through the gateway, as well as for the traffic coming from downstream clients.
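
To relate these bandwidth figures to network capacity, they can be converted to Mbit/s and to an average number of bytes per request. The sketch below assumes that 1 MB means 1,000,000 bytes, since the unit definition used by the dashboard is not stated, and it uses the 8-CPU scenario values; the function names are hypothetical.

```javascript
// Convert MB/s to Mbit/s (assumes 1 MB = 1,000,000 bytes).
function mbPerSecToMbit(mbPerSec) {
  return mbPerSec * 8;
}

// Average bytes moved per request at a given bandwidth and request rate.
function bytesPerRequest(mbPerSec, rps) {
  return Math.round((mbPerSec * 1e6) / rps);
}

// Values from the 8-CPU scenario.
const receive = 83.1;
const transmit = 101;
const rps = 25630;

console.log(mbPerSecToMbit(transmit)); // prints: 808
console.log(bytesPerRequest(receive, rps), bytesPerRequest(transmit, rps)); // prints: 3242 3941
```

Under that assumption, a single replica at 8 CPUs transmits roughly 0.8 Gbit/s, a figure worth comparing against the network limits of the node hosting the gateway.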

ℹ️ NOTE: Remember that in production, you will want to use many replicas of the Istio Ingress Gateway, so the input/output network restrictions will be divided among them. The goal of this blog post is to see how well a single replica can perform.

Resource Usage

One of the well-known limiting factors of Envoy is CPU: the more CPU you give it, the more RPS you can get. In each of the 15-minute phases, Envoy uses all the CPU available to it and runs close to 100% utilization, so we can confirm that the Istio Ingress Gateway does not waste the CPU it is allocated.

From a memory perspective, we can observe that it is fairly stable under 1.5Gi. The memory is used to store Envoy configuration (in our case 1,000 different domains), so once it is started you can expect it to be stable even under a heavy load.

During our load testing, not a single error was observed. This metric comes from the K6 executors, that is, from the producers of the initial requests.

Requests Per Second and Response Codes

This is the most interesting result. As mentioned, we expected around 2k RPS per CPU, but the actual numbers were better than that:

~4k RPS using 1 CPU = 353 million requests per day
~8k RPS using 2 CPUs = 690 million requests per day
~14k RPS using 4 CPUs = 1.25 billion requests per day
~26k RPS using 8 CPUs = 2.21 billion requests per day
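
The daily totals follow from multiplying the measured average RPS by the 86,400 seconds in a day. A short worked calculation, with names chosen only for illustration:

```javascript
const SECONDS_PER_DAY = 24 * 60 * 60; // 86,400

// Requests served in one day at a constant request rate.
function requestsPerDay(rps) {
  return rps * SECONDS_PER_DAY;
}

// Measured average RPS per CPU limit, from the results table.
const measured = { 1: 4088, 2: 7986, 4: 14483, 8: 25630 };

for (const [cpu, rps] of Object.entries(measured)) {
  const millions = (requestsPerDay(rps) / 1e6).toFixed(0);
  console.log(`${cpu} CPU: ${millions} million requests per day`);
}
```

These totals assume the gateway sustains the measured average rate for a full 24 hours, which real traffic patterns rarely do.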

Conclusion

In summary, by combining the Grafana K6 load testing tool with Grafana dashboards, developers can easily analyze Istio Ingress Gateway performance under high loads. K6 generates large volumes of traffic to stress-test the gateway, while the Grafana dashboards provide detailed insights into metrics like request rates, response times, and error rates.

Using this approach, Istio operators can identify bottlenecks, ensure high availability, and optimize configurations in order to handle increased traffic demands as adoption of the service mesh grows within an organization. With the ability to simulate real user scenarios at scale, developers gain valuable insights to build and operate a robust Istio infrastructure.

A single replica of the Istio Ingress Gateway is good enough for many production-grade applications, and combining it with automated horizontal scaling of the gateway makes it a really solid solution for most applications.
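
As a rough illustration of how the measured numbers could feed a sizing decision, the sketch below estimates a replica count from a target request rate. The headroom value of 0.3 is an arbitrary example and not a recommendation from this post, and replicasNeeded is a hypothetical helper.

```javascript
// Estimate how many gateway replicas a target load needs.
// targetRps: expected peak request rate
// perReplicaRps: measured capacity of one replica (for example, from the table above)
// headroom: fraction of capacity kept free (0.3 is an arbitrary example value)
function replicasNeeded(targetRps, perReplicaRps, headroom = 0.3) {
  const usable = perReplicaRps * (1 - headroom);
  return Math.max(1, Math.ceil(targetRps / usable));
}

// Example: 50,000 RPS served by 4-CPU replicas measured at 14,483 RPS each.
console.log(replicasNeeded(50000, 14483)); // prints: 5
```

Real sizing should also account for the differences between this synthetic test and production traffic, such as payload sizes and any filters enabled on the gateway.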

However, to have high availability in addition to good performance, a multicluster approach is recommended, as a cluster-wide failure can do a lot of harm to your business if you rely on a single cluster.

To learn how you can do this, check out this step by step video:
