Envoy at Scale with Gloo Edge

Gloo Edge is our Kubernetes-native API gateway based on the open source Envoy Proxy. It provides authentication (using OAuth, API keys, and JWT), authorization (with OPA or custom approaches), a web application firewall (WAF, based on ModSecurity), function discovery (OpenAPI-based and AWS Lambda), and advanced transformations.

One of the first questions our customers generally ask us is how many instances of Envoy (the gateway-proxy Pod in Gloo Edge) they need for their use case. We generally answer that they need two instances to get high availability, and that it is very rare for anyone to need more instances for performance reasons. In this blog post series, we’re going to perform benchmarks to provide more accurate data.

We’ll start in this blog post by benchmarking Gloo Edge without any filters, using only basic HTTP requests. In the next post, we’ll show the impact of enabling different security features like HTTPS, JSON Web Tokens (JWT), API keys, and a WAF. Finally, we’ll write a blog post about benchmarking WebAssembly (Wasm).

Test environment

To perform our tests, we will deploy KinD in a virtual machine on Google Cloud Platform (GCP). We know that the CPU should be the bottleneck, so we will use an `n1-highcpu-96` instance type (96 vCPUs and 86.4 GB of memory).
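For reference, creating a VM like this on GCP looks something like the command below (the instance name, image, and zone are illustrative choices):

# Create the benchmark VM (name, image, and zone are examples)
gcloud compute instances create gloo-edge-bench \
  --machine-type=n1-highcpu-96 \
  --image-family=ubuntu-2004-lts \
  --image-project=ubuntu-os-cloud \
  --zone=us-central1-a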

First, let’s deploy a Kubernetes cluster on this virtual machine (VM) and Gloo Edge on this cluster. We will start with only one Envoy instance and scale it only if we aren’t able to use all the CPUs. We did our first tests using the httpbin Docker image for the backend (Upstreams), but we were getting some 5xx errors (between 0.1% and 0.2%), and after some investigation we found that the errors were caused by the httpbin Pods. Scaling the number of Pods didn’t have any effect. So, we switched to the echoenv Docker image that we used in the Istio at Scale talk, because we know this image is well optimized. After that, we didn’t see a single error during all our tests.
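If you want to reproduce the setup, creating the cluster and installing Gloo Edge boils down to something like this (assuming kind and glooctl are already installed on the VM; installing with Helm is an equivalent option):

# Create a single-node Kubernetes cluster with KinD
kind create cluster --name gloo-edge-bench

# Install the open source Gloo Edge gateway into the gloo-system namespace
glooctl install gateway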

We can deploy echoenv as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: echoenv-deployment-1
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echoenv-1
  template:
    metadata:
      labels:
        app: echoenv-1
    spec:
      containers:
        - name: echoenv-1
          image: quay.io/simonkrenger/echoenv
          imagePullPolicy: IfNotPresent
          ports:
            - name: http-port
              containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: echoenv-service-1
  namespace: default
  labels:
    app: echoenv-1
spec:
  ports:
    - name: http-port
      port: 8080
      targetPort: http-port
      protocol: TCP
  selector:
    app: echoenv-1

And we can scale it to 20 replicas, as shown below, even though we could probably use fewer replicas without impacting the results.
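Scaling the Deployment defined above is a single command:

kubectl scale deployment echoenv-deployment-1 --replicas=20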

To expose it using Gloo Edge, we will create the following VirtualService:

apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: default
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - "*"
    routes:
    - matchers:
      - prefix: /
      routeAction:
        single:
          upstream:
            name: default-echoenv-service-1-8080
            namespace: gloo-system
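
After applying the VirtualService, we can quickly check that the route works by port-forwarding to the gateway-proxy Pod (8080 is its default HTTP listener port) and sending a request through Envoy; the file name below is just an example:

# Apply the VirtualService, then send a test request through Envoy
kubectl apply -f virtualservice.yaml
kubectl -n gloo-system port-forward deploy/gateway-proxy 8080:8080 &
curl -i http://localhost:8080/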

Finally, we need a benchmarking tool, so let’s use Apache JMeter. One could argue that there are more Kubernetes-native tools, but we know this one will scale to meet our needs. We will deploy a JMeter server and 8 JMeter clients (remote nodes) on our Kubernetes cluster. Here is the test plan:

<?xml version="1.0" encoding="UTF-8"?>
<jmeterTestPlan version="1.2" properties="5.0" jmeter="5.3">
  <hashTree>
    <TestPlan guiclass="TestPlanGui" testclass="TestPlan" testname="Test Plan" enabled="true">
      <stringProp name="TestPlan.comments"></stringProp>
      <boolProp name="TestPlan.functional_mode">false</boolProp>
      <boolProp name="TestPlan.tearDown_on_shutdown">true</boolProp>
      <boolProp name="TestPlan.serialize_threadgroups">false</boolProp>
      <elementProp name="TestPlan.user_defined_variables" elementType="Arguments" guiclass="ArgumentsPanel" testclass="Arguments" testname="User Defined Variables" enabled="true">
        <collectionProp name="Arguments.arguments"/>
      </elementProp>
      <stringProp name="TestPlan.user_define_classpath"></stringProp>
    </TestPlan>
    <hashTree>
      <ThreadGroup guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="Thread Group" enabled="true">
        <stringProp name="ThreadGroup.on_sample_error">continue</stringProp>
        <elementProp name="ThreadGroup.main_controller" elementType="LoopController" guiclass="LoopControlPanel" testclass="LoopController" testname="Loop Controller" enabled="true">
          <boolProp name="LoopController.continue_forever">false</boolProp>
          <intProp name="LoopController.loops">-1</intProp>
        </elementProp>
        <stringProp name="ThreadGroup.num_threads">128</stringProp>
        <stringProp name="ThreadGroup.ramp_time">60</stringProp>
        <boolProp name="ThreadGroup.scheduler">false</boolProp>
        <stringProp name="ThreadGroup.duration"></stringProp>
        <stringProp name="ThreadGroup.delay"></stringProp>
        <boolProp name="ThreadGroup.same_user_on_next_iteration">true</boolProp>
      </ThreadGroup>
      <hashTree>
        <HTTPSamplerProxy guiclass="HttpTestSampleGui" testclass="HTTPSamplerProxy" testname="HTTP Request" enabled="true">
          <elementProp name="HTTPsampler.Arguments" elementType="Arguments" guiclass="HTTPArgumentsPanel" testclass="Arguments" testname="User Defined Variables" enabled="true">
            <collectionProp name="Arguments.arguments"/>
          </elementProp>
          <stringProp name="HTTPSampler.domain">gateway-proxy.gloo-system.svc.cluster.local</stringProp>
          <stringProp name="HTTPSampler.port"></stringProp>
          <stringProp name="HTTPSampler.protocol"></stringProp>
          <stringProp name="HTTPSampler.contentEncoding"></stringProp>
          <stringProp name="HTTPSampler.path">/</stringProp>
          <stringProp name="HTTPSampler.method">GET</stringProp>
          <boolProp name="HTTPSampler.follow_redirects">true</boolProp>
          <boolProp name="HTTPSampler.auto_redirects">false</boolProp>
          <boolProp name="HTTPSampler.use_keepalive">true</boolProp>
          <boolProp name="HTTPSampler.DO_MULTIPART_POST">false</boolProp>
          <stringProp name="HTTPSampler.embedded_url_re"></stringProp>
          <stringProp name="HTTPSampler.connect_timeout"></stringProp>
          <stringProp name="HTTPSampler.response_timeout"></stringProp>
        </HTTPSamplerProxy>
        <hashTree/>
      </hashTree>
    </hashTree>
  </hashTree>
</jmeterTestPlan>

As you can see, we’ve configured the test plan to send `GET` requests to `http://gateway-proxy.gloo-system.svc.cluster.local`.
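To drive the load from all the remote nodes at once, the test plan is launched in non-GUI mode from the JMeter server, pointing at the 8 remote JMeter Pods (the hostnames below are illustrative and depend on how the JMeter Pods are exposed in the cluster):

# Run the test plan in non-GUI mode against the 8 remote JMeter nodes
jmeter -n -t test-plan.jmx \
  -R jmeter-node-0,jmeter-node-1,jmeter-node-2,jmeter-node-3,jmeter-node-4,jmeter-node-5,jmeter-node-6,jmeter-node-7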

1 billion requests in about three hours!

The first test we perform is without setting any CPU or memory limits.

As you can see, the average throughput is more than 85,000 requests per second (RPS). Note that we originally used the `-l` option of the jmeter.sh script to store information about the test, but it was slowing down the test after a few minutes. So we removed this option and used the stats generated by Envoy and stored in Prometheus.
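If you want to look at the raw counters behind those stats, Envoy exposes them in Prometheus format on its admin interface. A quick sketch, assuming the admin interface of the gateway-proxy Pod is on port 19000 (the Gloo Edge default):

# Scrape the Prometheus-format stats directly from the Envoy admin endpoint
kubectl -n gloo-system port-forward deploy/gateway-proxy 19000:19000 &
curl -s http://localhost:19000/stats/prometheus | grep envoy_http_downstream_rq_total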

First of all, this number is amazing. It means that Gloo Edge can handle more than 7 billion requests per day. And with only a single instance! The other useful takeaway is its ability to use most of the CPU available on the machine.

You can see that the average utilization of the 96 CPUs is more than 85%. And if we look at the Kubernetes metrics, we can see that Envoy is using half of the CPU, while the rest is used by the JMeter components and the Upstreams. We also tried running two instances of the gateway-proxy (Envoy) Pod (which is the common deployment pattern) and saw the same results, with half of the requests processed by each instance.
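For reference, running two Envoy instances is just a matter of scaling the gateway-proxy Deployment:

kubectl -n gloo-system scale deployment gateway-proxy --replicas=2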

What about standard environments?

Most of our customers deploy Gloo Edge in Kubernetes clusters that are composed of smaller machines. So, let’s run the benchmark several times with different CPU limits set on the gateway-proxy Pod.
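There are several ways to set these limits; in practice you would usually do it through the Gloo Edge Helm values, but a quick sketch with a direct patch looks like this (assuming the gateway-proxy container is the first container in the Pod spec, and using an 8-CPU limit as an example):

# Set CPU requests/limits on the gateway-proxy container (first container in the Pod spec)
kubectl -n gloo-system patch deployment gateway-proxy --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/resources", "value": {"requests": {"cpu": "8"}, "limits": {"cpu": "8"}}}]'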

As you can see, resource usage scales linearly, so it’s very easy to predict the performance you’ll get based on the number of CPUs you allocate. We can get a little more than 2,000 RPS per CPU, which, over the 86,400 seconds in a day, means around 175 million requests per day for each CPU allocated to Envoy!

What about memory usage? We aren’t concerned about the quantity of memory needed for our tests because it isn’t significant: the gateway-proxy Pod never consumed more than half a gigabyte of RAM.

Conclusion

Gloo Edge takes full advantage of Envoy to deliver incredible performance. We didn’t have to tweak a single parameter of Gloo Edge to get these numbers, and we didn’t get a single error during all these tests! It’s really easy to size Gloo Edge according to your needs.

In our next blog post of the series, we’ll measure the impact of implementing different security features of Gloo Edge.