Envoy at Scale with Gloo Edge

Denis Jannot | April 2, 2021

Gloo Edge is our Kubernetes native API gateway based on Envoy.

It provides Authentication (OAuth, JWT, API keys, …), Authorization (OPA, custom, …), Web Application Firewall (based on ModSecurity), function discovery (OpenAPI based, Lambda, …), advanced transformations and much more.

One of the first questions our customers generally ask us is how many instances of Envoy (the gateway-proxy Pod in Gloo Edge) they need for their use case.

We generally answer that they need 2 instances for High Availability, but that it’s very rare for anyone to need more instances for performance reasons.

In this Blog post series, I’m going to perform benchmarks to provide more accurate data.

I’ll start in this Blog post by benchmarking Gloo Edge without any filter (basic HTTP requests).

In the next post, I’ll show the impact of enabling different security features (HTTPS, JWT, API keys, WAF, …).

Finally, I’ll write a Blog post about benchmarking WebAssembly (Wasm).

Test environment

To perform my tests, I decided to deploy KinD in a virtual machine on GCP.

I knew that CPU would be the bottleneck, so I decided to use an `n1-highcpu-96` instance type.

You can find more information about this instance type in the table below:

I’ve deployed a Kubernetes cluster on this VM and Gloo Edge on this cluster.

I decided to use only one Envoy instance and to scale it only if I wasn’t able to use all the CPUs.

I ran my first tests using the httpbin Docker image for the backend (Upstreams) and was getting some 5xx errors (between 0.1 and 0.2%). After some investigation, I found that the errors were caused by the httpbin Pods, and scaling the number of Pods didn’t have any effect.

So I switched to the echoenv Docker image that I had used in my Istio at Scale talk, because I knew this image was well optimized. After that, I didn’t get a single error in any of my tests.

I deployed echoenv as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: echoenv-deployment-1
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echoenv-1
  template:
    metadata:
      labels:
        app: echoenv-1
    spec:
      containers:
        - name: echoenv-1
          image: quay.io/simonkrenger/echoenv
          imagePullPolicy: IfNotPresent
          ports:
            - name: http-port
              containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: echoenv-service-1
  namespace: default
  labels:
    app: echoenv-1
spec:
  ports:
    - name: http-port
      port: 8080
      targetPort: http-port
      protocol: TCP
  selector:
    app: echoenv-1

I then scaled it to 20 replicas, even though I could probably have used fewer replicas without impacting the results.

To expose it using Gloo, I created the following VirtualService:

apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: default
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - "*"
    routes:
    - matchers:
      - prefix: /
      routeAction:
        single:
          upstream:
            name: default-echoenv-service-1-8080
            namespace: gloo-system

Finally, I needed a benchmarking tool and I opted for Apache JMeter. One could argue that there are more Kubernetes-native and sexy tools, but I knew this one would scale for sure.

I deployed a JMeter server and 8 JMeter clients (remote nodes) on my Kubernetes cluster.

Here is the test plan I used:

<?xml version="1.0" encoding="UTF-8"?>
<jmeterTestPlan version="1.2" properties="5.0" jmeter="5.3">
  <hashTree>
    <TestPlan guiclass="TestPlanGui" testclass="TestPlan" testname="Test Plan" enabled="true">
      <stringProp name="TestPlan.comments"></stringProp>
      <boolProp name="TestPlan.functional_mode">false</boolProp>
      <boolProp name="TestPlan.tearDown_on_shutdown">true</boolProp>
      <boolProp name="TestPlan.serialize_threadgroups">false</boolProp>
      <elementProp name="TestPlan.user_defined_variables" elementType="Arguments" guiclass="ArgumentsPanel" testclass="Arguments" testname="User Defined Variables" enabled="true">
        <collectionProp name="Arguments.arguments"/>
      </elementProp>
      <stringProp name="TestPlan.user_define_classpath"></stringProp>
    </TestPlan>
    <hashTree>
      <ThreadGroup guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="Thread Group" enabled="true">
        <stringProp name="ThreadGroup.on_sample_error">continue</stringProp>
        <elementProp name="ThreadGroup.main_controller" elementType="LoopController" guiclass="LoopControlPanel" testclass="LoopController" testname="Loop Controller" enabled="true">
          <boolProp name="LoopController.continue_forever">false</boolProp>
          <intProp name="LoopController.loops">-1</intProp>
        </elementProp>
        <stringProp name="ThreadGroup.num_threads">128</stringProp>
        <stringProp name="ThreadGroup.ramp_time">60</stringProp>
        <boolProp name="ThreadGroup.scheduler">false</boolProp>
        <stringProp name="ThreadGroup.duration"></stringProp>
        <stringProp name="ThreadGroup.delay"></stringProp>
        <boolProp name="ThreadGroup.same_user_on_next_iteration">true</boolProp>
      </ThreadGroup>
      <hashTree>
        <HTTPSamplerProxy guiclass="HttpTestSampleGui" testclass="HTTPSamplerProxy" testname="HTTP Request" enabled="true">
          <elementProp name="HTTPsampler.Arguments" elementType="Arguments" guiclass="HTTPArgumentsPanel" testclass="Arguments" testname="User Defined Variables" enabled="true">
            <collectionProp name="Arguments.arguments"/>
          </elementProp>
          <stringProp name="HTTPSampler.domain">gateway-proxy.gloo-system.svc.cluster.local</stringProp>
          <stringProp name="HTTPSampler.port"></stringProp>
          <stringProp name="HTTPSampler.protocol"></stringProp>
          <stringProp name="HTTPSampler.contentEncoding"></stringProp>
          <stringProp name="HTTPSampler.path">/</stringProp>
          <stringProp name="HTTPSampler.method">GET</stringProp>
          <boolProp name="HTTPSampler.follow_redirects">true</boolProp>
          <boolProp name="HTTPSampler.auto_redirects">false</boolProp>
          <boolProp name="HTTPSampler.use_keepalive">true</boolProp>
          <boolProp name="HTTPSampler.DO_MULTIPART_POST">false</boolProp>
          <stringProp name="HTTPSampler.embedded_url_re"></stringProp>
          <stringProp name="HTTPSampler.connect_timeout"></stringProp>
          <stringProp name="HTTPSampler.response_timeout"></stringProp>
        </HTTPSamplerProxy>
        <hashTree/>
      </hashTree>
    </hashTree>
  </hashTree>
</jmeterTestPlan>

As you can see, I’ve configured the test plan to send `GET` requests to `http://gateway-proxy.gloo-system.svc.cluster.local`.
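
To get a feel for the load this plan generates, here is a minimal sketch of the total concurrency. It assumes JMeter's distributed mode, in which every remote node runs the full test plan, so the 128 threads of the Thread Group are started on each of the 8 remote nodes. The function name is hypothetical.

```python
def total_threads(threads_per_node: int, remote_nodes: int) -> int:
    """Total concurrent JMeter threads when every remote node runs the whole plan."""
    return threads_per_node * remote_nodes


# Values taken from the test plan (ThreadGroup.num_threads) and the JMeter deployment.
THREADS_PER_NODE = 128
REMOTE_NODES = 8

concurrency = total_threads(THREADS_PER_NODE, REMOTE_NODES)
print(f"{concurrency} concurrent threads")  # prints "1024 concurrent threads"
```

Since the plan has no timer, each of these threads sends its next request as soon as the previous response arrives.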

1 billion requests in a little more than 3 hours!

The first test I performed was without setting any CPU or memory limits.

As you can see, the average throughput was higher than 85,000 RPS (Requests Per Second).

Note that I was originally using the `-l` option of the jmeter.sh script to store the test results on disk, but it slowed the test down after a few minutes. So I removed this option and relied on the stats generated by Envoy and stored in Prometheus instead.
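
Reading the request rate from Prometheus could look like the sketch below. It is not the exact query I used: the Prometheus address is hypothetical, and the metric name `envoy_http_downstream_rq_total` is an assumption about how Envoy's stats are exposed in your setup. It only relies on the Prometheus instant-query HTTP API (`/api/v1/query`).

```python
import json
import urllib.parse
import urllib.request

# Assumptions: adapt the address and the metric name to your own cluster.
PROMETHEUS_URL = "http://prometheus.example.svc:9090"  # hypothetical address
QUERY = "sum(rate(envoy_http_downstream_rq_total[1m]))"


def build_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus instant-query HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_scalar_result(payload: dict) -> float:
    """Extract the single value from an instant-query vector response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    result = payload["data"]["result"]
    if len(result) != 1:
        raise ValueError(f"expected exactly one series, got {len(result)}")
    # Each sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def current_rps(base_url: str = PROMETHEUS_URL, promql: str = QUERY) -> float:
    """Query Prometheus and return the current requests per second."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_scalar_result(json.load(resp))
```

Calling `current_rps()` from a machine that can reach Prometheus returns the rate averaged over the last minute, without adding any work on the JMeter side.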

First of all, this number is amazing. It means that Gloo Edge can handle more than 7 billion requests per day. And with a single instance!
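
The arithmetic behind these figures is simple enough to check. Here is a small sketch that uses 85,000 RPS as the sustained rate (the measured average was slightly higher); the function names are only illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def requests_per_day(rps: float) -> float:
    """Requests handled in 24 hours at a constant rate."""
    return rps * SECONDS_PER_DAY


def hours_to_reach(total_requests: float, rps: float) -> float:
    """Hours needed to serve total_requests at a constant rate."""
    return total_requests / rps / 3600


rps = 85_000
print(f"{requests_per_day(rps):,.0f} requests per day")
# prints "7,344,000,000 requests per day"
print(f"{hours_to_reach(1_000_000_000, rps):.2f} hours for 1 billion requests")
# prints "3.27 hours for 1 billion requests"
```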

The other useful finding is its ability to use most of the CPU available on the machine.

You can see that the average utilization of the 96 CPUs is higher than 85%.

And if I look at the Kubernetes metrics, I can see that Envoy is using half of it, while the rest is used by the JMeter components and the Upstreams.

I’ve also tried running 2 instances of the gateway-proxy (Envoy) Pod (which is the common deployment pattern) and I was getting the same results, with half of the requests processed by each instance.

What about standard environments?

Most of our customers are deploying Gloo Edge in Kubernetes clusters that are composed of smaller machines.

So, I decided to run the benchmark several times with different CPU limits set on the gateway-proxy Pod.
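
One possible way to apply such limits between runs is to patch the Deployment. The sketch below only generates the `kubectl patch` commands; it assumes that both the Deployment and its container are named `gateway-proxy`, and the CPU values are hypothetical examples, not the exact limits I tested.

```python
import json


def cpu_limit_patch(container: str, cpu: str) -> str:
    """Return a strategic merge patch that sets the CPU limit of one container."""
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        # Containers are merged by name, so only the limit changes.
                        {"name": container, "resources": {"limits": {"cpu": cpu}}}
                    ]
                }
            }
        }
    }
    return json.dumps(patch)


# Hypothetical CPU limits; the container name is an assumption.
for cpus in ("1", "2", "4", "8"):
    patch = cpu_limit_patch("gateway-proxy", cpus)
    print(f"kubectl -n gloo-system patch deployment gateway-proxy --patch '{patch}'")
```

Each patch triggers a rollout of the Pod, so the benchmark should only be restarted once the new Pod is ready.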

As you can see, it scales linearly, so it’s very easy to predict what performance you’ll get based on the number of CPUs you allocate to it.

I was able to get a little more than 2000 RPS per CPU, which means around 175 million requests a day!
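
Because the scaling is linear, a first sizing estimate is a one-line calculation. The sketch below assumes the roughly 2000 RPS per CPU measured here, which only holds for plain HTTP requests without any filter; the `headroom` parameter is a hypothetical safety margin, not something I measured.

```python
import math

RPS_PER_CPU = 2000  # measured in this benchmark (basic HTTP requests, no filter)


def expected_rps(cpus: int, rps_per_cpu: float = RPS_PER_CPU) -> float:
    """Throughput to expect for a given CPU limit, assuming linear scaling."""
    return cpus * rps_per_cpu


def cpus_needed(target_rps: float, rps_per_cpu: float = RPS_PER_CPU,
                headroom: float = 0.0) -> int:
    """CPUs to allocate for target_rps, with an optional fractional safety margin."""
    if target_rps < 0 or rps_per_cpu <= 0 or headroom < 0:
        raise ValueError("target_rps and headroom must be >= 0, rps_per_cpu > 0")
    return math.ceil(target_rps * (1 + headroom) / rps_per_cpu)


print(expected_rps(4))                    # prints 8000
print(cpus_needed(10_000))                # prints 5
print(cpus_needed(10_000, headroom=0.5))  # prints 8
```

Keep in mind that these numbers come from a single machine and a single test plan, so treat the result as a starting point and validate it with your own traffic.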

What about the memory usage?

I didn’t speak much about the quantity of memory used during my tests because it was really low. The gateway-proxy Pod never used more than half a GB of RAM.

Conclusion

Gloo Edge takes full advantage of Envoy to deliver incredible performance.

I didn’t have to tweak a single parameter of Gloo Edge to get these numbers.

And I didn’t get a single error during all these tests!

It’s really easy to size Gloo Edge according to your needs.

In my next Blog post, I’ll measure the impact of implementing different security features of Gloo Edge.
