Performance Tuning for ExtAuth using OPA
Gloo Edge Enterprise provides the ability to create authorization policies for your workloads using powerful scripting tools like Open Policy Agent (OPA) as an important piece of your Zero-Trust framework. Aligned with the external authorization (extauth) features of Gloo Edge, this gives us the ability to scale OPA execution independently of the Gloo Edge gateway. In this blog, we will take a look at how to fine-tune that execution using a benchmarking and load test tool such as k6.
Getting started with ExtAuth using OPA and Gloo Edge
To begin our journey, we should take a benchmark of extauth against the Petclinic application to get a sense of the out-of-the-box performance characteristics. This blog assumes that you have already installed Gloo Edge Enterprise, but if you have not then just follow the steps at https://docs.solo.io/gloo-edge/latest/installation/enterprise/ using the Helm installation method. This is important since we will later use Helm templates to supply values to make reconfiguring our cluster easier.
We also need a standard size for our cluster. For this benchmark we used the following characteristics.
- Four worker nodes x 8 cores each (8 GB mem per vCPU)
- Two replicas of gateway-proxy
- Four replicas of extauth
To make sure we have the correct number of replicas for gateway-proxy and extauth, we will simply scale each of the deployments for now.
$> kubectl scale --replicas=2 deployment/gateway-proxy -n gloo-system
$> kubectl scale --replicas=4 deployment/extauth -n gloo-system
Next, follow the steps at https://docs.solo.io/gloo-edge/latest/guides/security/auth/extauth/opa/#validate-jwts-with-open-policy-agent to deploy the Petclinic application, Dex server, AuthConfig, and the policy to check the JSON web token (JWT). Make sure to test this configuration, because our load test depends on a successful response from the service.
Using the k6 load test tool
Follow the instructions at https://k6.io/docs/getting-started/installation.
Next, we will grab the Cookie header from the browser of our Petclinic application and put this in our simple k6 script as shown below. If you are not familiar with how to grab the Cookie header, simply right-click in your browser window. There should be an option for “Inspect” or “Developer Tools” for debugging purposes. Open the Network tab in the inspector and refresh your browser at http://localhost:8080. Then copy the Cookie header value from the request to localhost.
We will create a simple script like the one below. Make sure that the captured id_token and access_token both appear as key/value pairs inside the cookies object of this script.
import http from 'k6/http';
import { sleep } from 'k6';

export default function () {
  http.get('http://localhost:8080', {
    cookies: {
      id_token: 'eyJhbGciOiJSUzI1NiIsImtpZCI6IjMwYWFiNTY5MmFlNGNiYzEyODA3NjNhOWIzYWRjODdhMmE2YmNlZjcifQ.eyJpc3MiOiJodHRwOi8vZGV4Lmdsb28tc3lzdGVtLnN2Yy5jbHVzdGVyLmxvY2FsOjMyMDAwIiwic3ViIjoiQ2lVeE1qTTBOVFkzT0RrdFpHSTRPQzAwWWpjekxUa3dZVEF0TTJOa01UWTJNV1kxTkRZMkVnVnNiMk5oYkEiLCJhdWQiOiJnbG9vIiwiZXhwIjoxNjI1MDY2MzI0LCJpYXQiOjE2MjQ5Nzk5MjQsImF0X2hhc2giOiI2LTVHVTJaSlJrUjRwSmxqUEc3OXh3IiwiZW1haWwiOiJ1c2VyQGV4YW1wbGUuY29tIiwiZW1haWxfdmVyaWZpZWQiOnRydWV9.B2jwsEhatjfI3B9FMhmtU5dL0S2A1huqNOgpupEbIIA9Hh8XOmHXcBpYa-9VswXLZknVnOvPw4bPVUwGe6g3tIXUHGypno6FWS72LQPs8hJrNzwKiXGRl0umGR7FPgtJ2sA9Y2b0d3cGWJV9tGsf51QfInWtIVyYa0nS7vcKrvJEBy2FG8S6cFhwWrRiO2Lo1aoD7_ubjnN68EtOA_6JC8J2igK9xLd_oqSZpZu07lE4vzxsTs7HTPCMSJ07TaiBftdOCpzpL1pQuQNusHlbpYC7WnVYe03a1TyNUROXM7sJFMZtnR2OMGtLMiAipJhYDxdXqGLxaPuJNxjEd0ILqA',
      access_token: 'eyJhbGciOiJSUzI1NiIsImtpZCI6IjMwYWFiNTY5MmFlNGNiYzEyODA3NjNhOWIzYWRjODdhMmE2YmNlZjcifQ.eyJpc3MiOiJodHRwOi8vZGV4Lmdsb28tc3lzdGVtLnN2Yy5jbHVzdGVyLmxvY2FsOjMyMDAwIiwic3ViIjoiQ2lVeE1qTTBOVFkzT0RrdFpHSTRPQzAwWWpjekxUa3dZVEF0TTJOa01UWTJNV1kxTkRZMkVnVnNiMk5oYkEiLCJhdWQiOiJnbG9vIiwiZXhwIjoxNjI1MDY2MzI0LCJpYXQiOjE2MjQ5Nzk5MjQsImF0X2hhc2giOiJJTGxOaU56Sy1zRG5JdFhtWkQwYWNBIiwiZW1haWwiOiJ1c2VyQGV4YW1wbGUuY29tIiwiZW1haWxfdmVyaWZpZWQiOnRydWV9.c1j3DA_GlyJM-e6xkNYVRsHaEps3oToi4eBYdIQLGrWl8TDrdHe0WdDQnA7xTzEWgAvJ6BYUgzxyC5bNy319um3Uaz8sjCio4dIq_p4Cj9JQd1VKslrY9PtaxBPvEOpkA3ScpVsNdyryzHOXjUwENAaLI_Ony41RAbf3X2hBuNnoZ0_C96XY-3x2XApBNqjT3Z3FB6O3B6hHQKMnbOvKb868JJM69Bvtguju9VJefYLS3aCzGARCsMrj7V-WsWZFyujEZc8MlYeLD5kg3oE59oQl-Cp-OKAXk9iwNTa0xBST6oEYy3AJngSDwqW5bCzAGivKYbG0xQDDQpFKrYABQQ'
    },
  });

  sleep(1);
}
Run it once just to make sure everything looks good.
The access token we got back from Dex should have a full day before it expires.
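If you want to confirm how long a captured token remains valid before starting a long run, the exp claim can be read directly from the token payload. The following is a minimal sketch for Node.js, not part of the k6 script; the helper names decodeJwtPayload and secondsUntilExpiry are hypothetical, and the code only decodes the payload without verifying the signature.

```javascript
// Decode the payload (second dot-separated segment) of a JWT.
// Note: this does NOT verify the signature; it only reads the claims.
function decodeJwtPayload(token) {
  const segment = token.split('.')[1];
  // JWTs use the base64url alphabet; map it to standard base64 before decoding.
  const base64 = segment.replace(/-/g, '+').replace(/_/g, '/');
  return JSON.parse(Buffer.from(base64, 'base64').toString('utf8'));
}

// Seconds remaining until the token's `exp` claim, relative to `nowSeconds`.
function secondsUntilExpiry(token, nowSeconds = Math.floor(Date.now() / 1000)) {
  return decodeJwtPayload(token).exp - nowSeconds;
}
```

Calling secondsUntilExpiry with the access_token value copied from the browser returns a negative number once the token has expired, which is a signal to capture a fresh Cookie header. For the sample tokens in the script above, the decoded exp and iat claims are 86400 seconds apart, which matches the one-day lifetime.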
If all looks good, we are ready to run our tests. Before we take a baseline, let’s add some options to the test to put the system under load. We would like to see how the system behaves under fairly significant load and we are going to use short load tests of five minutes. So, let’s add options directly to the test script.
import http from 'k6/http';
import { sleep } from 'k6';

export let options = {
  stages: [
    { duration: '1m', target: 1000 }, // ramp up for 1 minute to 1000 users
    { duration: '5m', target: 1000 }, // stay at 1000 users for 5 minutes
    { duration: '1m', target: 0 }, // ramp down to 0 users
  ],
}
...
k6 treats each target number as a count of simultaneous connections, which it calls virtual users (VUs).
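k6 can also judge a run against pass/fail limits through its thresholds option, which is handy when comparing several tuning runs. Here is a minimal sketch, assuming illustrative limits of under 1% failed requests and a P95 below 20ms; neither limit comes from this benchmark, and the http_req_failed metric requires a reasonably recent k6 release.

```javascript
// Illustrative k6 thresholds (assumed values, not taken from the benchmark).
// In a k6 script this object would be added to the exported options:
//   export let options = { stages: [ ... ], thresholds };
const thresholds = {
  http_req_failed: ['rate<0.01'],  // fewer than 1% of requests may fail
  http_req_duration: ['p(95)<20'], // 95th percentile latency below 20 ms
};
```

With thresholds in place, k6 marks the run as failed and exits with a non-zero code when a limit is crossed.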
Monitoring Extauth with OPA Performance
New with Gloo Edge Enterprise 1.8 is an Extauth Dashboard that’s useful for measuring latency. We have added some measurements to that dashboard for this blog post. If you are looking to replicate this in your environment, take a look at our previous blog on customizing Grafana for Envoy Metrics.
Let’s run the k6 script to get a baseline.
The system can’t keep up currently, so we will definitely need to think about some strategies to increase performance. First, let’s look at how much extauth contributed to this result.
Most everything looks fine but latency numbers are very high. You can also see that load may not be distributed evenly as the CPU usage is not even across the extauth instances.
Strategy #1: Increase performance of extauth
One good approach to increasing performance of a system under load is to address the code. In this scenario our code is pretty simple rego so there’s not much optimization we can do there. However, we can do something about how Go runs and one easy tweak we can make is slowing down the garbage collector.
By default (GOGC=100), the Go garbage collector starts a new cycle once freshly allocated data reaches 100% of the live data left in the heap after the previous collection. So, if we want garbage collection to run less often, we just need to set GOGC to a higher percentage.
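To get a feel for what a higher percentage means for memory, the heap size at which the next collection starts can be approximated as the live heap multiplied by (1 + GOGC/100). The sketch below uses that simplified model with a hypothetical 100 MB live heap; the helper name nextGcHeapTarget is ours, and the real Go pacer takes more factors into account.

```javascript
// Approximate heap size (same unit as liveHeap) at which the next GC cycle starts.
// Simplified model: target = live heap * (1 + GOGC / 100).
function nextGcHeapTarget(liveHeap, gogc) {
  return liveHeap * (1 + gogc / 100);
}

// Hypothetical 100 MB live heap, evaluated for the GOGC values used in this post.
for (const gogc of [100, 500, 1000, 2000]) {
  console.log(`GOGC=${gogc}: next GC at ~${nextGcHeapTarget(100, gogc)} MB`);
}
```

The trade-off is visible in the numbers: each step up buys fewer GC cycles at the cost of a much larger heap, so memory limits on the extauth pods need to leave room for it.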
But first, let’s try something radical. Let’s turn off garbage collection. To do this, we need to modify the extauth deployment with the environment variable GOGC=off.
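One way to apply a GOGC value is to patch the environment of the extauth container. The sketch below, written for Node.js, only builds the patch document. The helper name gogcPatch, the file name gogc-patch.js, and the container name extauth are assumptions, so check the container name in your own deployment before using it.

```javascript
// Build a strategic-merge patch that sets GOGC on one container of a deployment.
// Assumption: the container inside the extauth deployment is named "extauth".
function gogcPatch(value, containerName = 'extauth') {
  return {
    spec: {
      template: {
        spec: {
          containers: [
            // Environment variable values must be strings in Kubernetes manifests.
            { name: containerName, env: [{ name: 'GOGC', value: String(value) }] },
          ],
        },
      },
    },
  };
}

// Print the patch as JSON so it can be passed to kubectl.
console.log(JSON.stringify(gogcPatch(500)));
```

Saved as gogc-patch.js, the output could be applied with `kubectl patch deployment extauth -n gloo-system --patch "$(node gogc-patch.js)"`. A later Helm upgrade may overwrite a manual patch like this, so for anything longer-lived the value belongs in your Helm values.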
With garbage collection disabled, the results are clearly very bad, which makes it plain that turning off garbage collection altogether is not a winning strategy. Somewhere during the run, we noticed that pods were evicted, the node died, and the new pods could not recover before the run was finished. Perhaps we should try something not quite so radical as turning off garbage collection.
GOGC=500
Let’s see what happens when GC runs five times slower than default.
The error rate has gone down dramatically so this is a good sign that delaying GC is helping. Let’s take a look at the Grafana dashboard.
Much improved results can be seen here. Delaying GC has brought P95 below 20ms. Let’s keep going.
GOGC=1000
What if we delay GC by 10 times?
Not only has the error rate improved, but throughput has gone from ~10k/sec to ~13k/sec. That’s quite an improvement!
These are really interesting results. While throughput increased, it also looks like latency has gone back up somewhat. This could be due to the increased number of req/sec. Perhaps one more GC change can give us more information.
GOGC=2000
This is even more interesting. Slowing down GC by 20 times, we can see that the error rate is negligible at 0.10%. In addition, throughput normalized back to 10k req/sec. Just to make sure these results were consistent, we ran this same test five times and kept getting the same results. Could it be that the amount of time it takes to GC at 20x is enough to reduce throughput? Let’s see what happened to latency.
These are the best P95 results yet! At around 10ms, it seems we have homed in on a GC value that is sustainable and provides much improved performance. The CPU graph also shows a sawtooth pattern, repeating roughly every two and a half minutes, that indicates GC cycles. Going back to the GOGC=1000 results, it seems that we hit an anomaly where GC occurred on multiple processes at once. So, we may have been unlucky. This is a great point about load/stress testing: it takes patience and multiple runs. Overall though, we can be very confident in a roughly 75x improvement in P95 latency (from 750ms to 10ms).
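The size of that improvement is simple arithmetic on the two P95 figures quoted above (750ms before tuning, 10ms after). A small sketch, using a hypothetical helper named latencyImprovement:

```javascript
// Speed-up factor and percentage reduction between two latency measurements.
function latencyImprovement(beforeMs, afterMs) {
  return {
    speedup: beforeMs / afterMs,
    reductionPercent: ((beforeMs - afterMs) / beforeMs) * 100,
  };
}

// P95 figures from the baseline run and the GOGC=2000 run.
const { speedup, reductionPercent } = latencyImprovement(750, 10);
console.log(`${speedup}x faster, ${reductionPercent.toFixed(1)}% lower P95`);
```

Expressing the change both ways avoids ambiguity: the same result is a 75-fold speed-up and a reduction of just under 99%.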
Let’s now take a look at our next strategy for improvement.
Strategy #2: Reduce Node Contention for ExtAuth using OPA
We have been running with four instances of extauth and two gateway-proxies but we haven’t paid attention to how they are distributed across the nodes. In fact, for all of the above tests (at least, after pods were evicted and a node had to be resurrected) there were three extauth instances on a single node. Not good!
Let’s fix that by setting pod anti-affinity to ensure that the gateway pods will not be placed on the same node. As it turns out, extauth is already deployed with affinity to the gateway pod to reduce latency. You can see this in the following code snippet taken from the extauth deployment.
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchLabels:
              gloo: gateway-proxy
          topologyKey: kubernetes.io/hostname
        weight: 100
We will edit the gateway-proxy deployment now to set the following anti-affinity.
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: gloo
            operator: In
            values:
            - gateway-proxy
        topologyKey: kubernetes.io/hostname
After rolling out this new deployment, let’s take a look at what effect it has.
Latency looks better again and this was an easy fix to ensure that we have enough CPU headroom. There’s one final strategy that we can employ.
Strategy #3: Scale Up ExtAuth using OPA!
The third strategy we will employ is the most common one. If you are putting the system under load and feel that it could improve, then scale the system up. For this run, we increased to four 16-core worker nodes. With our anti-affinity rules in place, we can now see that each node gets a pair of gateway & extauth pods.
The above run is our standard run and this shows that even P99 comes in under 10ms while P95 is around 2.5-3ms. We can also see that the system is not stressed as each node is utilizing at most around six cores for extauth. Let’s do one last run, but cap req/sec at 5000 and only use 500 VUs.
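For readers who want to set up a similar capped run, k6 has an rps option that limits the total request rate across all VUs. The sketch below is one possible configuration, not necessarily the one used for this post; it assumes the same ramp-up, hold, and ramp-down durations as the earlier script, and the name cappedOptions is ours.

```javascript
// Sketch of k6 options for a capped run: 500 VUs, at most 5000 requests per second.
// In a k6 script this object would be exported: `export let options = cappedOptions;`
const cappedOptions = {
  rps: 5000, // k6 option: maximum requests per second across all VUs
  stages: [
    { duration: '1m', target: 500 }, // ramp up to 500 users (assumed duration)
    { duration: '5m', target: 500 }, // hold at 500 users (assumed duration)
    { duration: '1m', target: 0 },   // ramp down to 0 users
  ],
};
```

The k6 documentation now steers users toward arrival-rate executors for fixed request rates, so treat rps as the simplest option rather than the most precise one.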
P99 is now under 2ms. That’s fantastic performance and provides us a gauge of sizing for “normal” load conditions.
Utilizing these three key strategies of increasing performance by reducing GC cycles, properly distributing pods with anti-affinity and scaling up the system gave us great results and confidence in how to properly scale the system for normal and increased load scenarios. In addition, with Gloo Edge Enterprise observability features we could get detailed information that helped us understand the bottlenecks in the system and how to address them to increase performance.
Please reach out to us on Slack if you would like to know more about performance with Gloo Edge Enterprise.