Upgrading Your API Gateway Under Heavy Load
Most of the time, API Gateways are considered critical components of companies’ infrastructure. In some areas, like finance or telcos, API Gateways simply cannot stop working. They sit at the edge of mission-critical platforms and are subject to millions of requests per minute.
Platform teams operating and maintaining such business-critical components need to know all the best practices in terms of high availability. In this article, we will cover some of them in the context of upgrading the version of the Gloo Edge API Gateway under heavy load.
Expectations for Upgrading Your API Gateway Under Heavy Load
With a regular flow of a few hundred thousand requests per second, operators want to upgrade Gloo Edge from version 1.11.28 to version 1.11.30 seamlessly. Their management was crystal clear on the SLAs: not a single request may be mishandled.
Here are a few key figures:
- Expected traffic is around 1 billion requests per hour
- Requests are evenly distributed to Gloo Gateways, themselves running on a few dozen nodes
- There is a globally available HTTP Load Balancer routing requests to a single GKE cluster
- Clients making the HTTP requests identify themselves with API keys
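To put these figures in perspective, here is a quick back-of-the-envelope calculation in Python. The request volume comes from the list above; the node count of 36 and the error rate are hypothetical values picked only for illustration.

```python
# Back-of-the-envelope sizing for the traffic described above.
REQUESTS_PER_HOUR = 1_000_000_000  # figure taken from the article
ASSUMED_NODES = 36                 # hypothetical: "a few dozen nodes"
ASSUMED_ERROR_RATE = 0.00001       # hypothetical: 0.001% of requests failing


def per_second(requests_per_hour: float) -> float:
    """Convert an hourly request volume to requests per second."""
    return requests_per_hour / 3600


def per_node(requests_per_second: float, nodes: int) -> float:
    """Average load per node, assuming an even distribution."""
    return requests_per_second / nodes


def failed_requests(requests_per_hour: float, error_rate: float) -> int:
    """Number of failed requests in one hour for a given error rate."""
    return round(requests_per_hour * error_rate)


if __name__ == "__main__":
    rps = per_second(REQUESTS_PER_HOUR)  # about 277,778 requests per second
    print(f"{rps:,.0f} req/s overall")
    print(f"{per_node(rps, ASSUMED_NODES):,.0f} req/s per node")
    print(f"{failed_requests(REQUESTS_PER_HOUR, ASSUMED_ERROR_RATE):,} failures per hour")
```

Even a tiny error rate translates into thousands of failed requests per hour at this volume, which is why the upgrade has to be clean.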
Management wants to eliminate all sorts of noise between the clients and the services they consume during this operation. What the SREs want to show their management is something that stays green:
A simple platform is used for this demo and looks like the following:
On our documentation website, we have published general guidelines on how to perform rolling upgrades without downtime; the cloud provider was AWS at that time. In this article, we will use Google Cloud Platform and combine our guidelines for GCP with the guidelines stated just before.
Google Load Balancer
As per this doc, we will create a few GCP resources to configure a global HTTP LB:
- a Network Endpoint Group, which represents our Gloo Gateways (dynamic routes to the pods)
- a Backend Service and the associated Health Checks
- a URL map, a Target HTTP Proxy, and a Forwarding Rule, together forming the Google Load Balancer
Gloo Edge is deployed using the Helm values that you will find in the following GitHub repository: https://github.com/solo-io/solo-blog/tree/main/upgrade-gloo-edge-1-billion-req
Note the use of a custom readiness probe which relies on an extra Envoy/Gateway health check filter — you can find more info in this tutorial.
In addition, there is an extra annotation on the Service so that the Google Cloud controller for Kubernetes automatically creates a new Network Endpoint Group.
For this demo, we will use a bare Envoy instance returning a 200 HTTP code for all requests. The Envoy configuration is visible in the GitHub repository under
The Upgrade Demo
With the settings above and the latest improvements of Gloo Edge put together, we expect a clean upgrade (or rollback) with a simple Helm upgrade command:
helm upgrade gloo glooe/gloo-ee --namespace gloo-system --version 1.11.30 \
--create-namespace --set-string license_key="$LICENSE_KEY" -f values-prod-gke-apikey-scaled-up.yaml
The entire demo was recorded and posted on YouTube:
Takeaways for SREs
Upgrading with Helm
In the most recent versions of Gloo Edge (v1.11.27+ and v1.12+), our engineering team has greatly improved the upgrade process. Kubernetes Jobs are now responsible for properly upgrading your Gloo resources (the resource-migration Job). The Gloo Custom Resources (specifically Gateways and Upstreams) are no longer rendered directly as part of the Helm manifest; they are applied with kubectl to ensure correct ordering. This migration Job makes sure the CRs don’t get deleted during the upgrade.
As shown in this article, the most important piece is to have health checks configured both in your load balancer (downstream of Envoy) and between Envoy and the upstream services. Combined with GCP NEGs, this gives you a dynamic pool of endpoints in the load balancer, and no requests should be sent to Envoy instances that are shutting down (thanks to the graceful shutdown).
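One way to reason about the graceful shutdown is to compare the time Envoy keeps serving after it starts failing its health check with the time the load balancer needs to notice. The sketch below uses a deliberately simplified model (probes at a fixed interval, an endpoint removed after a number of consecutive failures); the function names and the example values are hypothetical and not taken from the Gloo Edge or GCP configuration used in the demo.

```python
# Simplified model: how long can a load balancer take to stop sending
# traffic to an instance that has started failing its health check?

def worst_case_detection_seconds(interval_s: float,
                                 unhealthy_threshold: int,
                                 timeout_s: float) -> float:
    """Rough upper bound on detection time.

    Assumes the instance starts failing right after a successful probe,
    needs `unhealthy_threshold` consecutive failed probes, and that the
    last probe may take up to `timeout_s` to be declared failed.
    """
    return interval_s * unhealthy_threshold + timeout_s


def drain_is_sufficient(drain_s: float,
                        interval_s: float,
                        unhealthy_threshold: int,
                        timeout_s: float) -> bool:
    """True if the instance keeps serving at least as long as detection takes."""
    return drain_s >= worst_case_detection_seconds(
        interval_s, unhealthy_threshold, timeout_s
    )


if __name__ == "__main__":
    # Hypothetical values: probe every 5 s, 2 failures to evict, 5 s timeout.
    print(worst_case_detection_seconds(5, 2, 5))  # 15
    print(drain_is_sufficient(30, 5, 2, 5))       # True
```

If the drain period is shorter than this bound, some requests may still be routed to an instance that has already stopped serving.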
If you see the number of 5xx responses rising, you need to understand the root cause, which can be tricky to spot. The first thing to do is to enable the %RESPONSE_FLAGS% command operator in the access logs so that Envoy can give you a hint about the actual issue.
For instance, it can be an issue on the downstream side or an issue on the upstream side. Also, the issue can be at the HTTP level or at the TCP level. Luckily, Envoy has a rich toolset of options ready to help you tackle these issues.
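As an illustration of how the response flags help narrow things down, here is a small Python sketch that decodes the flags field of an access log entry. The table covers only a subset of the flags documented by Envoy, and the grouping into "upstream", "downstream", "routing" and "local" is a simplification made for this example. It assumes the flags arrive as a comma-separated string, with "-" meaning no flag.

```python
from typing import List, Tuple

# Subset of Envoy response flags: flag -> (side, meaning).
RESPONSE_FLAGS = {
    "UH": ("upstream", "no healthy upstream hosts"),
    "UF": ("upstream", "upstream connection failure"),
    "UO": ("upstream", "upstream overflow (circuit breaking)"),
    "UT": ("upstream", "upstream request timeout"),
    "UR": ("upstream", "upstream remote reset"),
    "UC": ("upstream", "upstream connection termination"),
    "URX": ("upstream", "retry limit or max connect attempts reached"),
    "NR": ("routing", "no route configured"),
    "DC": ("downstream", "downstream connection termination"),
    "LR": ("local", "connection local reset"),
}


def explain(flags: str) -> List[Tuple[str, str, str]]:
    """Return (flag, side, meaning) for each flag in a comma-separated string."""
    if flags in ("", "-"):
        return []
    result = []
    for flag in flags.split(","):
        flag = flag.strip()
        side, meaning = RESPONSE_FLAGS.get(flag, ("unknown", "not in this table"))
        result.append((flag, side, meaning))
    return result


if __name__ == "__main__":
    for flag, side, meaning in explain("UF,URX"):
        print(f"{flag}: {side} side, {meaning}")
```

Counting the decoded flags over a time window is usually enough to tell whether the 5xx responses come from the upstream side or the downstream side.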
It’s pretty common to see things improve as soon as you enable retries on the upstream service. There are different conditions that can trigger an HTTP retry, like a connection failure, a 5xx returned by the upstream service, and many more.
Also, you may want to prevent your upstream services from being overwhelmed by retries triggered by your fleet of API Gateways. The maxRetries: 30 option will help to deal with that.
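To show the idea behind such a cap, here is a toy Python model of a retry budget. It assumes that maxRetries limits the number of retries in flight at the same time, as a circuit breaker would; the RetryBudget class is hypothetical and is not how Envoy or Gloo Edge implement it.

```python
class RetryBudget:
    """Toy model of a cap on concurrent retries toward one upstream."""

    def __init__(self, max_retries: int) -> None:
        self.max_retries = max_retries
        self.in_flight = 0

    def try_acquire(self) -> bool:
        """Reserve a retry slot; return False when the cap is reached."""
        if self.in_flight >= self.max_retries:
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Free a slot once a retry has completed."""
        if self.in_flight > 0:
            self.in_flight -= 1


if __name__ == "__main__":
    budget = RetryBudget(max_retries=30)
    # 100 requests fail at once and all ask for a retry.
    granted = sum(budget.try_acquire() for _ in range(100))
    print(f"retries allowed: {granted}, rejected: {100 - granted}")
```

With the cap in place, a burst of failures produces a bounded number of extra requests instead of doubling the load on an upstream that is already struggling.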
If you see the number of 503 (service unavailable) responses increase, you may want to scale up the number of backend services.
There are many places where you can tweak timeouts: upstream, downstream, at the connection level or at the HTTP request level, on the main requests or on the health checks, on the retries, etc.
We have put together a list of all the timeouts options available for Gloo Edge on this page: https://docs.solo.io/gloo-edge/master/reference/cheatsheets/timeouts/
Finally, if you have thousands of pods running in your cluster, another point to consider is adopting an eBPF-based CNI, like Cilium. Cilium performs extremely well at scale, outperforming the good old iptables.
More generally, see these guidelines for production-grade deployments: https://docs.solo.io/gloo-edge/master/operations/production_deployment/
As you can see in this article, Gloo Edge configuration is fully driven by code. Your configuration repository is Git, and Gloo Edge and the GitOps ecosystem are definitely a good fit.
Upgrading your API Gateway under heavy load can help you to maintain business-critical components. Also, note that Gloo Edge leverages the Kubernetes API in many areas: service discovery, proxy readiness probe, configuration management, and other cloud provider-specific annotations, in order to help you better manage your configuration.