Upgrading Your API Gateway Under Heavy Load

Most of the time, API Gateways are considered critical components of a company’s infrastructure. In some industries, such as finance or telecommunications, API Gateways simply cannot stop working. They sit at the edge of mission-critical platforms and are subject to millions of requests per minute.

Platform teams operating and maintaining such business-critical components need to master the best practices for high availability. In this article, we will cover some of them in the context of upgrading the Gloo Edge API Gateway under heavy load.

Expectations for Upgrading Your API Gateway Under Heavy Load

With a regular flow of a few hundred thousand requests per second, operators want to upgrade Gloo Edge from version 1.11.28 to version 1.11.30 seamlessly. Their management was crystal clear on the SLAs: not a single request shall be mistreated.

Here are a few key figures:

  • Expected traffic is around 1 billion requests per hour
  • Requests are evenly distributed to Gloo Gateways, themselves running on a few dozen nodes
  • There is a globally available HTTP Load Balancer routing requests to a single GKE cluster
  • Clients making the HTTP requests identify themselves with API keys

Management wants to eliminate all sorts of noise between the clients and the services they consume during this operation. What the SREs want to show their management is something all green, like this:

 

Platform Setup

A simple platform is used for this demo and looks like the following:

On our documentation website, we have published general guidelines on how to perform rolling upgrades without downtime; the cloud provider was AWS at that time. In this article, we will use Google Cloud Platform instead and combine GCP-specific guidance with those general guidelines.

Google Load Balancer

As per this doc, we will create a few GCP resources to configure a global HTTP LB:

  • a Network Endpoint Group, which represents our Gloo Gateways (dynamic routes to the pods)
  • a Backend Service and the associated Health Checks
  • a URL map, a Target HTTP Proxy, and a Forwarding Rule, together forming the Google Load Balancer

Gloo Edge

Gloo Edge is deployed using the Helm values that you will find in the following GitHub repository: https://github.com/solo-io/solo-blog/tree/main/upgrade-gloo-edge-1-billion-req

Note the use of a custom readiness probe which relies on an extra Envoy/Gateway health check filter — you can find more info in this tutorial.
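
The exact values live in the repository and the linked tutorial; as a rough sketch, assuming the chart’s customReadinessProbe field and a hypothetical /envoy-hc path served by the health check filter, the relevant pieces look something like this:

# 1) Enable the Envoy health check filter on the Gateway (path is illustrative):
apiVersion: gateway.solo.io/v1
kind: Gateway
metadata:
  name: gateway-proxy
  namespace: gloo-system
spec:
  bindAddress: '::'
  bindPort: 8080
  proxyNames:
  - gateway-proxy
  httpGateway:
    options:
      healthCheck:
        path: /envoy-hc      # hypothetical path answered by the filter

# 2) Point the gateway-proxy readiness probe at that path (Helm values excerpt):
gloo:
  gatewayProxies:
    gatewayProxy:
      podTemplate:
        customReadinessProbe:
          httpGet:
            path: /envoy-hc
            port: 8080
            scheme: HTTP
          periodSeconds: 5
          failureThreshold: 2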

In addition, there is an extra annotation on the Service so that the Google Cloud controller for Kubernetes automatically creates a new Network Endpoint Group.
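
For reference, and assuming the chart’s service.extraAnnotations field, the annotation in question is the standard GKE one for standalone NEGs (the exposed port is an assumption):

# Helm values excerpt (sketch): ask the GKE controller to create a standalone
# Network Endpoint Group for the gateway-proxy Service.
gloo:
  gatewayProxies:
    gatewayProxy:
      service:
        type: ClusterIP
        extraAnnotations:
          cloud.google.com/neg: '{"exposed_ports": {"80": {}}}'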

Backend Service

For this demo, we will use a bare Envoy instance returning an HTTP 200 response for all requests. The Envoy configuration is visible in the GitHub repository under upgrade-gloo-edge-1-billion-req/direct-response-backend.
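
The actual file is in the repository; conceptually, it boils down to a single listener whose only route returns a direct response, along these lines:

# Minimal Envoy bootstrap (sketch): every request gets a 200 direct response.
static_resources:
  listeners:
  - name: direct_response_listener
    address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: direct_response
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          route_config:
            virtual_hosts:
            - name: all
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                direct_response:
                  status: 200
                  body: { inline_string: "OK" }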

The Upgrade Demo

With the settings above and the latest improvements in Gloo Edge, we expect a clean upgrade (or rollback) with a simple Helm command:

helm upgrade gloo glooe/gloo-ee --namespace gloo-system --version 1.11.30 \
  --create-namespace --set-string license_key="$LICENSE_KEY" -f values-prod-gke-apikey-scaled-up.yaml

The entire demo was recorded and posted on YouTube:

Takeaways for SREs

Upgrading with Helm

In the most recent versions of Gloo Edge (1.11.27+ and 1.12+), our engineering team has significantly improved the upgrade process. Kubernetes Jobs are now responsible for properly upgrading your Gloo resources (the resource-migration Job, in particular). The Gloo Custom Resources (Gateways and Upstreams, specifically) are no longer rendered directly as part of the Helm manifest; they are applied with kubectl to ensure correct ordering. The migration Job makes sure the CRs do not get deleted during the upgrade.

Health Checks

As depicted in this article, the most important piece is to have health checks configured both in your load balancer (towards Envoy, the downstream side) and between Envoy and the upstream services. Combined with GCP NEGs, this gives you a dynamic pool of endpoints in the load balancer, and no requests should be sent to dying Envoy instances (thanks to the graceful shutdown).
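
On the Envoy-to-upstream side, health checks are declared on the Gloo Upstream. A minimal sketch, assuming a hypothetical backend Service named direct-response-backend exposing a /healthz endpoint:

apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: direct-response-backend    # hypothetical Upstream name
  namespace: gloo-system
spec:
  kube:
    serviceName: direct-response-backend
    serviceNamespace: default
    servicePort: 8080
  healthChecks:
  - timeout: 1s
    interval: 5s
    unhealthyThreshold: 3
    healthyThreshold: 2
    httpHealthCheck:
      path: /healthz               # hypothetical health endpoint on the backend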

Troubleshooting

If you see the number of 5xx responses rising, you need to understand the root cause, which can be tricky to spot. The first thing to do is to enable the %RESPONSE_FLAGS% command operator in the access logs so that Envoy can give you hints about the actual issue.

For instance, it can be an issue on the downstream side or on the upstream side, and it can sit at the HTTP level or at the TCP level. Luckily, Envoy has a rich toolset of options ready to help you tackle these issues.
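
As a sketch, response flags can be added to the gateway-proxy access logs through the Gateway resource (default Gloo Edge names assumed, format string purely illustrative):

apiVersion: gateway.solo.io/v1
kind: Gateway
metadata:
  name: gateway-proxy
  namespace: gloo-system
spec:
  bindAddress: '::'
  bindPort: 8080
  proxyNames:
  - gateway-proxy
  httpGateway: {}
  options:
    accessLoggingService:
      accessLog:
      - fileSink:
          path: /dev/stdout
          stringFormat: "[%START_TIME%] %REQ(:METHOD)% %REQ(:PATH)% %RESPONSE_CODE% %RESPONSE_FLAGS% %UPSTREAM_HOST%\n"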

Retries

It’s pretty common to see things improve as soon as you enable retries towards the upstream service. Different conditions can trigger an HTTP retry: a connection failure, a 5xx returned by the upstream service, and many more.
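
In Gloo Edge, retries are configured as route options. A minimal sketch, assuming a hypothetical VirtualService routing to the backend Upstream shown earlier:

apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: default                        # hypothetical VirtualService name
  namespace: gloo-system
spec:
  virtualHost:
    domains: ['*']
    routes:
    - matchers:
      - prefix: /
      routeAction:
        single:
          upstream:
            name: direct-response-backend
            namespace: gloo-system
      options:
        retries:
          retryOn: 'connect-failure,5xx'   # conditions that trigger a retry
          numRetries: 3
          perTryTimeout: '2s'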

Circuit Breaker

Also, you may want to prevent your upstream services from being overwhelmed by retries triggered by your fleet of API Gateways. The maxRetries: 30 circuit-breaker option will help deal with that.
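
Circuit breakers are set on the Upstream (or globally in the Settings resource). A sketch, reusing the same hypothetical Upstream and illustrative values:

apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: direct-response-backend
  namespace: gloo-system
spec:
  kube:
    serviceName: direct-response-backend
    serviceNamespace: default
    servicePort: 8080
  circuitBreakers:
    maxRetries: 30            # cap on concurrent retries towards this upstream
    maxPendingRequests: 1024  # illustrative values
    maxRequests: 1024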

Scale Up

If you see the number of 503 (Service Unavailable) responses increase, you may want to scale out your backend services.
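
One way to do that is with a HorizontalPodAutoscaler on the backend Deployment; a sketch, assuming a hypothetical Deployment named direct-response-backend and arbitrary thresholds:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: direct-response-backend        # hypothetical backend Deployment
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: direct-response-backend
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70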

Timeout Settings

There are many places where you can tweak timeouts: upstream or downstream, at the connection level or at the HTTP request level, on the main requests or on the health checks, on the retries, and so on.

We have put together a list of all the timeout options available in Gloo Edge on this page: https://docs.solo.io/gloo-edge/master/reference/cheatsheets/timeouts/
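
As a quick illustration (values are arbitrary), two of the most common knobs are the per-route request timeout, set in the VirtualService route options shown earlier, and the TCP connect timeout, set on the Upstream:

# Excerpt: route-level request timeout (goes in the VirtualService route options)
options:
  timeout: '15s'
---
# Excerpt: TCP connect timeout towards the backend (goes in the Upstream spec)
spec:
  connectionConfig:
    connectTimeout: 3s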

CNI

Finally, if you have thousands of pods running in your cluster, another point you may want to consider is adopting an eBPF-based CNI such as Cilium. Cilium performs extremely well at scale, outperforming the good old iptables-based approach.

More generally, see these guidelines for production-grade deployments: https://docs.solo.io/gloo-edge/master/operations/production_deployment/

Final Words

As you can see in this article, Gloo Edge configuration is fully driven by code. Your configuration lives in a Git repository, and Gloo Edge is a natural fit for the GitOps ecosystem.

Being able to upgrade your API Gateway under heavy load is key to keeping business-critical components healthy. Also, note that Gloo Edge leverages the Kubernetes API in many areas (service discovery, proxy readiness probes, configuration management, cloud-provider-specific annotations) to help you better manage your configuration.