Upgrading Your API Gateway Under Heavy Load

Most of the time, API Gateways are considered critical components of a company’s infrastructure. In some industries, such as finance or telecommunications, API Gateways simply cannot stop working. They sit at the edge of mission-critical platforms and are subject to millions of requests per minute.

Platform teams operating and maintaining such business-critical components need to master the best practices for high availability. In this article, we will cover some of them in the context of upgrading the Gloo Edge API Gateway under heavy load.

Expectations for Upgrading Your API Gateway Under Heavy Load

With a regular flow of a few hundred thousand requests per second, operators want to upgrade Gloo Edge from version 1.11.28 to version 1.11.30 seamlessly. Their management was crystal clear on the SLAs: not a single request shall be mistreated.

Here are a few key figures:

  • Expected traffic is around 1 billion requests per hour
  • Requests are evenly distributed to Gloo Gateways, themselves running on a few dozen nodes
  • There is a globally available HTTP Load Balancer routing requests to a single GKE cluster
  • Clients making the HTTP requests identify themselves with API keys

Management wants to eliminate all sorts of noise between the clients and the services they consume during this operation. What the SREs want to show their management is something all green, like this:

 

Platform Setup

A simple platform is used for this demo and looks like the following:

On our documentation website, we have published general guidelines on how to perform rolling upgrades without downtime; the cloud provider was AWS at that time. In this article, we will use Google Cloud Platform instead and combine GCP-specific guidance with those general guidelines.

Google Load Balancer

As per this doc, we will create a few GCP resources to configure a global HTTP LB:

  • a Network Endpoint Group, which represents our Gloo Gateways (dynamic routes to the pods)
  • a Backend Service and the associated Health Checks
  • a URL map, a Target HTTP Proxy, and a Forwarding Rule, together forming the Google Load Balancer

Gloo Edge

Gloo Edge is deployed using the Helm values that you will find in the following GitHub repository: https://github.com/solo-io/solo-blog/tree/main/upgrade-gloo-edge-1-billion-req

Note the use of a custom readiness probe which relies on an extra Envoy/Gateway health check filter — you can find more info in this tutorial.
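
The exact values live in the repository and the linked tutorial; as a rough sketch, assuming the chart’s customReadinessProbe field and a hypothetical /envoy-hc path served by the health check filter, the relevant pieces look something like this:

# 1) Enable the Envoy health check filter on the Gateway (path is illustrative):
apiVersion: gateway.solo.io/v1
kind: Gateway
metadata:
  name: gateway-proxy
  namespace: gloo-system
spec:
  bindAddress: '::'
  bindPort: 8080
  proxyNames:
  - gateway-proxy
  httpGateway:
    options:
      healthCheck:
        path: /envoy-hc      # hypothetical path answered by the filter

# 2) Point the gateway-proxy readiness probe at that path (Helm values excerpt):
gloo:
  gatewayProxies:
    gatewayProxy:
      podTemplate:
        customReadinessProbe:
          httpGet:
            path: /envoy-hc
            port: 8080
            scheme: HTTP
          periodSeconds: 5
          failureThreshold: 2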

In addition, there is an extra annotation on the Service so that the Google Cloud controller for Kubernetes automatically creates a new Network Endpoint Group.
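
For reference, and assuming the chart’s service.extraAnnotations field, the annotation in question is the standard GKE one for standalone NEGs (the exposed port is an assumption):

# Helm values excerpt (sketch): ask the GKE controller to create a standalone
# Network Endpoint Group for the gateway-proxy Service.
gloo:
  gatewayProxies:
    gatewayProxy:
      service:
        type: ClusterIP
        extraAnnotations:
          cloud.google.com/neg: '{"exposed_ports": {"80": {}}}'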

Backend Service

For this demo, we will use a bare Envoy instance returning an HTTP 200 response for all requests. The Envoy configuration is visible in the GitHub repository under upgrade-gloo-edge-1-billion-req/direct-response-backend.
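
The actual file is in the repository; conceptually, it boils down to a single listener whose only route returns a direct response, along these lines:

# Minimal Envoy bootstrap (sketch): every request gets a 200 direct response.
static_resources:
  listeners:
  - name: direct_response_listener
    address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: direct_response
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          route_config:
            virtual_hosts:
            - name: all
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                direct_response:
                  status: 200
                  body: { inline_string: "OK" }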

The Upgrade Demo

With the settings above and the latest improvements in Gloo Edge, we expect a clean upgrade (or rollback) with a simple Helm command:

helm upgrade gloo glooe/gloo-ee --namespace gloo-system --version 1.11.30 \
  --create-namespace --set-string license_key="$LICENSE_KEY" -f values-prod-gke-apikey-scaled-up.yaml

The entire demo was recorded and posted on YouTube:

Takeaways for SREs

Upgrading with Helm

In the most recent versions of Gloo Edge (1.11.27+ and 1.12+), our engineering team has significantly improved the upgrade process. Kubernetes Jobs are now responsible for properly upgrading your Gloo resources (the resource-migration Job, in particular). The Gloo Custom Resources (Gateways and Upstreams, specifically) are no longer rendered directly as part of the Helm manifest; they are applied with kubectl to ensure correct ordering. The migration Job makes sure the CRs do not get deleted during the upgrade.

Health Checks

As depicted in this article, the most important piece is to have health checks configured both in your load balancer (towards Envoy, the downstream side) and between Envoy and the upstream services. Combined with GCP NEGs, this gives you a dynamic pool of endpoints in the load balancer, and no requests should be sent to dying Envoy instances (thanks to the graceful shutdown).
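
On the Envoy-to-upstream side, health checks are declared on the Gloo Upstream. A minimal sketch, assuming a hypothetical backend Service named direct-response-backend exposing a /healthz endpoint:

apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: direct-response-backend    # hypothetical Upstream name
  namespace: gloo-system
spec:
  kube:
    serviceName: direct-response-backend
    serviceNamespace: default
    servicePort: 8080
  healthChecks:
  - timeout: 1s
    interval: 5s
    unhealthyThreshold: 3
    healthyThreshold: 2
    httpHealthCheck:
      path: /healthz               # hypothetical health endpoint on the backend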

Troubleshooting

If you see the number of 5xx responses rising, you need to understand the root cause, which can be tricky to spot. The first thing to do is to enable the %RESPONSE_FLAGS% command operator in the access logs so that Envoy can give you hints about the actual issue.

For instance, it can be an issue on the downstream side or on the upstream side, and it can sit at the HTTP level or at the TCP level. Luckily, Envoy has a rich toolset of options ready to help you tackle these issues.
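
As a sketch, response flags can be added to the gateway-proxy access logs through the Gateway resource (default Gloo Edge names assumed, format string purely illustrative):

apiVersion: gateway.solo.io/v1
kind: Gateway
metadata:
  name: gateway-proxy
  namespace: gloo-system
spec:
  bindAddress: '::'
  bindPort: 8080
  proxyNames:
  - gateway-proxy
  httpGateway: {}
  options:
    accessLoggingService:
      accessLog:
      - fileSink:
          path: /dev/stdout
          stringFormat: "[%START_TIME%] %REQ(:METHOD)% %REQ(:PATH)% %RESPONSE_CODE% %RESPONSE_FLAGS% %UPSTREAM_HOST%\n"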

Retries

It’s pretty common to see things improve as soon as you enable retries towards the upstream service. Different conditions can trigger an HTTP retry: a connection failure, a 5xx returned by the upstream service, and many more.
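
In Gloo Edge, retries are configured as route options. A minimal sketch, assuming a hypothetical VirtualService routing to the backend Upstream shown earlier:

apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: default                        # hypothetical VirtualService name
  namespace: gloo-system
spec:
  virtualHost:
    domains: ['*']
    routes:
    - matchers:
      - prefix: /
      routeAction:
        single:
          upstream:
            name: direct-response-backend
            namespace: gloo-system
      options:
        retries:
          retryOn: 'connect-failure,5xx'   # conditions that trigger a retry
          numRetries: 3
          perTryTimeout: '2s'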

Circuit Breaker

Also, you may want to prevent your upstream services from being overwhelmed by retries triggered by your fleet of API Gateways. The maxRetries: 30 circuit-breaker option will help deal with that.
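
Circuit breakers are set on the Upstream (or globally in the Settings resource). A sketch, reusing the same hypothetical Upstream and illustrative values:

apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: direct-response-backend
  namespace: gloo-system
spec:
  kube:
    serviceName: direct-response-backend
    serviceNamespace: default
    servicePort: 8080
  circuitBreakers:
    maxRetries: 30            # cap on concurrent retries towards this upstream
    maxPendingRequests: 1024  # illustrative values
    maxRequests: 1024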

Scale Up

If you see the number of 503 (Service Unavailable) responses increase, you may want to scale out your backend services.
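
One way to do that is with a HorizontalPodAutoscaler on the backend Deployment; a sketch, assuming a hypothetical Deployment named direct-response-backend and arbitrary thresholds:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: direct-response-backend        # hypothetical backend Deployment
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: direct-response-backend
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70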

Timeout Settings

There are many places where you can tweak timeouts: upstream or downstream, at the connection level or at the HTTP request level, on the main requests or on the health checks, on the retries, and so on.

We have put together a list of all the timeout options available in Gloo Edge on this page: https://docs.solo.io/gloo-edge/master/reference/cheatsheets/timeouts/
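
As a quick illustration (values are arbitrary), two of the most common knobs are the per-route request timeout, set in the VirtualService route options shown earlier, and the TCP connect timeout, set on the Upstream:

# Excerpt: route-level request timeout (goes in the VirtualService route options)
options:
  timeout: '15s'
---
# Excerpt: TCP connect timeout towards the backend (goes in the Upstream spec)
spec:
  connectionConfig:
    connectTimeout: 3s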

CNI

Finally, if you have thousands of pods running in your cluster, another point you may want to consider is adopting an eBPF-based CNI such as Cilium. Cilium performs extremely well at scale, outperforming the good old iptables-based approach.

More generally, see these guidelines for production-grade deployments: https://docs.solo.io/gloo-edge/master/operations/production_deployment/

Final Words

As you can see in this article, Gloo Edge configuration is fully driven by code. Your configuration lives in a Git repository, and Gloo Edge is a natural fit for the GitOps ecosystem.

Being able to upgrade your API Gateway under heavy load is key to keeping business-critical components healthy. Also, note that Gloo Edge leverages the Kubernetes API in many areas (service discovery, proxy readiness probes, configuration management, cloud-provider-specific annotations) to help you better manage your configuration.