Best Practices for Large-Scale API Gateway Deployments

Solo has been supporting customers in safely deploying cloud-native API gateways since 2019. Over that time we have learned a lot from our most demanding users in large-scale environments. This article distills that hard-won experience into a set of digestible best practices for large-scale API gateway deployments on Gloo Edge. It is by no means comprehensive, but it gives you a place to start your journey, with the guidelines that offer the best return on your engineering investment.

Architect Carefully

Consider a Tiered Gateway Architecture

Most new Gloo Edge users begin with modest, single-cluster Kubernetes deployments as depicted in the diagram below. They typically employ a single Envoy proxy instance within the cluster, served by a single Gloo Edge control plane and often fronting multiple workloads, both inside and outside the cluster.

As their usage of Gloo Edge expands, they often on-board additional application teams, who have their own workloads deployed in separate clusters as shown below. As request volumes increase within each cluster, it may even be necessary to horizontally scale the data plane with additional Envoy instances. To distribute traffic across these clusters, a simple L4 load balancer may be used to handle that basic routing. For AWS users, a Network Load Balancer is a frequent and efficient choice.

As users grow more sophisticated, it often makes sense to delegate more responsibility to the initial routing tier. For example, a platform security team might want to apply universal security policies that span all applications in their domain.

Web Application Firewall rules might help thwart zero-day vulnerabilities like Log4Shell that compromised many enterprise systems in late 2021. These could be deployed rapidly at an outer gateway tier to protect Java systems that had not yet had time to apply Log4J patches or modify their configurations.

Many Solo customers also want to normalize their authNZ strategies across all applications, or provide a baseline set of rate limiting rules to protect all Internet-facing services against DDoS attacks.

These types of use cases lead large organizations to consider a tiered gateway approach as depicted below. The API Gateway Routing Tier often operates in a dedicated cluster, and contains security and routing policies as the initial landing point for user requests. It then routes those requests to workload clusters, whose routing policies can then be focused more on application-specific requirements.

In short, a tiered architecture often allows for better separation of concerns across application and platform teams, enabling each group to focus more on its core mission.

Consider Gloo Mesh for superior multi-tenant support

For organizations that require more sophistication in their approach to tiered gateways and multi-cluster deployments, migrating to Gloo Mesh is a potential next step on this journey. Gloo Mesh relies on the Istio control plane, the most battle-tested control plane for Envoy proxies anywhere.

It also supports a concept called Workspaces, a Kubernetes Custom Resource that allows namespaces across multiple clusters to be grouped logically. Resources must be explicitly imported and exported in order to be shared across workspaces.
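
As a rough illustration, here is a minimal sketch of a Workspace and a companion WorkspaceSettings object, assuming the Gloo Mesh 2.x admin.gloo.solo.io/v2 API; the cluster, namespace, and workspace names are purely illustrative, and the exportTo stanza is one example of the explicit sharing described above.

    apiVersion: admin.gloo.solo.io/v2
    kind: Workspace
    metadata:
      name: app-team
      namespace: gloo-mesh
    spec:
      workloadClusters:
      - name: cluster-1          # cluster registered with Gloo Mesh (illustrative)
        namespaces:
        - name: app-namespace    # namespaces owned by this tenant
    ---
    apiVersion: admin.gloo.solo.io/v2
    kind: WorkspaceSettings
    metadata:
      name: app-team
      namespace: app-namespace
    spec:
      exportTo:                  # explicitly share this workspace's resources
      - workspaces:
        - name: gateway-workspace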

Plan for Multi-Cluster Operations

For large-scale deployments considering how to operate in a multi-region context, multi-cluster architectures are a must. Managing service failover is typically one of the first questions asked.

With Gloo Edge, the Gloo Federation feature is the best approach. Gloo Federation (GlooFed) manages the configuration of multiple Edge instances in a single place, regardless of the underlying platform, and enables users to create global routing policies that span Edge instances across Kubernetes clusters. The full GlooFed documentation is available here.

Multi-cluster operations are another area where large enterprises should seriously consider adopting Gloo Mesh. In addition to supporting equivalent API gateway capabilities to Gloo Edge, it offers richer abstractions to support cross-cluster failover that leverage the Istiod control plane to manage its Envoy proxy fleet. Check out the Gloo Mesh documentation for more details on configuring these types of routing policies for both north-south and east-west scenarios.

Scale Predictably

Large-scale Gloo Edge deployments typically require flexibility around scaling options to support demanding workloads. This section outlines the most commonly used of these features.

Scale the Data Plane

One of the first questions our customers ask us is how many Envoy proxy instances (the gateway-proxy pods in Gloo Edge) are needed to support their workloads. We generally answer that two instances are needed to achieve high availability, but it is rare that an organization needs to deploy more instances strictly for performance reasons. A more thorough analysis of the scaling characteristics of Envoy proxies with Gloo Edge is available here.

Envoy proxies scale extraordinarily well and typically in a linear fashion as depicted above.

If your environment requires more than one Envoy proxy, there are a couple of models that you can use.

The simplest model is to specify a static number of Envoy replicas. If you’re installing with Helm charts, this setting is the one you want: gloo.gatewayProxies.gatewayProxy.kind.deployment.replicas.
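
For example, here is a minimal Helm values sketch that pins the proxy fleet at a fixed size using the setting named above; the replica count of three is purely illustrative.

    # values.yaml -- static Envoy proxy replica count
    gloo:
      gatewayProxies:
        gatewayProxy:
          kind:
            deployment:
              replicas: 3   # illustrative; two is the usual minimum for high availability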

NOTE: All Gloo Edge Enterprise Helm chart values are documented here.

There are more complex scaling models available with Helm chart customization as well. For example, the gloo.gatewayProxies.gatewayProxy.horizontalPodAutoscaler settings below allow you to configure a dynamic autoscaling model. Specify a minimum and maximum number of replicas, along with an Envoy CPU percentage threshold at which you want to scale up (or down) the number of proxy replicas.

Helm Setting                          Value
...apiVersion                         autoscaling/v1
...minReplicas                        minimum replica count
...maxReplicas                        maximum replica count
...targetCPUUtilizationPercentage     CPU % to trigger scaling action
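
Expressed as Helm values, a minimal autoscaling sketch using these settings might look like the following; the replica bounds and CPU threshold are illustrative and should be tuned to your own load profile.

    # values.yaml -- dynamic autoscaling for the gateway proxy
    gloo:
      gatewayProxies:
        gatewayProxy:
          horizontalPodAutoscaler:
            apiVersion: autoscaling/v1
            minReplicas: 2                        # never fall below the HA minimum
            maxReplicas: 6                        # illustrative upper bound
            targetCPUUtilizationPercentage: 75    # scale out when proxy CPU exceeds 75%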

There are more sophisticated models available in the autoscaling/v2beta2 apiVersion that provide additional control over the metrics used to drive scaling actions, and that allow separate thresholds for scaling up and scaling down. All Helm chart customization values are documented here.

Scale the ExtAuth Service

Another common element of the data plane that requires scaling in large deployments is the ExtAuth service.

Gloo Edge Enterprise provides a variety of authNZ options to meet the needs of your deployments. Architecturally, Gloo Edge uses a dedicated ExtAuth server to verify user credentials and determine their authorizations.

While some authentication solutions, such as JWT verification, can occur directly in Envoy, many authNZ decisions are delegated to an external service. As a general solution for handling a large number of authNZ requests at scale, Envoy supports an external auth filter that reaches out to another service to authenticate and authorize requests. Gloo Edge Enterprise delivers a built-in ExtAuth server that supports most standard authNZ use cases, as well as a plugin framework for customization.

The graphic below provides context on how and when authNZ policies are evaluated by Gloo Edge and processed by Envoy relative to other security features.

The number of ExtAuth service replicas in Gloo Edge is controlled by this Helm chart setting: global.extensions.extAuth.deployment.replicas. More sophisticated models for scaling ExtAuth replicas are unavailable at this time.
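
For reference, a minimal Helm values sketch that sets this value might look like the following; the replica count is illustrative.

    # values.yaml -- scale the built-in ExtAuth service
    global:
      extensions:
        extAuth:
          deployment:
            replicas: 3   # illustrative; size to your authNZ request volume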

Performance and scale testing results for authNZ data path services are available in this blog post. Full ExtAuth documentation is available here.

Scale the Control Plane?

Many new users to Gloo Edge assume that the control plane (gloo pod) will need to be scaled along with the Envoy proxies (gateway-proxy pods) and ExtAuth services (extauth pods).

However, because Gloo Edge is architected to separate the control and data planes, this is generally NOT the case. It is seldom a good idea to scale the gloo deployment above a single instance. Please reach out to Solo field engineering if you conclude that your deployment may be an exception.

Other Performance Tips

Solo maintains a production readiness checklist that should be reviewed before any large-scale deployment. This section highlights a couple of the most common techniques we see employed at scale.

Disable Discovery if not needed

Service discovery is a useful feature for many Gloo Edge customers, especially when starting out and in development environments. It is enabled by default. See an example of its usage here.

However, in large-scale production environments, discovery is much less commonly used. A disciplined organization will typically have all its production Upstreams pre-defined and tested, so discovery serves no purpose other than to consume compute resources. In large environments, these resources can be considerable.

The gloo.discovery.enabled: false Helm setting allows you to switch discovery off. See also the documentation here.
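
A minimal Helm values sketch for this setting:

    # values.yaml -- switch off service and function discovery
    gloo:
      discovery:
        enabled: false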

Disable Kubernetes Destinations if not needed

Gloo Edge requires Upstreams as routing targets. By default, it can also route to Kubernetes destinations that do not have Upstreams explicitly defined. However, this additional flexibility comes with a runtime cost.

As with the discovery discussion above, a disciplined organization will typically have all its production Upstreams pre-defined and tested, so this feature serves no purpose other than to consume compute resources.

You can override the default and disable implicit Kubernetes destination routing with the Helm value settings.disableKubernetesDestinations: true. See more information here.
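
A minimal Helm values sketch for this setting follows; depending on which chart you install, the value may need to sit under the gloo. prefix as with the other settings above.

    # values.yaml -- route only to explicitly defined Upstreams
    settings:
      disableKubernetesDestinations: true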

Upgrade Carefully

Any type of software upgrade raises the risk of service outages. Upgrades of the Gloo Edge product itself, and of the applications that depend on it as a gateway, are no exception. There are two primary options to consider when planning a Gloo Edge upgrade.

Option 1: Consider Zero-Downtime Gateway Rollouts

As a first option, use Solo’s Zero-Downtime principles to manage upgrades as discussed in the Gloo Edge documentation here. The key is to configure health checks, retries, and failovers for the underlying Envoy proxy fleet and your upstream services in order to enable seamless failover to new versions. This YouTube video from Solo chief architect Yuval Kohavi lays out the core principles that are key to success with zero-downtime rollouts like this.

Option 2: Consider Canary Upgrades Across Product Versions

If Option 1 is inadequate for your situation, then consider Gloo-managed canary upgrades. Gloo Edge 1.9+ adds the ability to update your product deployments with a canary model. With canaries, you run two different versions of the deployment in your data plane and can verify that the newer version handles traffic as you expect before completely cutting over to it. See this documentation for details.

Consider Gloo Mesh for Automatic Version Management

Gloo Mesh uses the battle-hardened Istio control plane to program the Envoy proxies that sit at the edge of its service meshes. A soon-to-be-released feature allows organizations to automate their product upgrades using Istio Lifecycle Management, even across multiple Kubernetes clusters.

If product version upgrades at scale are a significant concern, consider evaluating Gloo Mesh to deliver Edge-like API gateway functionality with automatic upgrades.

Use Canary Upgrades for Applications

Gloo Edge has an even longer history of supporting canary upgrades for upstream applications than for its product releases.

Gloo Edge supports user-managed canary releases, where an operations engineer manages the cutover progress of the new service release. In addition, automated progressive canary techniques such as those popularized by Weaveworks Flagger are also supported in Edge.

Source: https://www.weave.works/oss/flagger/

Full documentation on application canary upgrades is available here. A 3-part Solo blog series starting here also provides detailed examples.

Both product and application upgrades are also enhanced by robust active and passive health check protocols that help Envoy take quick action to remove upstream services that are unhealthy or being retired. See more in the later section on Enhancing Reliability and in this Gloo Edge documentation on navigating zero-downtime upgrades.

Configure Reliably

GitOps principles like declarative configuration, versioned and immutable desired state, and continuous reconciliation of changes are gaining wide acceptance in the enterprise IT community, especially to manage large-scale deployments. Solo strongly encourages adoption of these principles in its customer base.

Consider GitOps with ArgoCD

There are multiple GitOps platforms available; one of the most popular is ArgoCD. Solo offers resources for enterprises looking to manage both the Gloo Edge product deployment and their own service deployments with GitOps. The gitops-library repo offers examples of managing products and application services using ArgoCD. The ArgoCD console screenshot below depicts management of the core Gloo Edge product using gitops-library.

NOTE: The gitops-library repo is maintained by Solo field engineers on a best-effort basis; it is not part of the supported Gloo Edge product.

Another tutorial resource for getting started with Gloo Edge product and application configuration with ArgoCD is available in this blog post.

Consider GitOps with Weaveworks Flux and Flagger

Weaveworks, with its Flux and Flagger open-source frameworks, is also popular among Solo customers looking for reliable deployment patterns.

Solo has provided introductory content around these platforms here and here, and has discussed multi-cluster operations with GitOps here. You might also learn from this joint talk with a Weaveworks engineer at SoloCon 2022.

Beware Default Configurations

Gloo Edge strives to provide a “Batteries Included” experience for the default product installation. That means that capabilities like rate limiting and observability with Prometheus and Grafana will work without modification out-of-the-box.

Rate limiting is enabled by default within Gloo Edge by a nominally configured rate-limit server with a supporting Redis cache to manage the request counts. This is never the configuration you want to advance into production. Most large-scale users attach to an externally configured, enterprise-scale Redis or DynamoDB service, and in some cases scale up the rate-limit server as well. The Gloo Edge documentation describes options for configuring a scalable rate limiting deployment here.

As with rate limiting, Gloo Edge deploys nominally configured Prometheus and Grafana instances to enable a “Batteries Included” initial experience. In addition, Gloo Edge produces Grafana dashboards that leverage generated metrics to deliver insights into the traffic it manages.

Gloo Edge-generated Grafana dashboard sample

However, these nominal instances are practically never what you want in production, especially with a large-scale deployment. The Gloo Edge documentation describes the best practice of connecting to your organization’s properly scaled Prometheus and Grafana services. The Edge production readiness documentation also discusses this topic.

Observe and Monitor Maniacally

Deep observation and intelligent monitoring are critical to the success of any large-scale application network deployment. Gloo Edge users in these environments invariably leverage log collection and monitoring platforms like New Relic, Datadog, and/or Splunk.

Envoy proxy produces a wealth of metrics as cataloged in its documentation. Envoy instances managed by Gloo Edge publish these statistics in addition to the ones produced by the Gloo Edge control plane. All of these observability metrics can be scraped using your Prometheus instance.

Envoy can produce customizable Access Log records for each request that is processed by its proxies. Edge access logging is discussed here, and the Envoy documentation describing the details of the format strings is available here.
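
As a rough sketch, access logging in Gloo Edge is configured on the Gateway resource; the fragment below assumes the default gateway-proxy Gateway in gloo-system and shows only the relevant options section, emitting a simple custom-format log line to stdout. The format string is illustrative and uses standard Envoy command operators.

    apiVersion: gateway.solo.io/v1
    kind: Gateway
    metadata:
      name: gateway-proxy
      namespace: gloo-system
    spec:
      options:
        accessLoggingService:
          accessLog:
          - fileSink:
              path: /dev/stdout   # write access logs to the proxy container's stdout
              stringFormat: "[%START_TIME%] %REQ(:METHOD)% %REQ(:PATH)% %RESPONSE_CODE% %DURATION%\n"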

Enhance Reliability

One mark of a mature Gloo Edge user is adopting some simple but often overlooked practices that help to ensure overall system reliability. Three of the most effective of these practices are active health checks against upstreams, passive health checks (i.e., outlier detection), and service retries.

Active and passive health checks can be used simultaneously and together form a robust defense against Envoy routing traffic to unhealthy upstreams. This document demonstrates how to configure Gloo Edge upstreams with both active and passive health checks. To understand in depth what’s happening in Envoy beneath the covers, we highly recommend this resource on Envoy health checking in general, and this one on Envoy outlier detection.
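
A minimal Upstream sketch with both an active HTTP health check and passive outlier detection might look like the following; the service reference, health endpoint, thresholds, and intervals are all illustrative and should match your own workload.

    apiVersion: gloo.solo.io/v1
    kind: Upstream
    metadata:
      name: my-app
      namespace: gloo-system
    spec:
      kube:
        serviceName: my-app          # illustrative Kubernetes service
        serviceNamespace: default
        servicePort: 8080
      healthChecks:                  # active health checking
      - healthyThreshold: 1
        unhealthyThreshold: 3
        interval: 10s
        timeout: 3s
        httpHealthCheck:
          path: /healthz             # illustrative health endpoint
      outlierDetection:              # passive health checking (outlier detection)
        consecutive5xx: 3            # eject a host after three consecutive 5xx responses
        interval: 10s
        baseEjectionTime: 30s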

Even with a thorough health check protocol in place, upstream systems are sometimes unreliable. In these cases, per-route service retries can be applied to ensure that transient or intermittent failures do not unnecessarily cause client requests to fail. Learn more about configuring retries in Gloo Edge here.
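
A minimal route-level retry sketch on a VirtualService might look like the following; the domain, matcher, upstream reference, and retry parameters are illustrative.

    apiVersion: gateway.solo.io/v1
    kind: VirtualService
    metadata:
      name: my-app
      namespace: gloo-system
    spec:
      virtualHost:
        domains:
        - '*'
        routes:
        - matchers:
          - prefix: /api
          routeAction:
            single:
              upstream:
                name: my-app
                namespace: gloo-system
          options:
            retries:
              retryOn: 'connect-failure,5xx'   # retry on connection failures and 5xx responses
              numRetries: 3
              perTryTimeout: 2s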

Delegate Effectively

Large-scale deployments always require special attention to multi-tenant, “noisy neighbor” issues. In addition to supporting different styles of big-picture architecture as discussed earlier, Gloo Edge provides lower-level facilities to support easier delegation across teams within a large organization.

Use Route Table Delegation

How can multiple tenants add, remove, and update routing rules without requiring shared access to the root-level Virtual Service?

How can common route configuration components be shared across Virtual Services?

These capabilities are supported by Gloo Edge Route Tables. They allow a complete routing configuration to be assembled from separate configuration objects. The root object delegates responsibility to other objects, forming a tree of config objects. The tree always has a Virtual Service as its root. The root Virtual Service may delegate to any number of Route Tables, which in turn can further delegate to other Route Tables.
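
A minimal delegation sketch might look like the following; the domain, prefixes, and object names are illustrative, and the tenant team owns only its RouteTable.

    # Root object, owned by the platform team
    apiVersion: gateway.solo.io/v1
    kind: VirtualService
    metadata:
      name: root
      namespace: gloo-system
    spec:
      virtualHost:
        domains:
        - 'api.example.com'
        routes:
        - matchers:
          - prefix: /team-a
          delegateAction:            # hand this route subtree to the tenant's RouteTable
            ref:
              name: team-a-routes
              namespace: team-a
    ---
    # Delegated object, owned by the tenant team
    apiVersion: gateway.solo.io/v1
    kind: RouteTable
    metadata:
      name: team-a-routes
      namespace: team-a
    spec:
      routes:
      - matchers:
        - prefix: /team-a/orders
        routeAction:
          single:
            upstream:
              name: orders
              namespace: gloo-system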

Full Route Table documentation is available here.

Use Invalid Route Replacement

The worst kind of “noisy neighbor” is one who pushes bad configuration and breaks adjacent services. The purpose of Gloo Edge invalid route replacement is to prevent that outcome. In other words, invalid configuration related to one service must not be allowed to cause outages in other services.
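
A minimal sketch of this policy on the Gloo Edge Settings resource might look like the following; the response code and body are illustrative, and in practice these values are usually managed through Helm rather than edited directly.

    apiVersion: gloo.solo.io/v1
    kind: Settings
    metadata:
      name: default
      namespace: gloo-system
    spec:
      gloo:
        invalidConfigPolicy:
          replaceInvalidRoutes: true        # serve a fallback response instead of breaking other routes
          invalidRouteResponseCode: 404     # illustrative response code for replaced routes
          invalidRouteResponseBody: 'Gloo Gateway has invalid configuration.'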

The Gloo Edge documentation provides a helpful example of configuring invalid route replacement here.

Summary

This document explored a set of best practices to help ensure success in large-scale Gloo Edge deployments. To summarize:

  • Architect carefully.
    • Consider a tiered gateway architecture.
    • Consider Gloo Mesh for superior multi-tenant support.
    • Plan for multi-cluster operations.
  • Scale predictably.
    • Scale the data plane.
    • Scale the ExtAuth service.
    • Do not scale the control plane as a rule.
    • Disable discovery if not needed.
    • Disable implicit Kubernetes destinations if not needed.
  • Upgrade carefully.
    • Consider zero-downtime gateway rollouts.
    • Consider canary upgrades across product versions.
    • Consider Gloo Mesh for automatic version management.
    • Use canary upgrades for applications.
  • Configure reliably.
    • Consider GitOps with ArgoCD.
    • Consider GitOps with Weaveworks Flux and Flagger.
    • Beware default Gloo Edge configurations.
  • Observe and monitor maniacally.
  • Enhance reliability.
    • Use active health checks.
    • Use passive health checks / outlier detection.
    • Use service retries.
  • Delegate effectively.
    • Use Route Table delegation.
    • Use Invalid Route Replacement.

Learn More

If you’d like to learn more, we recommend staying abreast of these Gloo Edge production deployment guidelines, from which we borrowed liberally in this document. Also check out Envoy’s own best practices document for deploying edge gateway proxies.

Finally, the SoloCon 2022 conference featured presentations directly from a number of large Gloo Edge customers, discussing their successes and lessons learned along the way. Check out the recordings of these experience-driven sessions from T-Mobile, Carfax, Constant Contact, and USAA.

Acknowledgments

Thank you to all the Gloo Edge users who have directly and indirectly contributed to the lessons learned in this post. And a Special Thank You to my colleague Kevin Shelaga for expanding and refining my early ideas that ultimately turned into this article.