Getting Started
This blog post assumes you already have a Kubernetes cluster running and have installed Gloo Edge Enterprise. If you are new to Gloo Edge and want to try this exercise, you can request a 30-day trial license here. We will use a simple example application, the Pet Store, to generate the metrics shown on our dashboard addition. Gloo Edge will detect the REST API that the Pet Store provides and automatically create an upstream resource for it. To deploy the Pet Store app, run the following command:
```sh
kubectl apply -f https://raw.githubusercontent.com/solo-io/gloo/v1.7.5/example/petstore/petstore.yaml
```
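If you want to confirm that discovery did its job, you can list the upstreams in the gloo-system namespace; the one for the Pet Store follows Gloo's namespace-name-port naming convention:

```sh
# Verify that Gloo Edge discovery created an upstream for the Pet Store service.
kubectl get upstreams -n gloo-system | grep petstore
# Expect an entry named default-petstore-8080
```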
Next, create a VirtualService that routes requests for /pets to the Pet Store upstream and rewrites the path to the API's /api/pets endpoint. Save the following as vs-petstore.yaml:

```yaml
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: petstore
  namespace: gloo-system
spec:
  virtualHost:
    domains:
    - '*'
    routes:
    - matchers:
      - exact: /pets
      routeAction:
        single:
          upstream:
            name: default-petstore-8080
            namespace: gloo-system
      options:
        prefixRewrite: /api/pets
```
Apply the VirtualService:

```sh
kubectl apply -f vs-petstore.yaml
```
Finally, verify that the route works by sending a request through the proxy:

```sh
$> curl $(glooctl proxy url)/pets
[{"id":1,"name":"Dog","status":"available"},{"id":2,"name":"Cat","status":"pending"}]
```
Grafana Metrics
Gloo Edge automatically captures some upstream metrics for you, so let’s see what is there by default before modifying anything. First, we’ll need to port forward the Grafana service:
```sh
kubectl -n gloo-system port-forward deployment/glooe-grafana 3000
```
With the port forward in place, open Grafana at http://localhost:3000 and find the dashboard that Gloo Edge generated for the Pet Store upstream. Out of the box it includes panels such as:
- Total active connections
- Total requests
- Upstream Network Traffic
To see some data flowing into these panels, generate a steady stream of traffic against the Pet Store route:

```sh
while true; do curl $(glooctl proxy url)/pets; done
```
Inspecting Envoy Proxy Statistics
Gloo Edge also ensures that Envoy Proxy statistics are being captured, even if they are not readily visible in the default upstream dashboard. Let's take a look at the statistics Envoy exposes in Prometheus format. First, we will create a second port forward, this time to the Envoy admin interface on the gateway proxy:
```sh
kubectl -n gloo-system port-forward deployment/gateway-proxy 19000
```
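With the port forward running, Envoy's admin interface is reachable on localhost:19000. As a quick sketch, one way to pull the raw statistics is the admin interface's /stats/prometheus endpoint, which serves all statistics in Prometheus text format:

```sh
# Fetch Envoy's statistics in Prometheus text format and filter for the
# Pet Store upstream's request-time histogram discussed below.
curl -s http://localhost:19000/stats/prometheus | grep envoy_cluster_external_upstream_rq_time
```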
Developing the Dashboard
To see what the dashboard will show, let's keep the traffic running while we develop it. First, look at the Prometheus stats for a histogram coming from Envoy. If you look at the Envoy statistics for envoy_cluster_external_upstream_rq_time, you will notice that they are labeled with the upstream name, so we can easily find our Pet Store upstream: default-petstore-8080_gloo-system. Next, notice that the counters come in buckets of time, measured in milliseconds. You can also see a sum and a count metric, which make it possible to represent the buckets as a percentage of the total if we wanted to show quantiles.
```
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="0.5"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="1"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="5"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="10"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="25"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="50"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="100"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="250"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="500"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="1000"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="2500"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="5000"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="10000"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="30000"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="60000"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="300000"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="600000"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="1800000"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="3600000"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="+Inf"} 1
envoy_cluster_external_upstream_rq_time_sum{envoy_cluster_name="default-petstore-8080_gloo-system"} 0
envoy_cluster_external_upstream_rq_time_count{envoy_cluster_name="default-petstore-8080_gloo-system"} 1
```
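As an aside, those _bucket, _sum, and _count series are exactly what Prometheus' histogram_quantile() function works with. A sketch of a quantile query (not part of the dashboard we build below; the 0.95 quantile is just an example) might look like:

```promql
# Estimate the 95th-percentile upstream request time (in ms) for the Pet Store
# upstream, using the per-bucket rate grouped by bucket boundary (le).
histogram_quantile(0.95,
  sum(rate(envoy_cluster_external_upstream_rq_time_bucket{
    envoy_cluster_name="default-petstore-8080_gloo-system"
  }[1m])) by (le)
)
```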
In Grafana, we can visualize these buckets as a heatmap. Add a new panel to the upstream dashboard with the following query, which takes the per-second rate of each bucket and groups by the bucket boundary (le):

```promql
sum(rate(envoy_cluster_external_upstream_rq_time_bucket{
  envoy_cluster_name="default-petstore-8080_gloo-system"
}[1m])) by (le)
```
Modifying the Upstream ConfigMap
First, let’s take a look at what we have created. Cool! We have some nice colors, and we can see that traffic is flowing smoothly. Let’s now download the panel’s JSON so we can add it to the ConfigMap: use the drop-down menu on the Request Time Heatmap panel, select “More…”, and then “Panel JSON”. Now we are going to edit the ConfigMap for the Upstream dashboards and add this panel to it:
```sh
kubectl edit -n gloo-system cm gloo-observability-config
```
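If you'd rather not edit the live object directly, you can first save a copy of the existing ConfigMap to work from (the backup filename here is just an example):

```sh
# Save a copy of the current observability ConfigMap before making changes.
kubectl get configmap gloo-observability-config -n gloo-system -o yaml > gloo-observability-config.backup.yaml
```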