Custom Grafana Dashboards for Envoy Proxy Metrics

Any DevOps initiative requires providing observability and aggregated metrics for operations teams. Visualizations help you spot outliers at a glance so you can find where your network, platform, or application may be experiencing unexpected issues. Here at Solo, we enable our customers to visualize performance metrics that provide insight and highlight problems as they arise so they can be corrected. Gloo Edge uses Envoy proxy as the API gateway for the application data plane, and Envoy exposes a wealth of metrics that we can leverage. In this post, we will show you how to add custom metrics to the Grafana dashboards that Gloo Edge automatically generates for every application you deploy and run on Kubernetes.

Getting Started

This blog post assumes you already have a Kubernetes cluster running and have installed Gloo Edge Enterprise. If you are new to Gloo Edge and want to try this exercise, you can request a 30-day trial license here.

We will use a simple example application, the Pet Store, to show metrics on our dashboard addition. Gloo Edge will detect the REST API that Pet Store provides and automatically create an upstream resource for it. To deploy the Pet Store app, run the following command:

kubectl apply -f https://raw.githubusercontent.com/solo-io/gloo/v1.7.5/example/petstore/petstore.yaml


Wait until the pods in namespace default are ready; use kubectl get pods -n default to check the status. Now, we can add a route to Gloo Edge by creating a basic VirtualService. Save the following manifest as vs-petstore.yaml:

apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: petstore
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - '*'
    routes:
      - matchers:
          - exact: /pets
        routeAction:
          single:
            upstream:
              name: default-petstore-8080
              namespace: gloo-system
        options:
          prefixRewrite: /api/pets


This is a simple route that exposes the API through Gloo Edge. The only feature we have enabled is a prefix rewrite, so we can expose the API at the desired URI, /pets, which we will use throughout the workflow. Later, we will explore how to configure other options on this route. We can apply this route to the cluster with the following command:

kubectl apply -f vs-petstore.yaml
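Before testing, you can do a quick sanity check that Gloo Edge accepted the route and discovered the Pet Store upstream. This assumes glooctl is on your path (we use it again below):

glooctl get virtualservice petstore
glooctl get upstream default-petstore-8080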


Finally, let’s test the application. We will use the nested command $(glooctl proxy url) to determine the HTTP address of the external gateway proxy service and issue a curl request to our route:

$> curl $(glooctl proxy url)/pets
[{"id":1,"name":"Dog","status":"available"},{"id":2,"name":"Cat","status":"pending"}]


The API has been exposed by Gloo Edge. Let’s start exploring the upstream metrics exposed in Grafana.

Grafana Metrics

Gloo Edge automatically captures some upstream metrics for you, so let’s see what is there by default before modifying anything. First, we’ll need to port forward the Grafana service:

kubectl -n gloo-system port-forward deployment/glooe-grafana 3000


Pull up the Grafana dashboard at http://localhost:3000. If this is your first time logging in, you can use the default credentials admin/admin.

Select the Home menu from the top of the page. Gloo Edge adds the dynamic upstream dashboards into the General section. There, you should be able to find default-petstore-8080_gloo-system.

[Image: Gloo Edge dynamic upstream dashboards]

You can also see that there are some default statistics being captured in the dashboard:

  • Total Active Connections
  • Total Requests
  • Upstream Network Traffic

[Image: Default network statistics in Grafana]

Let’s go back to a terminal window and send some traffic into the application:

# resolve the proxy address once rather than on every request
URL=$(glooctl proxy url)
while true; do
  curl -s "$URL/pets"
done


It will take roughly 30 seconds (the default refresh interval) for the statistics to update, but then you should start seeing an uptick in the graphs.

[Image: Grafana traffic dashboard]

Inspecting Envoy Proxy Statistics

Gloo Edge also ensures that Envoy proxy statistics are captured even if they are not readily visible in the default upstream dashboard. Let’s take a look at the statistics Envoy exposes in Prometheus format. First, we will create a second port forward, this time to the Envoy admin interface on the gateway proxy:

kubectl -n gloo-system port-forward deployment/gateway-proxy 19000


Open a browser window and navigate to http://localhost:19000/stats/prometheus. It should look like this:

[Image: Prometheus-format Envoy proxy metrics]

Envoy provides quite a bit of data here, as you can see. Let’s pick a metric that makes sense to show on our dashboard. Search for the envoy_cluster_external_upstream_rq_time histogram metric on this page. You should see several series labeled with the default-petstore-8080_gloo-system upstream. Let’s use this metric to showcase adding a panel to the upstream dashboard.
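If you prefer the terminal to scrolling through the browser, you can filter the same stats with curl through the port forward we just created:

curl -s http://localhost:19000/stats/prometheus | grep envoy_cluster_external_upstream_rq_time | grep default-petstore-8080_gloo-system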

Developing the Dashboard

To see what the dashboard will show, let’s keep the traffic running while we develop it. First, look at the Prometheus stats for a histogram coming from Envoy. The envoy_cluster_external_upstream_rq_time series are labeled with the upstream name, so we can easily find our Pet Store upstream, default-petstore-8080_gloo-system. Next, notice that the histogram is reported as cumulative buckets of request time, measured in milliseconds. There are also _sum and _count series, which make it possible to represent each bucket as a percentage of the total, or to estimate quantiles.

envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="0.5"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="1"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="5"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="10"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="25"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="50"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="100"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="250"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="500"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="1000"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="2500"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="5000"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="10000"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="30000"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="60000"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="300000"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="600000"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="1800000"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="3600000"} 1
envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="default-petstore-8080_gloo-system", le="+Inf"} 1
envoy_cluster_external_upstream_rq_time_sum{envoy_cluster_name="default-petstore-8080_gloo-system"} 0
envoy_cluster_external_upstream_rq_time_count{envoy_cluster_name="default-petstore-8080_gloo-system"} 1
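As an aside, the _bucket, _sum, and _count series are enough to estimate latency quantiles as well. For example, a 95th-percentile request time for our upstream could be computed with a PromQL query like this (a sketch, separate from the heatmap panel we build next):

histogram_quantile(0.95, sum(rate(envoy_cluster_external_upstream_rq_time_bucket{ envoy_cluster_name="default-petstore-8080_gloo-system" }[5m])) by (le))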

Now, armed with our metric, let’s create a panel and a visualization. Pick the Heatmap visualization and set the Y-axis unit to milliseconds (ms). Make sure the Data format is set to Time series buckets. For the display, toggle on Show legend, Hide zero, and Show tooltip.

For the query, we want to distribute the buckets over time so we can see when requests are taking longer. We can use the following query against the gloo data source:

sum(rate(envoy_cluster_external_upstream_rq_time_bucket{ envoy_cluster_name="default-petstore-8080_gloo-system" }[1m])) by (le)
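Here, rate() converts each cumulative bucket counter into a per-second rate, and summing by (le) collapses Envoy’s per-pod series into one series per bucket boundary, which is exactly the shape the Time series buckets data format expects. If you ever wanted a single panel spanning every upstream instead, a variation that keeps the cluster label would look like this (a sketch; the per-upstream dashboard template below does not need it):

sum(rate(envoy_cluster_external_upstream_rq_time_bucket[1m])) by (le, envoy_cluster_name)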


Finally, set the Legend field to {{ le }} and make sure the Format field is set to Heatmap. It may take a minute for the visualization to reflect our changes; once it does, we can move on to the next step.

Modifying the Upstream ConfigMap

First, let’s take a look at what we have created:

[Image: Request Time Heatmap panel in Grafana]

Cool! We have some nice colors too, and we can see that traffic is running smoothly. Let’s now grab the panel’s JSON so we can add it to the ConfigMap. Use the drop-down menu on the Request Time Heatmap panel, select More…, and then Panel JSON.

[Image: Panel JSON for the heatmap]

Now we are going to edit the ConfigMap for the Upstream and add this panel to it:

kubectl edit -n gloo-system cm gloo-observability-config


Find the panels array within the DASHBOARD_JSON_TEMPLATE and add this panel as the first element of the array. One line of the JSON then needs fixing: our Prometheus query expression. Because this is a templated ConfigMap, the upstream name is derived from {{.EnvoyClusterName}}, so find default-petstore-8080_gloo-system in the expression and replace it with the template variable.
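To make that concrete, here is roughly how the one line changes inside the pasted panel JSON (abridged; the exported panel contains many more fields):

Before: "expr": "sum(rate(envoy_cluster_external_upstream_rq_time_bucket{ envoy_cluster_name=\"default-petstore-8080_gloo-system\" }[1m])) by (le)"
After:  "expr": "sum(rate(envoy_cluster_external_upstream_rq_time_bucket{ envoy_cluster_name=\"{{.EnvoyClusterName}}\" }[1m])) by (le)"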

With this ConfigMap change in place, any new upstream that gets added will automatically show the new Heatmap panel.

In this post, we have seen how quickly observability can be extended with Envoy proxy, Prometheus, and Grafana to more easily see where the performance bottlenecks are in your systems. You can automatically add this functionality to any new applications you deploy with Gloo Edge Enterprise. Please reach out to us and let us know how we can help on your observability journey.

Additional Resources

Dynamically Generated Dashboard in Gloo Edge Enterprise

Prometheus Stats for Gloo Edge Enterprise

Histograms and Heatmaps in Grafana