How to turn application logs into useful metrics with Gloo Edge and Grafana

Monitoring your applications is a requirement in the cloud native world, but getting access to useful metrics is often difficult. For many teams, metrics become most useful at the end of the development lifecycle, mostly during day-2 operations (post-deployment). At that point, your operations teams can leverage the metrics to monitor your applications and better understand what is happening.

Yet development teams work at the very beginning of the development lifecycle. To end up with good and meaningful metrics, some iterations and feedback are required. In reality, developers care most about features, tests, and… log traces! Developers dream of having good log traces because then they can properly debug their applications.

So, how can you improve this experience?

Turning application logs into useful metrics

Promtail is an agent that ships the contents of local logs to a log aggregator. The magic is that, besides shipping logs, it can also parse them. That allows you to configure transformations that turn log lines into meaningful metrics.

To get the picture, think of a log trace like this one:

[ERROR] Error trying to connect to the endpoint. Connection reset.

You could parse the log traces that contain the expression [ERROR] and create a counter metric out of them, to be displayed in your favorite monitoring stack, such as Prometheus and Grafana.
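
As a teaser, here is a minimal sketch of a Promtail pipeline that counts such lines. The selector, capture group, and metric name are illustrative assumptions; the same pattern is used for real later in this article:

pipelineStages:
  - match:
      # Assumption: your application pods carry the label app="my-app"
      selector: '{app="my-app"}'
      stages:
      - regex:
          # Capture every log line containing the [ERROR] marker
          expression: '.*(?P<app_error>\[ERROR\]).*'
      - metrics:
          app_error_total:
            type: Counter
            description: "Total count of [ERROR] log traces"
            source: app_error
            config:
              action: inc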

A typical developer scenario for logs and metrics

One of our customers came to us with a question. They had configured OPA (Open Policy Agent) as a passthrough auth service, as described in the documentation. Unfortunately, the OPA service was not running when they deployed the Gloo Edge resource (AuthConfig).

The event threw an error in the ExtAuth pod logs:

gloo-system extauth-67b888f686-2s7cr extauth {"level":"error","ts":" [...] ",
"caller":"config/generator.go:114","msg":"Errors encountered while processing new server configuration",
"version":"1.9.7","error":"1 error occurred:\n\t* failed to get auth service for auth config 
with id [gloo-system.passthrough-auth]; this configuration 
will be ignored: failed to create grpc client manager: context deadline exceeded\n\n",
"stacktrace":" [...] "}

The customer was quite interested in having everything covered by metrics, but when they discovered the issue, no metric existed to catch this scenario. Fixing that involved opening a ticket and waiting for the engineering team to work on it. That delay can easily be avoided with the technique you will see in the next section.

Using Gloo Edge to build metrics from logs

The goal of this workshop is to catch log traces from ExtAuth (one of the Gloo Edge components) so that they can be converted into metrics and displayed in a Grafana dashboard. Let’s walk through a common scenario and solution.

Your architecture will look like this: your Admin user is in charge of applying Gloo Edge resources, while your DevOps team is in charge of monitoring the infrastructure. You know that the error log is thrown when you apply an AuthConfig resource. In that resource, you specify the address of the OPA service, which is not reachable.

Set up the environment

For this workshop you need to have:

  • A Kubernetes cluster
  • Helm
  • A Gloo Edge Enterprise license key

Install Gloo Edge with a LICENSE_KEY:

export LICENSE_KEY=<here-your-license>

helm repo add glooe https://storage.googleapis.com/gloo-ee-helm
helm repo update
helm upgrade --install gloo glooe/gloo-ee --namespace gloo-system \
--create-namespace --version 1.10.1 --set-string license_key=$LICENSE_KEY --devel
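
Before moving on, check that the Gloo Edge pods are up. This is a quick sanity check against the release installed above; the extauth deployment is the one whose logs you will inspect later:

kubectl -n gloo-system get pods
kubectl -n gloo-system rollout status deployment/extauth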

First, let’s reproduce the error. Create an application (e.g. httpbin):

kubectl apply -f https://raw.githubusercontent.com/istio/istio/1.12.0/samples/httpbin/httpbin.yaml

And expose it through Gloo Edge: 

kubectl apply -f - <<EOF
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: demo
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - "*"
    routes:
      - matchers:
          - prefix: /
        options:
          extauth:
            configRef:
              name: passthrough-auth
              namespace: gloo-system
        routeAction:
          single:
            upstream:
              name: default-httpbin-80
              namespace: gloo-system
EOF

Create a file for the AuthConfig manifest so that you can then apply it and delete it easily:

cat << EOF > authconfig.yaml
apiVersion: enterprise.gloo.solo.io/v1
kind: AuthConfig
metadata:
  name: passthrough-auth
  namespace: gloo-system
spec:
  configs:
  - passThroughAuth:
      grpc:
        address: ext-authz.default.svc.cluster.local:9000
        connectionTimeout: 30s
EOF

Now, apply it:

kubectl apply -f authconfig.yaml

Now, since the service ext-authz.default.svc.cluster.local:9000 does not exist yet, an error is thrown in the ExtAuth pod.

Let’s verify it:

kubectl logs -l gloo=extauth -n gloo-system | grep "failed to create grpc client manager"

And you will see the following error:

NOTE: If you do not see the error, wait a bit and try again. Propagating the events can take a bit of time.

gloo-system extauth-67b888f686-2s7cr extauth {"level":"error","ts":" [...] ",
"caller":"config/generator.go:114",
"msg":"Errors encountered while processing new server configuration",
"version":"1.9.7","error":"1 error occurred:\n\t* failed to get auth 
service for auth config with id [gloo-system.passthrough-auth]; this configuration 
will be ignored: failed to create grpc client manager: context deadline exceeded\n\n",
"stacktrace":"  [...]  "}

Now, you need to catch this error with Promtail, parse it into a metric, and let Prometheus scrape that metric.

Let’s first delete the AuthConfig that triggered the error, so that you start from a clean scenario:

kubectl delete -f authconfig.yaml

Deploy Prometheus and Grafana. Promtail depends on Prometheus because its chart creates a ServiceMonitor resource, which is why Promtail must be installed afterwards.
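
If the bitnami and grafana chart repositories are not yet configured on your machine, add them first (these are the standard public repository URLs):

helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update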

# Prometheus
helm upgrade --install prometheus -n logging bitnami/kube-prometheus --version 6.6.0 --create-namespace

# Grafana
helm upgrade --install grafana grafana/grafana -n logging --version 6.13.5 --create-namespace -f - <<EOF
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        uid: prometheus
        access: proxy
        url: http://prometheus-kube-prometheus-prometheus.logging.svc.cluster.local:9090
        isDefault: true
        editable: true
adminUser: admin
adminPassword: password
service:
  type: LoadBalancer
sidecar:
  dashboards:
    enabled: true
EOF

Finally, you need to install Promtail. Notice the regex in the configuration, which catches the log traces you are targeting:

# Promtail
helm upgrade --install promtail -n logging grafana/promtail --version 3.8.1 -f - <<EOF
extraArgs:
  - -dry-run # Prevents Promtail from actually shipping logs to Loki; the lokiAddress below is ignored
config:
  logLevel: info
  serverPort: 9080 # To expose the metrics
  lokiAddress: "http://this-will-be-ignored-due-to-dry-run" # To stay focused on the purpose of the article, you do not integrate with Loki
  snippets:
    pipelineStages:
      - match:
          selector: '{app="extauth"}'
          stages:
          - regex:
              # This is the regex
              expression: ".*(?P<passthrough_not_ready>error.*failed to create grpc client manager).*"
          - metrics:
              passthrough_not_ready:
                prefix: 'gloo_'
                type: Gauge
                # type: Counter
                description: "Total count of log traces when a passthrough service is not ready"
                source: passthrough_not_ready
                config:
                  action: inc
serviceMonitor:
  enabled: true
  interval: 10s
  scrapeTimeout: 5s
EOF
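
Before building the dashboard, you can confirm that Promtail is running and that its ServiceMonitor was created. The names below assume the release name promtail and the default labels of the grafana/promtail chart:

kubectl -n logging get pods -l app.kubernetes.io/name=promtail
kubectl -n logging get servicemonitors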

Once you have Grafana ready, access it through your browser:

kubectl -n logging get svc grafana -o jsonpath='{.status.loadBalancer.ingress[0].*}'

You can use these credentials:

Username: admin
Password: password

Create a dashboard with the following specifications:

  • Query:
count(gloo_passthrough_not_ready)

In the panel configuration, the most relevant part is the mapping that makes the panel display “No Error” when the query finds no matching log traces.

You now have the whole integration ready.

Set the dashboard to show the last 5 minutes and you should see the panel.

Testing your configuration

Let’s verify that the dashboard shows an error once it occurs.

Apply the AuthConfig resource:

kubectl apply -f authconfig.yaml

Port-forward to access Promtail’s exposed metrics:

kubectl --namespace logging port-forward daemonset/promtail 9080

And check the metrics in your browser: http://127.0.0.1:9080/metrics

You should be able to see your new metric.
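
Alternatively, you can check it from another terminal with curl (assuming curl is available locally):

curl -s http://127.0.0.1:9080/metrics | grep gloo_passthrough_not_ready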

If you go to your new Grafana dashboard, you should see the error.
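
If the panel stays empty, you can query Prometheus directly to confirm the metric is being scraped. The service name below is the same one used in the Grafana datasource URL earlier:

kubectl -n logging port-forward svc/prometheus-kube-prometheus-prometheus 9090
# Then open http://127.0.0.1:9090/graph and run: count(gloo_passthrough_not_ready)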

Now deploy the Authz service so that no more errors appear in the logs. After the event is propagated, the dashboard turns to “No Error”:

kubectl apply -f https://raw.githubusercontent.com/istio/istio/master/samples/extauthz/ext-authz.yaml
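
You can confirm the sample Authz service came up before checking the dashboard again. This assumes the manifest above lands in the default namespace (so that it matches the address in the AuthConfig) and keeps its app=ext-authz label:

kubectl -n default get pods -l app=ext-authz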

Creating an alert to handle this error is left as an exercise for the reader.

Level up for more advanced metrics use cases

The case above is straightforward. A more complex scenario is to leverage Header Transformations and then enrich the Access Logs with the transformed data. The goal is to get that processed data into the logs; once it is there, you can create metrics out of it.

One example is to perform analytics on the incoming requests to understand how your system is used.

In an e-commerce system:

  • How many times a day is Product A (my-e-commerce.com/products/product-a) visited compared to Product B (my-e-commerce.com/products/product-b)? A sketch of how such a metric could be extracted follows below.
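
Here is a minimal Promtail sketch for that kind of metric. It assumes the gateway access logs come from pods labeled gloo="gateway-proxy" and contain the request path in plain text; both the selector and the regex are assumptions you would adapt to your actual access log format:

pipelineStages:
  - match:
      # Assumption: access logs are shipped from pods labeled gloo="gateway-proxy"
      selector: '{gloo="gateway-proxy"}'
      stages:
      - regex:
          # Assumption: the access log line contains "GET /products/<product-name>"
          expression: '.*GET /products/(?P<product>[a-zA-Z0-9-]+).*'
      - labels:
          # Promote the captured product name to a label, so the metric is split per product
          product:
      - metrics:
          product_visits:
            prefix: 'shop_' # Illustrative prefix
            type: Counter
            description: "Number of requests per product"
            source: product
            config:
              action: inc

With that in place, a query such as sum by (product) (shop_product_visits) lets you compare the traffic of Product A and Product B in Grafana. Keep in mind that promoting request paths to labels increases cardinality, so restrict the regex to the paths you actually care about.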

An analytics example with Istio and an eCommerce website will be published shortly on the official Grafana blog. Stay tuned.

Thanks to Neeraj Poddar for his feedback!