How to get Istio traceability with Grafana Tempo and Grafana Loki

(This content was originally published on the Grafana blog.)

“How am I supposed to debug this?”

Just imagine: Late Friday, you are about to shut down your laptop and an issue comes up. Warnings, alerts, red colors. Everything that we, developers, hate the most. The architect decided to develop that system based on microservices. Hundreds of them! You, as a developer, think… why? Why does the architect hate me so much? And then, the sentence: “How am I supposed to debug this?” Of course, we all understand the benefits of a microservice architecture. But we also hate the downsides. One of those is the process of debugging or running a postmortem analysis across hundreds of services. It is tedious and frustrating. Here is an example:

“Where do I start?,” you think. And you select the obvious candidate to start with the analysis: app1. And you read the logs shrinking by time period:

kubectl logs -l app=app1 --since=3h […]

“Nothing. Everything looks normal,” you say. “Maybe, the problem came from this other related service.” And again:

kubectl logs -l app=app2 --since=3h […]

“Aha! I see something weird here. This app2 has run a request to this other application, app3. Let’s see.” And yet again:

kubectl logs -l app=app3 --since=3h […]

Debugging takes ages. There is a lot of frustration until, finally, one manages to figure things out. As you can see, the process is slow. Quite inefficient. Time ago, we did not have logging and traceability in place. Today, the story can be different. Having Grafana Loki, Grafana Tempo and other small tools, we can debug things almost instantly.

Traceability: the feature you need

In order to be able to debug quickly, you need to mark the request with a unique ID. The mark is called Trace ID. And all the elements involved in the request, add another unique ID called span. At the end of the road, you are able to filter out the exact set of traces involved in the request which produced the issue.

And not only that. You can also draw all this in a visualization to easily understand the pieces which compose your system.

How to do this with the Grafana stack

You are going to do a task simulating a microservices system with a service mesh: Istio. You will say: “Istio offers observability through Kiali.” For observability, Istio relies on Kiali. However, in the world of microservices, we should always show that there are alternatives that could fit the requirements, at least as good as the default ones.

Tempo and Loki are good examples of well-done logging and tracing backends. Regardless the performance comparisons, good points to consider are following ones:

Tempo and Loki, both, integrate with S3 buckets to store the data. This relieves you from maintaining and indexing storage that, depending on your requirements, might not be needed.
Tempo and Loki are part of grafana. Therefore, it integrates seamlessly with Grafana dashboards. (Well, being honest here: We all love and use grafana dashboards)

Now, let’s see how you can speed up the debugging process with Istio and the Grafana stack.

This will be your architecture:

Hands on!

Pre-requisites

Kuberentes cluster
Istioctl. The task was developed with v.1.10 (https://github.com/istio/istio/releases)
Helm (https://helm.sh/)

Prepare Istio

You need to have Istio up and running. Let’s install the istio operator:

istioctl operator init

Now, let’s instantiate the service mesh. Istio proxies include a traceID in the `x-b3-traceid`. Notice that you will set the access logs to inject that trace ID as part of the log message:

kubectl apply -f - << 'EOF'
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
name: istio-operator
namespace: istio-system
spec:
profile: default
meshConfig:
accessLogFile: /dev/stdout
accessLogFormat: |
[%START_TIME%] "%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%" %RESPONSE_CODE% %RESPONSE_FLAGS% %RESPONSE_CODE_DETAILS% %CONNECTION_TERMINATION_DETAILS% "%UPSTREAM_TRANSPORT_FAILURE_REASON%" %BYTES_RECEIVED% %BYTES_SENT% %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% "%REQ(X-FORWARDED-FOR)%" "%REQ(USER-AGENT)%" "%REQ(X-REQUEST-ID)%" "%REQ(:AUTHORITY)%" "%UPSTREAM_HOST%" %UPSTREAM_CLUSTER% %UPSTREAM_LOCAL_ADDRESS% %DOWNSTREAM_LOCAL_ADDRESS% %DOWNSTREAM_REMOTE_ADDRESS% %REQUESTED_SERVER_NAME% %ROUTE_NAME% traceID=%REQ(x-b3-traceid)%
enableTracing: true
defaultConfig:
tracing:
sampling: 100
max_path_tag_length: 99999
zipkin:
address: otel-collector.tracing.svc:9411
EOF

Install the demo application

Let’s create the namespace and label it to auto-inject the istio proxy:

kubectl create ns bookinfo
kubectl label namespace bookinfo istio-injection=enabled --overwrite

And now the demo application bookinfo:

kubectl apply -n bookinfo -f https://raw.githubusercontent.com/istio/istio/release-1.10/samples/bookinfo/platform/kube/bookinfo.yaml

To access the application through Istio, you need to configure it. It is required a Gateway and a VirtualService:

kubectl apply -f - << 'EOF'
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
name: bookinfo-gateway
namespace: bookinfo
spec:
selector:
istio: ingressgateway # use istio default controller
servers:
- port:
number: 80
name: http
protocol: HTTP
hosts:
- "*" # Mind the hosts. This matches all
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: bookinfo
namespace: bookinfo
spec:
hosts:
- "*"
gateways:
- tracing/bookinfo-gateway
- bookinfo-gateway
http:
- match:
- uri:
prefix: "/"
route:
- destination:
host: productpage.bookinfo.svc.cluster.local
port:
number: 9080
EOF

To access the application, let’s open a tunnel to the Istio ingress gateway (the entry point to the mesh):

kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

Now, you can access the application through the browser: http://localhost:8080/productpage

Install grafana stack

Now, let’s create the grafana components. Let’s start with Tempo, the tracing backend we have mentioned before:

kubectl create ns tracing
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install tempo grafana/tempo --version 0.7.4 -n tracing -f - << 'EOF'
tempo:
extraArgs:
"distributor.log-received-traces": true
receivers:
zipkin:
otlp:
protocols:
http:
grpc:
EOF

Next component. Let’s create a simple deployment of Loki:

helm install loki grafana/loki-stack --version 2.4.1 -n tracing -f - << 'EOF'
fluent-bit:
enabled: false
promtail:
enabled: false
prometheus:
enabled: true
alertmanager:
persistentVolume:
enabled: false
server:
persistentVolume:
enabled: false
EOF

Now, let’s deploy Opentelemetry Collector. You use this component to distribute the traces across your infrastructure:

kubectl apply -n tracing -f https://raw.githubusercontent.com/antonioberben/examples/master/opentelemetry-collector/otel.yaml
kubectl apply -n tracing -f - << 'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
name: otel-collector-conf
labels:
app: opentelemetry
component: otel-collector-conf
data:
otel-collector-config: |
receivers:
zipkin:
endpoint: 0.0.0.0:9411
exporters:
otlp:
endpoint: tempo.tracing.svc.cluster.local:55680
insecure: true
service:
pipelines:
traces:
receivers: [zipkin]
exporters: [otlp]
EOF

Following component is fluent-bit. You will use this component to scrap the log traces from your cluster. Note: In the configuration you are specifying to take only containers which match following pattern /var/log/containers/*istio-proxy*.log

helm repo add fluent https://fluent.github.io/helm-charts
helm repo update
helm install fluent-bit fluent/fluent-bit --version 0.16.1 -n tracing -f - << 'EOF'
logLevel: trace
config:
service: |
[SERVICE]
Flush 1
Daemon Off
Log_Level trace
Parsers_File custom_parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port {{ .Values.service.port }}
inputs: |
[INPUT]
Name tail
Path /var/log/containers/*istio-proxy*.log
Parser cri
Tag kube.*
Mem_Buf_Limit 5MB
outputs: |
[OUTPUT]
name loki
match *
host loki.tracing.svc
port 3100
tenant_id ""
labels job=fluentbit
label_keys $trace_id
auto_kubernetes_labels on
customParsers: |
[PARSER]
Name cri
Format regex
Regex ^(?[^ ]+) (?stdout|stderr) (?[^ ]*) (?.*)$
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L%z
EOF

Now, the grafana query. This component is already configured to connect to Loki and Tempo:

helm install grafana grafana/grafana -n tracing --version 6.13.5 -f - << 'EOF'
datasources:
datasources.yaml:
apiVersion: 1
datasources:
- name: Tempo
type: tempo
access: browser
orgId: 1
uid: tempo
url: http://tempo.tracing.svc:3100
isDefault: true
editable: true
- name: Loki
type: loki
access: browser
orgId: 1
uid: loki
url: http://loki.tracing.svc:3100
isDefault: false
editable: true
jsonData:
derivedFields:
- datasourceName: Tempo
matcherRegex: "traceID=(\\w+)"
name: TraceID
url: "$${__value.raw}"
datasourceUid: tempo
env:
JAEGER_AGENT_PORT: 6831
adminUser: admin
adminPassword: password
service:
type: LoadBalancer
EOF

Test it

After the installations are completed, let’s open a tunnel to grafana query forwarding the port:

kubectl port-forward svc/grafana -n tracing 8081:80

Access it using the credentials you have configured when you have installed it:

user: admin
password: password

http://localhost:8081/explore

You are prompted to the Explore tab. There, you can select Loki to be displayed on one side and, after click on split, to choose Tempo to be displayed in the other side:

You will see something like this:

Finally, let’s create some traffic with the tunnel we already created to bookinfo application:

http://localhost:8080/productpage

Refresh (hard refresh to avoid cache) the page several times until you can see traces coming into Loki. Remember to add the filter to `ProductPage` to see its access log traces:

Click on the log and a Tempo button is shown:

Immediately, the TraceID will be passed to the Tempo dashboard displaying the traceability and the diagram:

Final thoughts

Having a diagram which displays all elements involved in a request through a microservices increases the speed to find bugs or to understand what happened in your system when running a postmortem analysis. Reducing that time, you increase efficiency so that your developers can keep working and producing more business requirements. In my personal opinion, that is the key point: Increase the business productivity. Traceability and, in this case, grafana stack helps you to accomplish that.

Now, let’s make it production ready.

How to get Istio traceability with Grafana Tempo and Grafana Loki

Traceability: the feature you need

How to do this with the Grafana stack

Hands on!

Pre-requisites

Prepare Istio

Install the demo application

Install grafana stack

Test it

Final thoughts

Featured content

Part Two: MCP Authorization The Hard Way

Part One: MCP Authorization The Hard Way

Agent Identity and Access Management - Can SPIFFE Work?

Deep Dive into llm-d and Distributed Inference

Gloo Mesh 2.8 simplifies service mesh operations with new enhanced user experience across multi-cluster environments.

Gloo Gateway 1.19 accelerates context-rich, real-time AI apps with Gateway API

llm-d: Distributed Inference Serving on Kubernetes

AI Reliability Engineering For More Dependable Humans

Kubernetes Identity the Right Way with SPIRE and Ambient

Optimizing GenAI in Production: High-Value Use Cases for AI Gateways

Solo.io Recognized as a Visionary in the 2024 Gartner® Magic Quadrant™ for API Management for the SECOND year in a row.

Guardians of the Governance: GenAI Gateway Guidance with GitOps and Gloo

Istio Ambient Waypoint Proxy explained

Hands-On with the Kubernetes Gateway API and Envoy Proxy: A Tutorial with GitOps and Gloo Gateway

Istio and the State of DevOps: Enhancing Key Metrics

What is an AI Gateway and its role in AI Applications?

Best practices for secure Istio deployment with Gloo Mesh Core

Gloo Mesh 2.6: Istio's Ambient mode now ready for production

HTTP Observability Without Compromises

Advance your knowledge of service mesh tech with Solo.io Academy certifications

Service Mesh for the developer workflow, a series

Challenges of adopting service mesh in enterprise organizations

Service Mesh in the Real World #2 — Ingress Traffic Control

Service Mesh in the Real World Video Series – Episode # 1: Egress Traffic

Service Mesh the easy way with AWS App Mesh and SuperGloo

Webinar Recap: Intro to Service Mesh Hub and SMI

D-TECK Uses Solo.io Gloo Gateway and Google Cloud to Help Businesses Make Better HR Decisions

Minimize the blast radius of changes with Solo.io Gloo Gateway and Weaveworks Flagger

Announcing Service Mesh Interface (SMI) Support and Collaboration

Service Mesh Interface (SMI) and our Vision for the Community and Ecosystem

The need for a standard, service mesh API

SuperGloo to the Rescue! Making it easier to write extensions for Service Mesh

Introducing The Service Mesh Hub -everything you need for your service mesh

Kubernetes Ingress Past, Present, and Future

Solo.io Streamlines Service Mesh and Serverless Adoption for Enterprises in Google Cloud

ParkMobile

Vonage

Domino’s Pizza

Gloo Mesh Feature Comparison

Service Mesh for Developers, Part 1: Exploring the Power of Observability and OpenTelemetry

Service Mesh at Scale

Compare Capabilities of the Top Service Mesh Platforms

Compare Capabilities of the Top API Gateways

Establishing zero trust security for modern cloud architectures

Unlocking the Power of Your API Gateway

API Gateways: Productivity, Resilience, and Security for Next-Generation Cloud Applications

Driving Business Value with Istio

Service Mesh Vendor Comparison

Istio Then & Now

4 Reasons Why You Need an AI Gateway

Gloo Gateway vs. Kong

Gloo Gateway vs. Apigee

3 Reasons You Need an API Gateway for Microservices Apps

Solo Academy Course: Service Mesh Basics

Solo Academy Course: Istio Basics

Solo Academy Course: Envoy Basics

Solo Academy Course: API Gateway Basics

Solo Academy Course: Get Started with Istio Service Mesh

Solo Academy Course: Introduction to Envoy Proxy

Solo Academy Course: Deploying Istio for Production

Kgateway Lab: Integrating kgateway with Istio at Ingress

Kgateway Lab: Kgateway as a Waypoint

Kgateway AI Lab: Consumption Reporting

Kgateway AI Lab: Deploying kgateway as an AI Gateway

Kagent Lab: How to build an AI agent

Kagent Lab: Integrate tools from MCP servers with kagent

Gloo AI Gateway Hands-On Lab: Semantic Caching

Kgateway AI Lab: Credentials Management

Kgateway AI Lab: Prompt Enrichment