The OpenTelemetry community has jumped quickly to design semantic conventions for generative AI applications.
Most of the work in this effort has focused on covering model and agent trace spans. There is also some coverage for metrics, and (after much back and forth) Events (logs) are evolving as well.
At the same time, many enterprises have had a ticket called “Adopt tracing” sitting in their platform/SRE teams’ backlogs for years, so I am generally happy that this space is advocating for tracing and hope it will increase adoption.
This is quite similar to how this new wave of AI hype is getting more people to pay attention to their egress traffic. Again, this is something everyone should have been paying attention to for a long time now.
Traces are great; they provide an easy-to-navigate structure for following requests through your systems. In the case of GenAI workloads, such as LLMs, agents, and MCP servers, understanding what data has been going through them, why, and how much of it, is crucial.
However, going all-in on tracing while neglecting all the other signals comes with pain points at best, and in the worst case can make adopting GenAI outright impossible in certain enterprises or industries.
Input/output messages and security
Let me start with an example of why having only traces can be a critical blocker to enterprise adoption.
Today, most GenAI instrumentation libraries emit everything (at least by default) as span attributes. This is easy to set up, looks great in local tracing backends such as Tempo or Jaeger, and works well for running demos.
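To make that concrete, here is a minimal sketch (not taken from any particular library) of what this default pattern tends to look like in Python; the attribute keys and the call_model helper are illustrative assumptions, not exact semantic-convention names:

```python
# Minimal sketch of the "everything as span attributes" default.
# Attribute keys are illustrative; exact names vary by instrumentation
# library and semantic-convention version.
from opentelemetry import trace

tracer = trace.get_tracer("genai.demo")

def chat(prompt: str) -> str:
    with tracer.start_as_current_span("chat") as span:
        completion = call_model(prompt)  # hypothetical model call
        # The full payloads end up only in whatever backend stores spans.
        span.set_attribute("gen_ai.prompt", prompt)
        span.set_attribute("gen_ai.completion", completion)
        return completion
```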
However, in enterprise settings, your Security team will very likely want to audit the input/output messages that you and your AI workloads are sharing with each other.
These messages include your prompts and can contain (sometimes proprietary) source code, personally identifiable information (PII), and in many cases, they are quite sizable.
The first problem is that if all these are stored as span attributes, you’ll only have them in your telemetry backend. This can be your self-hosted tracing solution, and/or a 3rd party observability vendor.
Usually, Security teams either do not have access to these systems at all, or, even if they do (mostly when it’s a centralized 3rd party solution that may or may not include their SIEM/logging platform), they cannot properly audit the data or apply their extensive tooling to it.
The solution here is to make it possible to emit these messages as logs, so you can ingest them into your SIEM/logging systems and keep your Security team happy.
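Here is a minimal sketch of that alternative, assuming you route the payloads through Python’s standard logging module and let an OTel log bridge or a Collector pipeline forward them to your SIEM; again, the attribute keys are assumptions rather than the exact semantic-convention names:

```python
# Emit the heavyweight input/output messages as structured log records
# instead of span attributes, so the logging pipeline (and the Security
# team's tooling) can pick them up. Attribute keys are assumptions.
import logging

security_log = logging.getLogger("genai.messages")

def record_exchange(model: str, prompt: str, completion: str) -> None:
    security_log.info(
        "gen_ai exchange",
        extra={
            "gen_ai.request.model": model,
            "gen_ai.prompt": prompt,
            "gen_ai.completion": completion,
        },
    )
```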
Cost
Depending on how many agents, tools, etc., you are interacting with, spans can grow to a significant size.
Last week, I took a look at one of my traces covering 3-4 questions and replies with a single agent, and it contained almost 40 spans, each with extensive attributes.
Sure, having a proper backend and storage in place can make it possible to ingest and store all this information, but while OTel itself doesn’t impose limits here, many observability vendors do cap trace/span/attribute sizes. If you’re self-hosting, this can be an even bigger problem.
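If you do keep payloads on spans, the OTel SDK at least lets you cap how large they get. A minimal sketch in Python follows, where the specific numbers are arbitrary examples; the same limits can also be set via the OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT and OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT environment variables:

```python
# Cap span attribute count and value length so oversized GenAI payloads
# don't blow past backend/vendor ingest limits. Numbers are arbitrary.
from opentelemetry.sdk.trace import TracerProvider, SpanLimits

provider = TracerProvider(
    span_limits=SpanLimits(
        max_attributes=64,          # attributes per span
        max_attribute_length=4096,  # truncate long prompt/completion values
    )
)
```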
Right tool for the job
Depending on your level of operational maturity, you might not have a tracing solution in place to ingest any traces at all (remember my note on this being a long-standing backlog item in many places). If this is you, then all these exciting new GenAI capabilities might be the last push to finally adopt traces. But they also might not be.
If you cannot afford to roll out a tracing solution, you might still want some visibility into what is happening between your AI workloads. Knowing which workloads talk to which other workloads and how fast, and how many tokens these conversations are burning, is the bare minimum you want to be able to answer in a production setting.
Sure, tracing can help with these, but using metrics for standard metrics use cases makes perfect sense in terms of both operational and $$$ cost.
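As a sketch of how little is needed for that bare minimum, here is a token-usage histogram recorded via the OTel metrics API; the instrument name mirrors the GenAI semantic conventions’ token-usage metric, but double-check the current conventions for the exact name and attributes:

```python
# Record token usage as a histogram so you can answer "how many tokens are
# these conversations burning?" without a tracing backend. Names and
# attribute keys loosely follow the GenAI semconv and are assumptions here.
from opentelemetry import metrics

meter = metrics.get_meter("genai.workload")

token_usage = meter.create_histogram(
    name="gen_ai.client.token.usage",
    unit="{token}",
    description="Tokens used per model call",
)

def record_usage(model: str, token_type: str, tokens: int) -> None:
    # token_type is "input" or "output"
    token_usage.record(
        tokens,
        attributes={
            "gen_ai.request.model": model,
            "gen_ai.token.type": token_type,
        },
    )
```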
UI/UX
This last point is not an argument against using traces instead of metrics/logs. It’s about highlighting that the requirements (at least in terms of usability) for GenAI visualisation tools (which are, at the moment, tracing tools in a new skin) are a bit different from how tracing visualisation tools are built today. Just take a look at Zipkin, Jaeger, and Tempo and see how little innovation there has been in this space in the last ~10 years.
When you are observing GenAI flows, some of the information displayed in, e.g., Jaeger is just noise. Other things you simply scroll past. Others are hidden behind arrays of span attributes, and if you drill down into all of them, understanding the whole flow becomes a pain.
What’s next?
What I am seeing and hearing from others trying to adopt GenAI is that while you can get started with some tooling, there are many gaps to be filled before enterprise adoption can take off.
Will this trend increase tracing adoption? Will we see a new generation of tracing UIs? Will more people move towards a data lakehouse architecture for all their telemetry? Or will we have entirely new ways to solve these use cases?