The OpenTelemetry community has jumped quickly to design semantic conventions for generative AI applications.
Most of the work in this effort has focused on covering model and agent trace spans. There is also some coverage for metrics, and (after much back and forth) Events (logs) are evolving as well.
At the same time, many enterprises have had a ticket called “Adopt tracing” sitting in their platform/SRE teams’ backlogs for years, so I am generally happy that this space is advocating for tracing and hope it will increase adoption.
This is quite similar to how this new wave of AI hype is getting more people to pay attention to their egress traffic. Again, this is something everyone should have been paying attention to for a long time now.
Traces are great; they provide an easy-to-navigate structure for following requests through your systems. In the case of GenAI workloads, such as LLMs, agents, and MCP servers, understanding what data has been going through them, why, and how much of it, is crucial.
However, going all-in on tracing while neglecting all the other signals comes with pain points at best, and in the worst case can make adopting GenAI outright impossible in certain enterprises or industries.
Input/output messages and security
Let me start with an example of why having only traces can be a critical blocker to enterprise adoption.
Today, most GenAI instrumentation libraries emit everything (at least by default) as span attributes. This is easy to set up, looks great in local tracing backends such as Tempo or Jaeger, and works well for running demos.
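To make that concrete, here is a minimal sketch (not taken from any particular library) of what this default pattern tends to look like in Python; the attribute keys and the call_model helper are illustrative assumptions, not exact semantic-convention names:

```python
# Minimal sketch of the "everything as span attributes" default.
# Attribute keys are illustrative; exact names vary by instrumentation
# library and semantic-convention version.
from opentelemetry import trace

tracer = trace.get_tracer("genai.demo")

def chat(prompt: str) -> str:
    with tracer.start_as_current_span("chat") as span:
        completion = call_model(prompt)  # hypothetical model call
        # The full payloads end up only in whatever backend stores spans.
        span.set_attribute("gen_ai.prompt", prompt)
        span.set_attribute("gen_ai.completion", completion)
        return completion
```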
However, in enterprise settings, your Security team will very likely want to audit the input/output messages that you and your AI workloads are sharing with each other.
These messages include your prompts and can contain (sometimes proprietary) source code, personally identifiable information (PII), and in many cases, they are quite sizable.
The first problem is that if all these are stored as span attributes, you’ll only have them in your telemetry backend. This can be your self-hosted tracing solution, and/or a 3rd party observability vendor.
Usually, Security teams either do not have access to these systems at all, or, even if they do (mostly when it’s a centralized 3rd party solution that may or may not include their SIEM/logging platform), they cannot properly audit the data or apply their extensive tooling to it.
The solution here is to make it possible to emit these messages as logs, so you can ingest them into your SIEM/logging systems and keep your Security team happy.
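Here is a minimal sketch of that alternative, assuming you route the payloads through Python’s standard logging module and let an OTel log bridge or a Collector pipeline forward them to your SIEM; again, the attribute keys are assumptions rather than the exact semantic-convention names:

```python
# Emit the heavyweight input/output messages as structured log records
# instead of span attributes, so the logging pipeline (and the Security
# team's tooling) can pick them up. Attribute keys are assumptions.
import logging

security_log = logging.getLogger("genai.messages")

def record_exchange(model: str, prompt: str, completion: str) -> None:
    security_log.info(
        "gen_ai exchange",
        extra={
            "gen_ai.request.model": model,
            "gen_ai.prompt": prompt,
            "gen_ai.completion": completion,
        },
    )
```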
Cost
Depending on how many agents, tools, etc., you are interacting with, spans can grow to a significant size.
Last week, I took a look at one of my traces covering 3-4 questions and replies with a single agent, and it contained almost 40 spans, each with extensive attributes.
Sure, having a proper backend and storage in place can make it possible to ingest and store all this information, but while OTel itself doesn’t impose limits here, many observability vendors do cap trace/span/attribute sizes. If you’re self-hosting, this can be an even bigger problem.
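If you do keep payloads on spans, the OTel SDK at least lets you cap how large they get. A minimal sketch in Python follows, where the specific numbers are arbitrary examples; the same limits can also be set via the OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT and OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT environment variables:

```python
# Cap span attribute count and value length so oversized GenAI payloads
# don't blow past backend/vendor ingest limits. Numbers are arbitrary.
from opentelemetry.sdk.trace import TracerProvider, SpanLimits

provider = TracerProvider(
    span_limits=SpanLimits(
        max_attributes=64,          # attributes per span
        max_attribute_length=4096,  # truncate long prompt/completion values
    )
)
```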
Right tool for the job
Depending on your level of operational maturity, you might not have a tracing solution in place to ingest any traces at all (remember my note on this being a long-standing backlog item in many places). If this is you, then all these exciting new GenAI capabilities might be the last push to finally adopt traces. But they also might not be.
If you cannot afford to roll out a tracing solution, you might still want some visibility into what is happening between your AI workloads. Knowing which workloads talk to which other workloads and how fast, and how many tokens these conversations are burning, is the bare minimum you want to be able to answer in a production setting.
Sure, tracing can help with these, but using metrics for standard metrics use cases makes perfect sense in terms of both operational and $$$ cost.
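As a sketch of how little is needed for that bare minimum, here is a token-usage histogram recorded via the OTel metrics API; the instrument name mirrors the GenAI semantic conventions’ token-usage metric, but double-check the current conventions for the exact name and attributes:

```python
# Record token usage as a histogram so you can answer "how many tokens are
# these conversations burning?" without a tracing backend. Names and
# attribute keys loosely follow the GenAI semconv and are assumptions here.
from opentelemetry import metrics

meter = metrics.get_meter("genai.workload")

token_usage = meter.create_histogram(
    name="gen_ai.client.token.usage",
    unit="{token}",
    description="Tokens used per model call",
)

def record_usage(model: str, token_type: str, tokens: int) -> None:
    # token_type is "input" or "output"
    token_usage.record(
        tokens,
        attributes={
            "gen_ai.request.model": model,
            "gen_ai.token.type": token_type,
        },
    )
```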
UI/UX
This last point is not an argument against using traces instead of metrics/logs. It’s about highlighting that the requirements (at least in terms of usability) for GenAI visualisation tools (which are, at the moment, tracing tools in a new skin) are a bit different from how tracing visualisation tools are built today. Just take a look at Zipkin, Jaeger, and Tempo and see how little innovation there has been in this space in the last ~10 years.
When you are observing GenAI flows, some of the information displayed in, e.g., Jaeger is just noise. Other things you simply scroll past. Others are hidden behind arrays of span attributes, and if you drill down into all of them, understanding the whole flow becomes a pain.
What’s next?
What I am seeing and hearing from others trying to adopt GenAI is that while you can get started with some tooling, there are many gaps to be filled before enterprise adoption can take off.
Will this trend increase tracing adoption? Will we see a new generation of tracing UIs? Will more people move towards a data lakehouse architecture for all their telemetry? Or will we have entirely new ways to solve these use cases?