
How OpenTelemetry Drives Gloo Platform’s Graph

Observability is one of the key problems that service meshes like Istio make much easier to solve.

If you are using Istio, all of your applications will produce standardized metrics — regardless of the technology in use — to meet the requirements of the RED (rate, errors, duration) method.

These metrics are essential for operating business-critical workloads in production, but they can also lead to other challenges, such as storage and collection at scale, cardinality explosion, and operating actual Prometheus instances to make these metrics queryable.

Since Gloo Platform can be an orchestrator of one or more service meshes, having a scalable telemetry pipeline is crucial to address these challenges.

The platform also has a Graph, where you can understand how your applications depend on each other, and quickly identify performance degradations as well.

Graph

Challenges

Initially, our Platform wasn’t driven by OpenTelemetry.

Take a look at the architecture diagrams to understand where we were coming from.

Architecture diagrams

As you can see, originally our Gloo Agent was responsible for both collecting metrics in the workload clusters and forwarding these to the management cluster.

This approach had two main issues.

Issue #1: Too Many Responsibilities

Gloo Agents were already overworked, and had two jobs besides collecting the metrics:

  • Resource discovery and sending this information to the management cluster
  • Applying resources in the workload clusters

These were already full-time jobs by themselves, so putting more responsibilities on the Agents led to scalability issues in the pipeline.

Issue #2: Lack of Control

The second issue was not having the ability to transform, filter, and integrate this telemetry data.

The Agents didn’t know how to create new labels or how to filter and drop metrics, and pushing the data to multiple locations (long-term storage, SaaS observability tools, etc.) was troublesome.

Originally, all the scraped metrics were shipped to the management cluster, where a Prometheus instance scraped the management server, the component that exposed all of these metrics.

Adding and removing scrape targets was not an easy task (forget about Prometheus-style scrape configs), and since everything (we are talking about tens, or even hundreds, of Kubernetes clusters with Istio on top of them) was pushed to and exposed from a single destination, Prometheus often struggled to perform well.

It was clear that the architecture needed to be revisited to solve these limitations.

Why OpenTelemetry, and what does the new pipeline look like?

After investigating the aforementioned issues, we realized that they could only be resolved by having a dedicated component for the observability tasks.

Leveraging something like Thanos could also have been an option, but it’s always better to keep things simple and focus on the core challenges of the business. Storing metrics for years is not the kind of business Solo.io is in.

We could either offload these tasks from the Agents and build a new telemetry component from scratch, or leverage an existing tool, if such a thing existed.

Fortunately, there’s one that’s built for this exact purpose, and it’s called OpenTelemetry (OTel, from now on) Collector.

With OTel in place, this is what our new pipeline looks like:

New OTel pipeline

How does it work?

Default pipeline

We have a default pipeline with a single purpose: collect all the relevant metrics for our Graph in the workload clusters and ship them to the management cluster.

Then, in the management cluster our Prometheus can scrape these metrics, making them available for our UI.

This is done by running the OTel collectors as a DaemonSet on the nodes of the workload clusters. These collectors scrape all the interesting metric targets, including Istio-injected workloads, istiod, Cilium components, and the collector itself.
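
To make this a bit more concrete, here is a minimal sketch of what such a scrape configuration could look like on the collectors. This isn’t Gloo Platform’s actual configuration (the real one is rendered for you by the Helm chart); the job names, scrape interval, and relabeling rule are illustrative assumptions.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        # Hypothetical job names; the real config is generated by the Gloo Helm chart.
        - job_name: istio-workloads
          scrape_interval: 15s
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # Istio-injected pods expose merged app + Envoy metrics and advertise
            # them via the standard prometheus.io/scrape annotation.
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: "true"
        - job_name: istiod
          kubernetes_sd_configs:
            - role: endpoints
              namespaces:
                names: [istio-system]
```

The prometheus receiver embeds standard Prometheus scrape configuration, which is also what makes adding and removing scrape targets so much easier than before.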

The collectors then apply filters to get rid of all the metrics and labels that we don’t need for our UI; what remains is what we call the Minimum Metrics Set. Finally, they push these metrics to the management cluster via an otlp exporter.
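
Sketched as collector configuration, this filtering and forwarding step could look roughly like the following. The metric and label names are placeholders for illustration (not the actual Minimum Metrics Set), and so is the gateway endpoint.

```yaml
processors:
  # Keep only the metrics the Graph needs (illustrative names, not the real set).
  filter/min-metrics-set:
    metrics:
      include:
        match_type: strict
        metric_names:
          - istio_requests_total
          - istio_request_duration_milliseconds
  # Drop datapoint labels the UI doesn't use (again, an assumed example).
  transform/drop-labels:
    metric_statements:
      - context: datapoint
        statements:
          - delete_key(attributes, "connection_security_policy")
  batch: {}

exporters:
  otlp:
    # Placeholder address of the telemetry gateway in the management cluster;
    # TLS settings omitted for brevity.
    endpoint: gloo-telemetry-gateway.gloo-mesh:4317

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [filter/min-metrics-set, transform/drop-labels, batch]
      exporters: [otlp]
```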

On the other side, we have the Gloo Telemetry Gateway, which is again a collector in disguise, just configured differently (e.g. it runs as a Deployment). It has an otlp receiver as its input (notice that this matches the otlp exporter of the collectors in the workload clusters), and it exposes the received metrics to our Prometheus.
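
Conceptually, the gateway side is the mirror image: otlp in, a Prometheus-scrapable endpoint out. The sketch below is again illustrative; the ports are assumptions.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # matches the otlp exporter of the workload-cluster collectors

processors:
  batch: {}

exporters:
  # Exposes a /metrics endpoint for the management cluster's Prometheus to scrape.
  prometheus:
    endpoint: 0.0.0.0:9091

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```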

Extending the pipeline

Having the default pipeline to drive our Graph is nice, since we have a lot more control than we had before, but now that we have OpenTelemetry in our stack, we can’t stop here, can we?

One of the other benefits of having OTel is its vibrant ecosystem. You can take a look at all the various receivers, processors, and exporters in the contrib repository, and you will probably find what you are looking for.

Once you have all the LEGO pieces, you can compose them into pipelines to power other tools as well. Let’s imagine you want to drive our UI, but you also want to push everything to long-term storage such as Thanos, or to a SaaS provider like DataDog or New Relic. With the help of pipelines, you can easily achieve this. Your security team also needs your logs, after some transformation, for their SIEM system? Not an issue!
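
As an illustration (not the configuration Gloo ships), fanning the same metrics out to several destinations is mostly a matter of listing more exporters in the pipeline. The Thanos endpoint and the DataDog API key below are placeholders.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  prometheus:                      # keeps driving the Graph's Prometheus
    endpoint: 0.0.0.0:9091
  prometheusremotewrite/thanos:    # long-term storage (placeholder endpoint)
    endpoint: https://thanos-receive.example.com/api/v1/receive
  datadog:                         # SaaS provider; API key injected via env var
    api:
      key: ${env:DD_API_KEY}

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus, prometheusremotewrite/thanos, datadog]
```

The SIEM use case would follow the same pattern with a logs pipeline under service.pipelines, combining a suitable log receiver, a transform processor, and whichever exporter your SIEM understands.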

Conclusion & future ideas

As you can see, introducing OpenTelemetry into our stack has opened the doors to a way more flexible and scalable telemetry solution with an extensive ecosystem used by thousands of engineers every day.

We are just getting started! Be on the lookout for new features such as our new Portal Analytics powered by ClickHouse, or enriching existing metrics with cloud provider metadata debuting in Gloo Platform 2.4.0.