Service Mesh for Developers, Part 1: Exploring the Power of Observability and OpenTelemetry

In today’s complex application landscapes, observability is crucial for debugging intricate systems. With service mesh architectures, developers have powerful tools to enhance observability and streamline debugging. In this article, we embark on a journey to explore how observability within a service mesh improves application debugging.

In the first article of the series, we explore the benefits and techniques of using observability within a service mesh for effective debugging. We explain how observability helps understand complex systems and how service mesh provides features like distributed tracing, metrics, and logging for valuable insights into application behavior.

Throughout the series, we cover various aspects of observability, including testing in production and live debugging. We’ll leverage observability tools within a service mesh, such as OpenTelemetry, to uncover real-time insights, validate application behavior, and promptly resolve issues, ensuring reliable and high-performing applications.

Each of the posts comes with a workshop for you to try things out.

Understanding Observability and Service Mesh

Observability helps us understand and debug complex systems by providing insights into application behavior. It allows us to identify and resolve issues effectively, ensuring system reliability and performance. With observability, we gain visibility into request flows, errors, and system performance, enabling us to improve our applications.

Service mesh architectures enhance observability by offering features like distributed tracing, metrics, and logging:

Distributed tracing tracks request journeys, identifying bottlenecks and performance issues.
Metrics provide quantitative data on response times, errors, and resource utilization.
Logging captures events and messages, aiding issue tracking and troubleshooting.

In essence, observability helps us understand and debug complex systems, while service mesh architectures provide integration of different observability tools. By leveraging distributed tracing, metrics, and logging, developers gain valuable insights to troubleshoot effectively and ensure optimal system performance.

What is OpenTelemetry

OpenTelemetry (OTel) helps developers achieve observability in their applications by collecting crucial data for understanding and debugging. Instrumenting applications with OpenTelemetry captures telemetry data on requests, errors, and performance, facilitating efficient problem-solving.

The value of OpenTelemetry is enhanced by its compatibility with different service mesh implementations. The service mesh manages communication between application components, and OpenTelemetry seamlessly integrates with them. This flexibility allows developers to choose monitoring tools without impact on the application.

OTel is based on the three pillars of Observability:

Logs capture textual records of events
Traces provide a distributed view of request flows
Metrics quantify performance and behavior

OpenTelemetry Collector

While OpenTelemetry is an open-source observability framework that provides APIs and SDKs for instrumenting applications, the OpenTelemetry Collector is a separate component that collects, processes, and exports telemetry data from various sources, acting as a flexible intermediary with customization capabilities.

Distributed Tracing with OpenTelemetry Collector

OpenTelemetry facilitates distributed tracing in the service mesh. It helps track requests as they move through different parts of our system and applications, providing insights into their flow and behavior.

To enable trace stitching, tracing tools require the passage of information through headers. As a result, developers must design their applications to appropriately propagate the relevant headers.

In OTel, these are:

Trace Context Headers: The most essential headers for trace propagation are traceparent and tracestate. traceparent contains the trace ID, span ID, and trace flags, while tracestate includes additional contextual information.
Correlation Headers: Headers like correlation-id or x-correlation-id help correlate related requests across different services or components.
Baggage Headers: Baggage headers like baggage-{key} are used to propagate custom contextual information throughout the trace.

More information can be found here,

In the depicted image, the Tracing Client establishes a direct connection with the Tracing Backend. However, in complex topologies, such direct connections may not be feasible, especially in distributed systems.

To enable distributed tracing, OpenTelemetry utilizes the OpenTelemetry Collector, which has components like receivers, processors, and exporters. Receivers collect tracing data from various sources, processors enhance and manipulate the data, and exporters send it to external systems or visualization tools.

For example, in Istio, data emitted from the sidecar proxies is captured by the OpenTelemetry Collector, allowing us to visualize the request flow. By integrating with tools like Jaeger or Zipkin, we can gain insights into request journeys, identify performance issues, and resolve errors.

Metrics Collection with OpenTelemetry

OpenTelemetry Collector collects and exports metrics from applications within a service mesh, providing valuable insights into application performance.

Metrics present a different complexity compared to tracing. Unlike tracing, where information is pushed, the model for metrics involves pulling data. Consequently, a client like Prometheus is required alongside the application. To prevent public exposure of metric endpoints, the client is deployed next to the applications and directly targets the container.

The OpenTelemetry Collector separates the application’s deployment location from Prometheus by employing a client that carries out similar tasks to Prometheus.

Metrics exporters in OpenTelemetry Collector send collected metrics to monitoring systems like Prometheus. These systems visualize and analyze the metrics, offering information on application health and performance.

For example, when instrumenting service mesh with OpenTelemetry Collector, we collect metrics for analysis. OpenTelemetry Collector integrates with a service mesh, scraping metrics that are made available from the sidecar proxies next to your services. We export these metrics to Prometheus or Grafana, visualizing response times, error rates, and resource utilization. This helps identify bottlenecks and troubleshoot issues.

Istio empowers operators to enhance the monitoring experience by providing the ability to manipulate the metrics exposed by Envoy, enabling greater control over the data received by backends.

Later in the article, we will explore how Grafana can assist with visualization.

Logging and OpenTelemetry

OpenTelemetry Collector integrates with logging frameworks to capture detailed application logs, aiding in debugging.

As you can see in the picture, without an OTel Collector, you need an agent (i.e. fluentbit) to collect the application logs. That means adding another tool to the stack which then, needs to be maintained.

To implement logging with OpenTelemetry Collector, configure it to collect and forward logs to log analysis tools like ELK Stack, Splunk or Loki.

Transport Telemetry Across Your Complex Topology

Transporting telemetry across a complex topology becomes seamless with the OpenTelemetry Collector. By leveraging this powerful tool, developers can effortlessly combine multiple collectors to gather and transmit telemetry data from various sources. With the OpenTelemetry Collector, the complexities of transporting telemetry across diverse components are efficiently managed, enabling comprehensive observability and streamlined monitoring in even the most intricate environments.

As shown in the picture below, you have the flexibility to configure multiple collectors that transmit Telemetry signals to other collectors across different clusters using the standard OTel protocol, OTLP. This capability centralizes the operational control of communication within a single component, the collector, instead of relying on individual proxies when using a service mesh.

Many popular providers such as Grafana Cloud, Dynatrace, Datadog, New Relic, Instana, and others actively support and facilitate the integration of OpenTelemetry (OTel) by embracing the OpenTelemetry Protocol (OTLP). This allows for easier and more seamless integration of OTel with these providers’ services and tools.

Best Practices for Effective Debugging with OpenTelemetry

Comprehensive instrumentation: Ensure thorough instrumentation of your applications using OpenTelemetry to capture relevant telemetry data. In the case of service mesh (like Gloo Platform), the platform does this for you.
Trace context propagation: Utilize OpenTelemetry’s context propagation mechanism to maintain trace context across distributed components of your application. This allows you to follow the path of a request through different services and identify potential bottlenecks or issues. In the case of service mesh such as Istio and platforms built on top of them (like Gloo Platform), this task is reduced to just propagating headers.
Granular trace sampling: Configure trace sampling rates appropriately to balance the volume of collected traces with the overhead of capturing and processing them. Adjust the sampling rate based on the importance and performance impact of specific operations or services.
Log correlation: Correlate logs with traces by including trace identifiers (such as trace IDs and span IDs) in log entries. This correlation helps in linking log events with specific trace spans, enabling easier troubleshooting and understanding of the request flow.
Error handling and logging: Implement robust error handling mechanisms and log errors and exceptions with relevant context. Use structured logging formats and include essential details such as error codes, timestamps, and relevant request information. This helps in pinpointing the source of errors during debugging.
Custom attributes and metadata: Leverage OpenTelemetry’s ability to add custom attributes and metadata to captured telemetry data. Include additional contextual information, such as user IDs, session IDs, or specific request parameters, to enhance the visibility and understanding of application behaviour during debugging. As well as removing any sensitive data like passwords or keys that should not be exposed.
Visualization and analysis tools: Utilize visualization and analysis tools compatible with OpenTelemetry, such as observability platforms or logging solutions like ELK Stack or Grafana. These tools provide a rich set of features for visualizing and analyzing telemetry data, making it easier to spot anomalies, detect patterns, and identify performance issues.
Collaboration and knowledge sharing: Foster collaboration among developers and teams by sharing telemetry data and insights captured by OpenTelemetry. Collaborative debugging sessions, code reviews, and post-mortem analyses can help in identifying and resolving complex issues more efficiently.

Gloo Platform Observability Workshop

In this workshop, we will deploy the Gloo Platform with some services as part of the mesh. we will deploy the Grafana stack for observability – Loki for logs, Tempo for traces, Prometheus for metrics and Grafana for visualizations.

The goal is to show how straightforward it is to configure Gloo Platform to collect the observability signals (traces, logs and metrics) so that a developer can use them during debugging in case of failures.

By the end of the workshop, we’ll have the architecture that looks like this:

We’ll use the Gloo telemetry collector and configure it to receive traces, metrics, and logs. Here is a visualization of the collector configuration with receivers, processors and exporters:

Notice that the OTLP exporter will send the telemetry to the Gloo Telemetry Gateway.

After that, the configuration for the gateway collector (receivers, processors and exporters) will look like this picture:

Notice that the exporters will dive different signals (logs, traces and metrics) to the different backends.

Finally, in the workshop, you will learn how to correlate data with Grafana.

Here is the generated button to link logs and traces:

And here are the generated buttons to correlate traces and logs:

To start the workshop, please follow this link to the repository

Conclusion

OpenTelemetry offers significant benefits for observability and debugging within a service mesh. It provides a standardized approach to capture metrics, traces, and logs, enabling developers to gain deep insights into their applications’ behavior. By incorporating OpenTelemetry into the development process, developers can proactively monitor and debug their services, ensuring optimal performance.

OpenTelemetry provides a holistic view of the system’s behavior and it gives us visibility into complex architecture. This allows developers to pinpoint performance bottlenecks, and optimize their applications.

In the next article of the series, we’ll shift our focus to an intriguing aspect of application development: Testing in Production. In Debugging: Mastering Testing in Production within a service mesh, we will explore how observability, within the service mesh architecture, enables us to conduct thorough testing in live production environments. We will uncover the benefits of testing in production, discuss different strategies and techniques, and showcase practical examples of how observability empowers developers to validate application behavior and ensure reliability in real-world scenarios. Join us!

Skip to part 3 to see how all of these pieces fit together.

‍

Service Mesh for Developers, Part 1: Exploring the Power of Observability and OpenTelemetry

Understanding Observability and Service Mesh

What is OpenTelemetry

OpenTelemetry Collector

Distributed Tracing with OpenTelemetry Collector

Metrics Collection with OpenTelemetry

Logging and OpenTelemetry

Transport Telemetry Across Your Complex Topology

Best Practices for Effective Debugging with OpenTelemetry

Gloo Platform Observability Workshop

Conclusion

Featured content

Fortifying Your Cloud Native Connectivity Security Posture with Solo and Ambient Mesh

Migrating from Sidecars to Ambient Mesh - Risks, Challenges, and Benefits

Overhaul of Agent Gateway supporting A2A, MCP, and Kubernetes Gateway API

How Ambient Mesh Delivers Advanced Resource and Cost Savings

Getting Started with Ambient Mesh: From 0 to 100 mph

Agent Discovery, Naming, and Resolution - the Missing Pieces to A2A

Part Two: MCP Authorization The Hard Way

Part One: MCP Authorization The Hard Way

Agent Identity and Access Management - Can SPIFFE Work?

Deep Dive into llm-d and Distributed Inference

Gloo Mesh 2.8 simplifies service mesh operations with new enhanced user experience across multi-cluster environments.

Gloo Gateway 1.19 accelerates context-rich, real-time AI apps with Gateway API

llm-d: Distributed Inference Serving on Kubernetes

AI Reliability Engineering For More Dependable Humans

Kubernetes Identity the Right Way with SPIRE and Ambient

Optimizing GenAI in Production: High-Value Use Cases for AI Gateways

Solo.io Recognized as a Visionary in the 2024 Gartner® Magic Quadrant™ for API Management for the SECOND year in a row.

Guardians of the Governance: GenAI Gateway Guidance with GitOps and Gloo

Istio Ambient Waypoint Proxy explained

Hands-On with the Kubernetes Gateway API and Envoy Proxy: A Tutorial with GitOps and Gloo Gateway

Istio and the State of DevOps: Enhancing Key Metrics

What is an AI Gateway and its role in AI Applications?

Best practices for secure Istio deployment with Gloo Mesh Core

Gloo Mesh 2.6: Istio's Ambient mode now ready for production

HTTP Observability Without Compromises

Advance your knowledge of service mesh tech with Solo.io Academy certifications

Service Mesh for the developer workflow, a series

Challenges of adopting service mesh in enterprise organizations

Service Mesh in the Real World #2 — Ingress Traffic Control

Service Mesh in the Real World Video Series – Episode # 1: Egress Traffic

Service Mesh the easy way with AWS App Mesh and SuperGloo

Webinar Recap: Intro to Service Mesh Hub and SMI

D-TECK Uses Solo.io Gloo Gateway and Google Cloud to Help Businesses Make Better HR Decisions

Minimize the blast radius of changes with Solo.io Gloo Gateway and Weaveworks Flagger

Announcing Service Mesh Interface (SMI) Support and Collaboration

Service Mesh Interface (SMI) and our Vision for the Community and Ecosystem

The need for a standard, service mesh API

SuperGloo to the Rescue! Making it easier to write extensions for Service Mesh

Introducing The Service Mesh Hub -everything you need for your service mesh

Kubernetes Ingress Past, Present, and Future

Solo.io Streamlines Service Mesh and Serverless Adoption for Enterprises in Google Cloud

Ingenico

ParkMobile

Vonage

Domino’s Pizza

Gloo Mesh Feature Comparison

Service Mesh for Developers, Part 1: Exploring the Power of Observability and OpenTelemetry

Service Mesh at Scale

Compare Capabilities of the Top Service Mesh Platforms

Compare Capabilities of the Top API Gateways

Establishing zero trust security for modern cloud architectures

Unlocking the Power of Your API Gateway

API Gateways: Productivity, Resilience, and Security for Next-Generation Cloud Applications

Driving Business Value with Istio

Service Mesh Vendor Comparison

Istio Then & Now

4 Reasons Why You Need an AI Gateway

Gloo Gateway vs. Kong

Gloo Gateway vs. Apigee

3 Reasons You Need an API Gateway for Microservices Apps

Ambient Mesh Lab: EnvoyFilter Support

Ambient Mesh Lab: SPIRE integration with Gloo Mesh in Istio Ambient Mode

Ambient Mesh Lab: Introduction to ztunnel in Ambient Mesh

Solo Academy Course: Service Mesh Basics

Solo Academy Course: Istio Basics

Solo Academy Course: Envoy Basics

Solo Academy Course: API Gateway Basics

Solo Academy Course: Get Started with Istio Service Mesh