Service Mesh for Developers, Part 2: Testing in Production

Introduction

Welcome to the second article in our series on enhancing debugging with observability in a service mesh. In the previous article, we focused on the benefits of observability within the service mesh and how to configure Gloo Platform to bring those benefits to your infrastructure.

In this article, we focus on testing in production: a powerful approach to validating application behavior in real-world scenarios. Leveraging observability in a service mesh enables thorough testing with production data and traffic without putting the business at risk.

We explore the benefits and the role of observability in testing strategies, utilizing approaches like distributed tracing and monitoring to gain insights, identify bottlenecks, and assess changes in a live production environment. 

We discuss strategies such as canary releases, A/B testing, dark launches, and feature toggles, providing practical examples that highlight observability’s role in ensuring application safety and mitigating risks during testing. 

What Is Testing in Production?

Testing in production has emerged as a paradigm shift in application testing methodologies. Historically, the traditional approach involved rigorous testing in controlled staging environments before deploying applications to production. However, this approach often fails to replicate real-world scenarios and to uncover critical issues that only arise in a live production environment. Moreover, maintaining multiple environments at the same time and keeping them in sync is typically painful.

Testing in production involves conducting controlled tests directly within the live production environment. It aims to validate application behavior, performance, and reliability under real-world conditions. By leveraging production data and traffic, developers gain valuable insights into the application’s actual behavior and uncover issues that may remain undetected in staged environments.

The fear associated with testing in production stems from the potential impact on end users and critical systems. The thought of introducing untested changes directly into a live environment raises concerns about system stability, performance degradation, and customer experience, and the prospect of causing downtime or eroding users’ trust in the application can be daunting.

However, with the advancements in observability and the advent of service mesh architectures, developers now have the means to conduct controlled and safe testing in production. 

Let’s look at some real-world examples:

  • Facebook (2012): Facebook experienced a major outage when a bug introduced during a test in the production environment caused cascading failures. The incident affected millions of users and highlighted the risks of conducting tests directly in the live environment.
  • Google (2013): Google conducted a live test in production to assess the impact of a code change on search rankings. The test revealed a significant issue that caused incorrect search results. The prompt discovery led to immediate resolution, improving the accuracy and reliability of Google’s search engine.
  • AWS S3 (2017): Amazon Web Services (AWS) experienced a significant outage that impacted a large number of websites and services relying on the AWS infrastructure. The incident was caused by a human error during debugging in the production environment, leading to the temporary unavailability of the S3 storage service in the affected regions.
  • Cloud storage provider (2018): A cloud-based storage service performed live testing to simulate heavy load conditions. During the test, a scalability bottleneck was discovered that could have severely impacted performance during peak usage. The issue was promptly addressed, ensuring optimal performance for users.
  • Cloudflare (2019): Cloudflare, a leading internet security and CDN provider, experienced a DNS resolution outage that impacted access to numerous websites and services. The incident was attributed to a code deployment issue during testing in the production environment.
  • Mobile banking app (2021): A mobile banking application performed live testing to simulate real-time transaction processing. The test uncovered a rare race condition that could lead to incorrect balance calculations. The issue was quickly addressed, ensuring accurate and reliable financial transactions for users.

In these examples, testing in production sometimes resulted in service disruptions. However, it also exposed hidden vulnerabilities that could have caused significant damage had they remained undiscovered.

While the fear of testing in production is understandable, embracing a well-planned and cautious approach, backed by comprehensive observability, can help mitigate risks and ensure a reliable and robust application in real-world scenarios.

Leveraging Observability and Service Mesh for Testing in Production

Observability is crucial for testing in production, offering real-time insights into application behavior, service interactions, request flows, bottlenecks, and anomalies. It empowers developers to make data-driven decisions and ensures optimal performance, while distributed tracing and monitoring tools identify issues and enable prompt resolution.

Embracing observability and testing in production together enables swift detection of problems and immediate action. For example, if a deployment causes a memory spike, it can be reverted quickly. This dynamic approach ensures efficient testing and real-time resolution.
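For illustration, here is a minimal sketch of a Prometheus alerting rule that could catch such a memory spike; the rule name, threshold, and duration are assumptions for the example, and it presumes that containers have memory limits set:

    groups:
    - name: deployment-safety
      rules:
      # Fire when a container's working-set memory exceeds 90% of its limit
      # for five minutes; both numbers are illustrative, not prescriptive.
      - alert: ContainerMemorySpike
        expr: |
          container_memory_working_set_bytes{container!=""}
            / container_spec_memory_limit_bytes{container!=""} > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory above 90% of the limit; consider reverting the latest deployment"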

Implementing Testing in Production Strategies

Testing effectively in production requires diverse strategies, such as canary releases, A/B testing, dark launches, and feature toggles, that minimize risk and ensure a smooth user experience.

Observability enables their implementation by providing real-time insights into application behavior, request flows, bottlenecks, and the impact of changes. A service mesh like Istio supplies both that telemetry and the traffic-routing primitives these strategies rely on.

There is plenty of content on the Internet about these strategies, so let’s review them quickly:

Blue-Green Strategy

Blue-green is the most basic strategy: we run two environments, one marked with the color Blue and the other marked with the color Green, and switch traffic between them. As an anecdote, the colors were picked to convey neutrality between Environment A and Environment B (and yes, in 2005 these were entire environments).
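In Kubernetes terms, the switch can be as simple as repointing a Service selector from one environment to the other. Below is a minimal sketch under that assumption; the service name and color label are hypothetical:

    apiVersion: v1
    kind: Service
    metadata:
      name: my-app          # hypothetical Service fronting both environments
    spec:
      selector:
        app: my-app
        color: green        # flip to "blue" to switch all traffic at once
      ports:
      - port: 80
        targetPort: 8080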

Canary Strategy

This is probably the most widely used strategy in practice. GitOps tools typically implement it to facilitate the Continuous Delivery (CD) of an application; concretely, the Argo project implements it in Argo Rollouts. The name comes from miners carrying a canary bird to warn them when a tunnel filled with toxic gas: with its small lungs, the canary would suffocate before the miners, alerting them in time.

The key aspect of this strategy is the extensive collection of metrics, which acts as our canary. By gradually shifting traffic in small increments, such as waves of 10%, we allow ourselves the opportunity to roll back to the previous version if any metrics alerts indicate issues during the traffic shift. If there are no alerts, we continue increasing the traffic split until we’re left with 100% traffic going to the new version of the application. 
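Translated into mesh configuration, each wave is just a pair of route weights. Here is a minimal sketch of the first 10% wave as an Istio VirtualService, assuming hypothetical v1 and v2 subsets defined in a matching DestinationRule:

    apiVersion: networking.istio.io/v1beta1
    kind: VirtualService
    metadata:
      name: my-app
    spec:
      hosts:
      - my-app
      http:
      - route:
        - destination:
            host: my-app
            subset: v1
          weight: 90        # the stable version keeps most of the traffic
        - destination:
            host: my-app
            subset: v2
          weight: 10        # first canary wave; raise it only if metrics stay healthy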

A/B Testing

The A/B testing technique comes from marketing. The idea is to compare two versions (A and B) of a webpage, feature, newsletter, or design element. By randomly assigning users to different versions, statistical analysis is performed to determine which version performs better based on predefined metrics, allowing data-driven decision-making and optimization.
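The same routing primitives can express an A/B split. A minimal sketch, assuming hypothetical variant-a and variant-b subsets; note that a real experiment also needs sticky assignment (for example, session affinity on a cookie) so each user consistently sees the same variant:

    apiVersion: networking.istio.io/v1beta1
    kind: VirtualService
    metadata:
      name: my-app
    spec:
      hosts:
      - my-app
      http:
      - route:
        - destination:
            host: my-app
            subset: variant-a
          weight: 50        # version A
        - destination:
            host: my-app
            subset: variant-b
          weight: 50        # version B; compare both against the same predefined metrics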

Dark Launch

Coined by Facebook, the term describes a practice that shares similarities with A/B testing in that both expose a subset of users to new functionality. However, a dark launch primarily introduces completely new features rather than minor tweaks: A/B testing focuses on optimizing existing features for better business outcomes, while dark launches explore opportunities for market expansion through new features.
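Routing can also gate a new feature to an explicit audience. A minimal sketch, assuming a hypothetical x-dark-launch header that is set only for the selected subset of users:

    apiVersion: networking.istio.io/v1beta1
    kind: VirtualService
    metadata:
      name: my-app
    spec:
      hosts:
      - my-app
      http:
      - match:
        - headers:
            x-dark-launch:      # hypothetical header identifying the test audience
              exact: "true"
        route:
        - destination:
            host: my-app
            subset: v2          # the new feature, invisible to everyone else
      - route:
        - destination:
            host: my-app
            subset: v1          # default route for all other users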

Feature Toggles

A feature toggle is a condition, evaluated at the code level, that switches a feature on or off at runtime. To implement this strategy effectively, it may be necessary to combine it with another approach such as A/B testing: rather than branching inside a single deployment, it is often preferable to deploy a separate version of the application and determine which version to serve based on routing.
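For the code-level side, the simplest form of a toggle is a flag the application reads from configuration; the ConfigMap and flag names below are purely illustrative:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: app-feature-flags       # hypothetical ConfigMap consumed by the app
    data:
      FEATURE_NEW_CHECKOUT: "false" # the code checks this flag before enabling the feature
    # Injected into the Deployment as environment variables, e.g.:
    #   envFrom:
    #   - configMapRef:
    #       name: app-feature-flags

Since environment variables only change on pod restart, toggles evaluated per request usually come from a feature-flag service, which is another reason routing-based strategies are often the more flexible option.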

Mitigating Risks and Ensuring Application Safety

Here are some best practices to mitigate risks when testing in production:

  • Gradually roll out changes to a subset of users or traffic using canary releases.
  • Implement feature toggles to enable/disable specific features dynamically during testing.
  • Monitor key metrics, such as latency and error rates, to detect anomalies and performance degradation.
  • Establish clear rollback and mitigation plans to quickly address issues if they arise.
  • Collaborate closely between development, operations, and observability teams for effective communication and decision-making.
  • Use distributed tracing to trace request flows and identify bottlenecks in the service mesh environment.
  • Leverage logging and monitoring tools to gain real-time insights into the behavior and performance of the application during testing.
  • Set up proper alerting mechanisms to promptly notify teams of any anomalies or issues (see the sketch after this list).
  • Conduct thorough pre-production testing to identify potential issues before deploying changes to the live environment.
  • Regularly review and update security measures to protect user data and ensure compliance during testing in production.
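As a concrete example of the monitoring and alerting practices above, here is a minimal sketch of a Prometheus rule that watches the error rate of a new version through Istio’s standard istio_requests_total metric; the 1% threshold, names, and version label value are assumptions:

    groups:
    - name: testing-in-production
      rules:
      # Fire when more than 1% of requests to v2 return a 5xx over five minutes.
      - alert: CanaryHighErrorRate
        expr: |
          sum(rate(istio_requests_total{destination_version="v2", response_code=~"5.."}[5m]))
            / sum(rate(istio_requests_total{destination_version="v2"}[5m])) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate for v2 is above 1%; execute the rollback plan"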

The Complex Case of the Database

Testing in production poses challenges when it comes to interacting with the database. Using a separate data source for testing can introduce inconsistencies, so despite the implementation complexities, the preferred approach is to test directly against the production database. Reading data is straightforward if done with minimal read contention; testing write operations is trickier. One method involves creating dedicated testing users within the production database so that tests operate on their own data without affecting real user data. This approach can feel daunting, as test operations run alongside live data.

Testing in Production Workshop

The Testing in Production workshop aims to provide you with practical insights and hands-on experience in leveraging observability tools within a service mesh (Gloo Platform) to enhance your testing strategies in a production environment.

With the application deployed in the previous post of the series, you will deploy an app-v2 and test it in production using one of the strategies described in this post (dark launch).

The goal is to show how simple it is to configure Gloo Platform to execute these strategies and speed up the development lifecycle.

From the previous article, you learned how to leverage observability and OpenTelemetry. Now you will be able to use that capability to discover bugs in the application under test.

To start the workshop, please follow this link to the repository.

Conclusion

Testing in production is crucial, and observability significantly enhances testing strategies in a service mesh environment (like Istio). It is important to strike a balance, understanding the benefits and risks of testing in a live production setting. By leveraging observability tools, developers can create robust testing practices that ensure the reliability and high performance of applications.

Embracing observability enables real-time insights, facilitates data-driven decisions, and empowers developers to address issues promptly. By adopting a thoughtful and informed approach, developers can embrace the power of testing in production while delivering exceptional user experiences.

In the final article, we’ll explain how all these practices blend into the big picture of app development.