Service Mesh for Developers, Part 2: Testing in Production

Introduction

Welcome to the second article in our series on enhancing debugging with observability in a service mesh. In the previous article, we focused on the benefits of observability within the service mesh and how to configure Gloo Platform to bring those benefits to your infrastructure.

In this article, we focus on testing in production: a powerful approach to validating application behavior in real-world scenarios. Leveraging observability in a service mesh enables thorough testing with production data and traffic without putting the business at risk.

We explore the benefits and the role of observability in testing strategies, utilizing approaches like distributed tracing and monitoring to gain insights, identify bottlenecks, and assess changes in a live production environment. 

We discuss strategies such as canary releases, A/B testing, dark launches, and feature toggles, providing practical examples that highlight observability’s role in ensuring application safety and mitigating risks during testing. 

What Is Testing in Production?

Testing in production has emerged as a paradigm shift in application testing methodologies. Historically, the traditional approach involved rigorous testing in controlled staging environments before deploying applications to production. However, this approach often fails to replicate real-world scenarios and to uncover critical issues that only arise in a live production environment. Moreover, maintaining multiple environments at the same time and keeping them in sync is typically painful.

Testing in production involves conducting controlled tests directly within the live production environment. It aims to validate application behavior, performance, and reliability under real-world conditions. By leveraging production data and traffic, developers gain valuable insights into the application’s actual behavior and uncover issues that may remain undetected in staged environments.

The fear associated with testing in production stems from the potential impact on end users and critical systems. The thought of introducing untested changes directly into a live environment raises concerns about system stability, performance degradation, and customer experience, and the prospect of causing downtime or eroding users’ trust in the application can be daunting.

However, with the advancements in observability and the advent of service mesh architectures, developers now have the means to conduct controlled and safe testing in production. 

Let’s look at some real-world examples:

  • Facebook (2012): Facebook experienced a major outage when a bug introduced during a test in the production environment caused cascading failures. The incident affected millions of users and highlighted the risks of conducting tests directly in the live environment.
  • Google (2013): Google conducted a live test in production to assess the impact of a code change on search rankings. The test revealed a significant issue that caused incorrect search results. The prompt discovery led to immediate resolution, improving the accuracy and reliability of Google’s search engine.
  • AWS S3 (2017): Amazon Web Services (AWS) experienced a significant outage that impacted a large number of websites and services relying on the AWS infrastructure. The incident was caused by a human error during debugging in the production environment, leading to the temporary unavailability of the S3 storage service in the affected regions.
  • Cloud storage provider (2018): A cloud-based storage service performed live testing to simulate heavy load conditions. During the test, a scalability bottleneck was discovered that could have severely impacted performance during peak usage. The issue was promptly addressed, ensuring optimal performance for users.
  • Cloudflare (2019): Cloudflare, a leading internet security and CDN provider, experienced a DNS resolution outage that impacted access to numerous websites and services. The incident was attributed to a code deployment issue during testing in the production environment.
  • Mobile banking app (2021): A mobile banking application performed live testing to simulate real-time transaction processing. The test uncovered a rare race condition that could lead to incorrect balance calculations. The issue was quickly addressed, ensuring accurate and reliable financial transactions for users.

In these examples, testing in production sometimes resulted in service disruptions. However, it also exposed hidden vulnerabilities that could have caused significant damage had they remained undiscovered.

While the fear of testing in production is understandable, embracing a well-planned and cautious approach, backed by comprehensive observability, can help mitigate risks and ensure a reliable and robust application in real-world scenarios.

Leveraging Observability and Service Mesh for Testing in Production

Observability is crucial for testing in production, offering real-time insights into application behavior, service interactions, request flows, bottlenecks, and anomalies. It empowers developers to make data-driven decisions and ensures optimal performance, while distributed tracing and monitoring tools identify issues and enable prompt resolution.

Embracing observability and testing in production together enables swift detection of problems and immediate action. For example, if a deployment causes a memory spike, it can be reverted quickly. This dynamic approach ensures efficient testing and real-time resolution.
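For illustration, here is a minimal sketch of a Prometheus alerting rule that could catch such a memory spike; the rule name, threshold, and duration are assumptions for the example, and it presumes that containers have memory limits set:

    groups:
    - name: deployment-safety
      rules:
      # Fire when a container's working-set memory exceeds 90% of its limit
      # for five minutes; both numbers are illustrative, not prescriptive.
      - alert: ContainerMemorySpike
        expr: |
          container_memory_working_set_bytes{container!=""}
            / container_spec_memory_limit_bytes{container!=""} > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory above 90% of the limit; consider reverting the latest deployment"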

Implementing Testing in Production Strategies

Testing effectively in production requires diverse strategies, such as canary releases, A/B testing, dark launches, and feature toggles, that minimize risk and ensure a smooth user experience.

Observability enables their implementation by providing real-time insights into application behavior, request flows, bottlenecks, and the impact of changes. A service mesh like Istio supplies both that telemetry and the traffic-routing primitives these strategies rely on.

There is plenty of content on the Internet about these strategies, so let’s review them quickly:

Blue-Green Strategy

Blue-green is the most basic strategy: we run two environments, one marked with the color Blue and the other marked with the color Green, and switch traffic between them. As an anecdote, the colors were picked to convey neutrality between Environment A and Environment B (and yes, in 2005 these were entire environments).
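In Kubernetes terms, the switch can be as simple as repointing a Service selector from one environment to the other. Below is a minimal sketch under that assumption; the service name and color label are hypothetical:

    apiVersion: v1
    kind: Service
    metadata:
      name: my-app          # hypothetical Service fronting both environments
    spec:
      selector:
        app: my-app
        color: green        # flip to "blue" to switch all traffic at once
      ports:
      - port: 80
        targetPort: 8080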

Canary Strategy

This is probably the most widely used strategy in practice. GitOps tools typically implement it to facilitate the Continuous Delivery (CD) of an application; concretely, the Argo project implements it in Argo Rollouts. The name comes from miners carrying a canary bird to warn them when a tunnel filled with toxic gas: with its small lungs, the canary would suffocate before the miners, alerting them in time.

The key aspect of this strategy is the extensive collection of metrics, which acts as our canary. By gradually shifting traffic in small increments, such as waves of 10%, we allow ourselves the opportunity to roll back to the previous version if any metrics alerts indicate issues during the traffic shift. If there are no alerts, we continue increasing the traffic split until we’re left with 100% traffic going to the new version of the application. 
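Translated into mesh configuration, each wave is just a pair of route weights. Here is a minimal sketch of the first 10% wave as an Istio VirtualService, assuming hypothetical v1 and v2 subsets defined in a matching DestinationRule:

    apiVersion: networking.istio.io/v1beta1
    kind: VirtualService
    metadata:
      name: my-app
    spec:
      hosts:
      - my-app
      http:
      - route:
        - destination:
            host: my-app
            subset: v1
          weight: 90        # the stable version keeps most of the traffic
        - destination:
            host: my-app
            subset: v2
          weight: 10        # first canary wave; raise it only if metrics stay healthy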

A/B Testing

The A/B testing technique comes from marketing. The idea is to compare two versions (A and B) of a webpage, feature, newsletter, or design element. By randomly assigning users to different versions, statistical analysis is performed to determine which version performs better based on predefined metrics, allowing data-driven decision-making and optimization.
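The same routing primitives can express an A/B split. A minimal sketch, assuming hypothetical variant-a and variant-b subsets; note that a real experiment also needs sticky assignment (for example, session affinity on a cookie) so each user consistently sees the same variant:

    apiVersion: networking.istio.io/v1beta1
    kind: VirtualService
    metadata:
      name: my-app
    spec:
      hosts:
      - my-app
      http:
      - route:
        - destination:
            host: my-app
            subset: variant-a
          weight: 50        # version A
        - destination:
            host: my-app
            subset: variant-b
          weight: 50        # version B; compare both against the same predefined metrics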

Dark Launch

Coined by Facebook, the term describes a practice that shares similarities with A/B testing in that both expose a subset of users to new functionality. However, a dark launch primarily introduces completely new features rather than minor tweaks: A/B testing focuses on optimizing existing features for better business outcomes, while dark launches explore opportunities for market expansion through new features.
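Routing can also gate a new feature to an explicit audience. A minimal sketch, assuming a hypothetical x-dark-launch header that is set only for the selected subset of users:

    apiVersion: networking.istio.io/v1beta1
    kind: VirtualService
    metadata:
      name: my-app
    spec:
      hosts:
      - my-app
      http:
      - match:
        - headers:
            x-dark-launch:      # hypothetical header identifying the test audience
              exact: "true"
        route:
        - destination:
            host: my-app
            subset: v2          # the new feature, invisible to everyone else
      - route:
        - destination:
            host: my-app
            subset: v1          # default route for all other users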

Feature Toggles

A feature toggle is a condition, evaluated at the code level, that switches a feature on or off at runtime. To implement this strategy effectively, it may be necessary to combine it with another approach such as A/B testing: rather than branching inside a single deployment, it is often preferable to deploy a separate version of the application and determine which version to serve based on routing.
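For the code-level side, the simplest form of a toggle is a flag the application reads from configuration; the ConfigMap and flag names below are purely illustrative:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: app-feature-flags       # hypothetical ConfigMap consumed by the app
    data:
      FEATURE_NEW_CHECKOUT: "false" # the code checks this flag before enabling the feature
    # Injected into the Deployment as environment variables, e.g.:
    #   envFrom:
    #   - configMapRef:
    #       name: app-feature-flags

Since environment variables only change on pod restart, toggles evaluated per request usually come from a feature-flag service, which is another reason routing-based strategies are often the more flexible option.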

Mitigating Risks and Ensuring Application Safety

Here are some best practices to mitigate risks when testing in production:

  • Gradually roll out changes to a subset of users or traffic using canary releases.
  • Implement feature toggles to enable/disable specific features dynamically during testing.
  • Monitor key metrics, such as latency and error rates, to detect anomalies and performance degradation.
  • Establish clear rollback and mitigation plans to quickly address issues if they arise.
  • Collaborate closely between development, operations, and observability teams for effective communication and decision-making.
  • Use distributed tracing to trace request flows and identify bottlenecks in the service mesh environment.
  • Leverage logging and monitoring tools to gain real-time insights into the behavior and performance of the application during testing.
  • Set up proper alerting mechanisms to promptly notify teams of any anomalies or issues (see the sketch after this list).
  • Conduct thorough pre-production testing to identify potential issues before deploying changes to the live environment.
  • Regularly review and update security measures to protect user data and ensure compliance during testing in production.
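As a concrete example of the monitoring and alerting practices above, here is a minimal sketch of a Prometheus rule that watches the error rate of a new version through Istio’s standard istio_requests_total metric; the 1% threshold, names, and version label value are assumptions:

    groups:
    - name: testing-in-production
      rules:
      # Fire when more than 1% of requests to v2 return a 5xx over five minutes.
      - alert: CanaryHighErrorRate
        expr: |
          sum(rate(istio_requests_total{destination_version="v2", response_code=~"5.."}[5m]))
            / sum(rate(istio_requests_total{destination_version="v2"}[5m])) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate for v2 is above 1%; execute the rollback plan"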

The Complex Case of the Database

Testing in production poses challenges when it comes to interacting with the database. Using a separate data source for testing can introduce inconsistencies, so despite the implementation complexities, the preferred approach is to test directly against the production database. Reading data is straightforward if done with minimal read contention; testing write operations is trickier. One method involves creating dedicated testing users within the production database so that tests operate on their own data without affecting real user data. This approach can feel daunting, as test operations run alongside live data.

Testing in Production Workshop

The Testing in Production workshop aims to provide you with practical insights and hands-on experience in leveraging observability tools within a service mesh (Gloo Platform) to enhance your testing strategies in a production environment.

With the application deployed in the previous post of the series, you will deploy an app-v2 and test it in production using one of the strategies described in this post (dark launch).

The goal is to show how simple it is to configure Gloo Platform to execute these strategies and speed up the development lifecycle.

From the previous article, you learned how to leverage observability and OpenTelemetry. Now you will be able to use that capability to discover bugs in the application under test.

To start the workshop, please follow this link to the repository.

Conclusion

Testing in production is crucial, and observability significantly enhances testing strategies in a service mesh environment (like Istio). It is important to strike a balance, understanding the benefits and risks of testing in a live production setting. By leveraging observability tools, developers can create robust testing practices that ensure the reliability and high performance of applications.

Embracing observability enables real-time insights, facilitates data-driven decisions, and empowers developers to address issues promptly. By adopting a thoughtful and informed approach, developers can embrace the power of testing in production while delivering exceptional user experiences.

In the final article, we’ll explain how all these practices blend into the big picture of app development.