Solving Microservice Mysteries With Envoy’s Tap Filter

Solo.io Engineering | November 22, 2019

How to reproduce and resolve seemingly random intermittent server failures in O(1) time.

2019 has been a big year for Envoy.

Continuing with the observability and customization themes, my personal favorite new Envoy feature is the Tap filter. I’ll show you how to use it, and I encourage you to try it if you have not already.

What is the Tap filter?

The Tap filter allows you to capture a request and its response when certain criteria are met. The captured data can be streamed to another service or stored in a filesystem. Unlike other tools that are safe to make widely available to your dev teams, the Tap filter captures a full copy of all data included in the request and response. The match criteria are very expressive, so you can pinpoint the traffic that you are interested in investigating. You can limit the capture size to avoid wasteful captures; by default, 1 kibibyte is buffered per body.

There is also a tap filter for TCP traffic (envoy.transport_sockets.tap), but I’ll just talk about HTTP taps today.

Here is a minimal config to get started. This tells Envoy to capture all HTTP traffic and write it to the taps directory in files prefixed with the “any” string.
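As a rough sketch of what such a minimal config could look like, the Python snippet below builds the filter entry as a dictionary and prints it as JSON, which Envoy accepts alongside YAML. The field names (match_config, any_match, file_per_tap, and the v2alpha type URL) reflect my understanding of the tap API from around the time of this article and are assumptions to verify against the docs for your Envoy version; newer releases use v3 type names.

```python
import json

# Assumed field names from the v2alpha tap API; check them against the
# Envoy docs for the version you run before relying on this.
minimal_tap_filter = {
    "name": "envoy.filters.http.tap",
    "typed_config": {
        "@type": "type.googleapis.com/envoy.config.filter.http.tap.v2alpha.Tap",
        "common_config": {
            "static_config": {
                # Match every HTTP request/response pair.
                "match_config": {"any_match": True},
                "output_config": {
                    "sinks": [
                        {
                            # Bodies are emitted as base64-encoded bytes.
                            "format": "JSON_BODY_AS_BYTES",
                            # One file per tap, named with this prefix.
                            "file_per_tap": {"path_prefix": "taps/any"},
                        }
                    ]
                },
            }
        },
    },
}

# Print an entry that can be pasted into the http_filters list.
print(json.dumps(minimal_tap_filter, indent=2))
```

The entry belongs in the HTTP connection manager's http_filters list, ahead of the router filter.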

There are several powerful use cases for the Tap filter. Since naming things is much of the fun in writing code, and since the Tap filter has such a great name, let’s dive into the etymology while we review some use cases that can accelerate your development team.

Tap: sample live data flows

Imagine that your service is embedded in a complicated architecture that feeds and manages a large reservoir of data and data flows — something like this brewery schematic.

Typical micro(brew)service

You want to make sure that the quality is maintained without contaminating the mixture.

A subcomponent providing a particular service

A tap allows you to draw a representative sample of the real product without imposing a measurable impact on the system.

Tapping the system for quality control

Specifically, you can configure Envoy to capture taps only when a particular request header is sent. This means you can control exactly when a record is made by injecting the particular header.

Coming soon: my teammate Yuval is working on an API to enable fractional sampling (in addition to an enhanced dynamic configuration API). With fractional sampling you can easily keep a record of system behavior over a span of time.

Here is an example of header-triggered Tap collection:
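A header-triggered tap might be sketched as follows. The header name x-tap-capture and the taps/header prefix are made up for this illustration, and the field names (http_request_headers_match, present_match) are my assumption of the v2alpha tap API. The dictionary is the static_config portion that goes under the tap filter's common_config.

```python
import json

# "x-tap-capture" is a hypothetical header name chosen for this sketch.
header_triggered_tap = {
    "match_config": {
        "http_request_headers_match": {
            # Capture only when the request carries this header at all.
            "headers": [{"name": "x-tap-capture", "present_match": True}]
        }
    },
    "output_config": {
        "sinks": [
            {
                "format": "JSON_BODY_AS_BYTES",
                "file_per_tap": {"path_prefix": "taps/header"},
            }
        ]
    },
}

print(json.dumps(header_triggered_tap, indent=2))
```

With a config along these lines, sending a request with the chosen header present would produce a tap file, while ordinary traffic would pass through unrecorded.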

Tap: robust configurable observability

Imagine instead that your system is a highly modular, reconfigurable, and scalable microservice. Many small configurable components work together to deliver a tuned system.

Highly parametric, well-modeled microservice

As with an optics bench, you want to be able to assemble and disassemble your system, swapping out and reusing components to deliver solutions faster and with greater accuracy.

An optics bench is a sort of container orchestration platform

Tapping your system, rather than riveting or welding components, gives you traction without any significant overhead. Like a mechanical tapping tool, the Tap filter supports highly configurable workflows:

Taps provide traction and system flexibility

As shown here, you can include multiple tap specifications in a single filter chain. This may be useful if you want to capture standard snapshots across your organization in addition to team- or feature-specific snapshots.
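One way to picture several taps in a single chain is the sketch below, which places two tap filters ahead of the router. The tap_filter helper, the x-team-debug header, and the path prefixes are hypothetical names introduced for this example. The filter names and v2alpha fields are assumptions to check against your Envoy version (newer releases call the router envoy.filters.http.router).

```python
import json


def tap_filter(match_config, path_prefix):
    """Hypothetical helper: build one HTTP tap filter entry (v2alpha names)."""
    return {
        "name": "envoy.filters.http.tap",
        "typed_config": {
            "@type": "type.googleapis.com/envoy.config.filter.http.tap.v2alpha.Tap",
            "common_config": {
                "static_config": {
                    "match_config": match_config,
                    "output_config": {
                        "sinks": [
                            {
                                "format": "JSON_BODY_AS_BYTES",
                                "file_per_tap": {"path_prefix": path_prefix},
                            }
                        ]
                    },
                }
            },
        },
    }


# Organization-wide snapshot: record every response with a 500 status.
org_tap = tap_filter(
    {"http_response_headers_match": {
        "headers": [{"name": ":status", "exact_match": "500"}]}},
    "taps/org",
)

# Team-specific snapshot: record requests carrying a made-up debug header.
team_tap = tap_filter(
    {"http_request_headers_match": {
        "headers": [{"name": "x-team-debug", "present_match": True}]}},
    "taps/team",
)

# The router stays last in the chain.
http_filters = [org_tap, team_tap, {"name": "envoy.router"}]
print(json.dumps(http_filters, indent=2))
```

Giving each tap its own path prefix keeps the organization-wide and team-specific captures in separate files.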

Alternatively, you can deploy the Tap filter on a sidecar Envoy. This allows you to keep your route config separate from your tap specs and makes it easier to reuse common configs.

My teammate Eitan recently implemented a Web Application Firewall Filter for Envoy and found the Tap filter useful for doing development on Envoy itself. We anticipate this being a valuable tool during the creation of WASM filters.

Tap: assistance in error recovery

I think the Tap filter best earns its name as a disaster mitigation tool. Imagine that your system is a key part of a life support system, such as this astronaut’s breathing apparatus.

A microservice with high criticality and slow iteration cycle

Making things worse, mission-critical systems tend to have slow iteration cycles.

What could go wrong?

If anything goes wrong, you will be haunted by the failure. Wouldn’t you rather the spirit take the form of a helpful poltergeist and “Tap” you on the back with a clue towards disaster recovery?

Don’t let your requests error in vain — learn from them.

This behavior can be achieved easily with the config shown here. When a response has a 500 (Internal Server Error) status code, the request and response are recorded.
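A sketch of an error-triggered tap, assuming the v2alpha field names and that the response status is exposed through the :status pseudo-header, could look like this. The dictionary is the static_config portion of the tap filter.

```python
import json

# Assumed v2alpha field names; the status code is matched as a string
# against the ":status" response pseudo-header.
error_tap = {
    "match_config": {
        "http_response_headers_match": {
            "headers": [{"name": ":status", "exact_match": "500"}]
        }
    },
    "output_config": {
        "sinks": [
            {
                "format": "JSON_BODY_AS_BYTES",
                "file_per_tap": {"path_prefix": "taps/error"},
            }
        ]
    },
}

print(json.dumps(error_tap, indent=2))
```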

Additionally, it is worth noting that Envoy’s match criteria are recursive: predicates can be nested and combined, so they support highly specific filtering. If, for example, you release version 1.0 of a product and want to watch for internal server errors produced by this latest release, you can easily do so.
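To illustrate the nesting, the sketch below combines two predicates with and_match so that a tap fires only for 500 responses to requests tagged as version 1.0. The x-app-version header is a hypothetical way of identifying the release, and the field names (and_match, rules) are my assumption of the v2alpha match predicate.

```python
import json

# Leaf predicate: the response status is 500.
is_server_error = {
    "http_response_headers_match": {
        "headers": [{"name": ":status", "exact_match": "500"}]
    }
}

# Leaf predicate: the request names release 1.0 ("x-app-version" is made up).
is_release_1_0 = {
    "http_request_headers_match": {
        "headers": [{"name": "x-app-version", "exact_match": "1.0"}]
    }
}

# Each rule is itself a full match predicate, so rules can nest further
# (for example, an or_match or not_match inside this and_match).
versioned_error_match = {"and_match": {"rules": [is_server_error, is_release_1_0]}}

print(json.dumps({"match_config": versioned_error_match}, indent=2))
```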

Software Engineering Mythologies Volume I: On User Education

While we’re talking about exciting new software features and how to adopt them into your workflow, it’s worth reflecting on some of the associated lore.

One of my favorite stories from software engineering mythology is that the real reason Microsoft shipped games with their operating systems was to educate users about new features.

  • Minesweeper taught the “right click”.
  • FreeCell taught the “drag and drop” gesture.
  • And I like to think that Crackdown 3 was made to demonstrate the availability of integrated cloud computing.

“Toys are tools for learning” (Barry Kudrowitz)

When I think about this I can’t help feeling like I’ve been played. On the other hand, I wonder what other games I don’t know I should be playing right now.

Just look at how many features Envoy has now, and how many more are on the horizon with the growing community, newly supported platforms, and WASM. We have enough features to fill an arcade.

The fact that networked game engines use (or should be using) Envoy as a platform aside, the Envoy feature set alone could make a great mini-game arcade.

Software Engineering Mythologies Volume II: On User Experience

On the dark side of the relevant software engineering legends, there’s a story about Steve Jobs that I can never forget. He asked his team how hard they would work to shave a second off of the Mac’s boot time if it meant saving someone’s life.

Pretty hard

He then noted that, given the millions of users, the cumulative time spent waiting for startup amounted to several human lifespans.

In this sense, even if we are not developing breathing apparatus for astronauts, the systems that we develop concern the realities of life and death.

The faster we can find and fix bugs in our systems, the more fully everyone can experience life and growth. There is a lot of work going into making systems more robust and accelerating feature delivery. It is what everyone goes to KubeCon to find. And still, whenever something goes wrong in a microservices environment, you need to become a detective.

This is exactly where the Tap filter can help, so let’s get started with a concrete example.

Mystery Service Game

For the sake of discussion, let’s say you have nine services deployed in six different cloud environments, coordinating over six API versions. That alone, just like the game of Clue, is 324 different combinations. If only debugging our systems were so simple.
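The count is simply the product of the three dimensions, which a few lines of Python can confirm by enumerating every (service, environment, API version) triple:

```python
from itertools import product

services = range(9)       # nine services
environments = range(6)   # six cloud environments
api_versions = range(6)   # six API versions

# Every possible (service, environment, API version) combination.
combinations = list(product(services, environments, api_versions))
print(len(combinations))  # 324
```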

I implemented a simple version of Clue, currently hosted on envoy.games, to help you kick the tires on the Tap filter and to showcase the Tap filter’s value. Like real tested and deployed systems, the service only crashes under certain circumstances which you have not yet identified and handled. In the game, you can explore the house with various identities and objects.

Simple game app to demonstrate the challenge of reproducing and resolving intermittent failures. Of course, real systems have nearly infinite combinations of possible input.

The app is stable until you strike upon the “crime” — guessing correctly produces a 500 (Internal Server Error) response code on the back end and a (very scary) notification that the infrastructure team has been paged on the front end.

No user or dev team wants to see this. Fortunately, the Tap filter can help expedite the resolution.

Since we had configured the Tap filter to watch for errors such as this, we can immediately review the contents of the request body and implement the fix more swiftly.

The tap file will be captured in the specified output directory, in a file named with the prefix and a timestamp. For example, an output config with a file_per_tap path_prefix of “taps/error” would write a file like this: taps/error_3319132123456.json.
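Given that naming pattern, collecting the captures for one prefix is a simple glob. The find_taps helper below is a hypothetical convenience written for this example, and the demonstration creates an empty stand-in file in a temporary directory rather than reading real taps.

```python
import glob
import os
import tempfile


def find_taps(path_prefix):
    """Return tap files named <path_prefix>_<suffix>.json, sorted by name."""
    return sorted(glob.glob(glob.escape(path_prefix) + "_*.json"))


# Demonstration against a throwaway directory holding one empty stand-in file.
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "error")
    open(f"{prefix}_3319132123456.json", "w").close()
    found = [os.path.basename(path) for path in find_taps(prefix)]

print(found)
```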

An example Tap file is shown below. Note that the headers and body from the request and response have been captured.

In this Tap config, I chose to encode the body as base64-encoded bytes rather than as text. This is recommended to avoid issues with special characters and non-text data. It is easy to recover and format the (in this case, JSON-encoded) content with the command:
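As one possible way to do the decoding, here is a minimal Python sketch. It assumes the tap was written in the JSON_BODY_AS_BYTES format and follows the http_buffered_trace layout as I understand it (request and response objects, each with headers and a body holding as_bytes). The sample tap and its payload fields are invented for illustration and are not the game's real data.

```python
import base64
import json


def decode_body(tap, side):
    """Return the raw body bytes for "request" or "response" in a buffered tap."""
    body = tap["http_buffered_trace"][side]["body"]
    return base64.b64decode(body["as_bytes"])


# Made-up payload standing in for a captured request body.
payload = {"suspect": "A", "room": "B", "weapon": "C"}
sample_tap = {
    "http_buffered_trace": {
        "request": {
            "headers": [{"key": ":method", "value": "POST"}],
            "body": {
                "truncated": False,
                "as_bytes": base64.b64encode(
                    json.dumps(payload).encode("utf-8")
                ).decode("ascii"),
            },
        }
    }
}

# Decode the base64 body, parse it as JSON, and pretty-print it.
decoded = json.loads(decode_body(sample_tap, "request"))
print(json.dumps(decoded, indent=2))
```

For a real capture, load the tap file with json.load and pass the result to decode_body in place of sample_tap.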

We see that the body of the bad request contains:

This tells us that our service has an issue with this particular input and we should review the code path that handles inputs like this. As far as the mystery service game goes, we know that A committed the crime in room B with object C.

For completeness, let’s decode the response body and see what our service’s final words were (I don’t expect the output to have been JSON formatted so I will not format the output with jq):
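The same step for a response body that may not be JSON can be sketched in Python by decoding to text and skipping the parse. The sample message is a placeholder invented for this example, not the service's actual output, and the body layout (as_bytes, truncated) is assumed as before.

```python
import base64

# Placeholder response body; the real service's message is not reproduced here.
sample_response = {
    "body": {
        "truncated": False,
        "as_bytes": base64.b64encode(b"something went wrong").decode("ascii"),
    }
}

# Decode to text without assuming JSON; undecodable bytes are replaced
# rather than raising an error.
raw = base64.b64decode(sample_response["body"]["as_bytes"])
text = raw.decode("utf-8", errors="replace")
print(text)
```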

We see that the body of the response to the bad request contains:

Hmm, looks like we could have written a better error message. Fortunately, we were able to get all the information we need from the Tap filter!

Using the Tap filter in production

The Tap filter is a robust, core Envoy feature. You can think of it like the shadowing capability, but with arbitrary request and response matching capabilities and more options regarding the output sink. I have included some example configs in the mystery service repo to help get you started and the Envoy Tap docs cover the complete configuration details.

At Solo.io we are working extensively with Envoy in our API gateway, Gloo, and in a production-grade traffic capture and replay tool called Loop (built on top of the Tap filter). You can learn more about Loop from our presentation at KubeCon Europe 2019 (slides).

The contents of this article were presented at EnvoyCon 2019, in a talk titled “Solving Microservice Murder Mysteries with Envoy’s Tap Filter” (recording).

Please feel free to reach out on Slack or Twitter if I can help clarify anything. Thanks!
