KubeCon EU 2019 Talk Recap — Chaos Debugging

At this year’s KubeCon EU in Barcelona, our CEO and founder Idit Levine and I presented together on Chaos Debugging. Chaos engineering is a concept popularized by the Netflix engineering team as a way to find and eliminate weak points in distributed systems before they become a problem for the application’s end users. Although most application teams won’t see the kind of traffic peaks Netflix does when a new series drops, these concepts apply to anyone building microservices.

What’s Changed: Distributed microservices are great for innovating quickly and shipping new features more often, but they also present new challenges because existing developer tools were not designed for this type of environment. As application architecture evolves toward distributed, loosely coupled, and ephemeral services, the stress testing and debugging framework must evolve with it.

This presents us with three new questions for distributed environments:

  1. How do you debug microservices?
  2. How do you debug microservices issues in production?
  3. How do you proactively prevent application issues?

In the session, we presented a “Chaos Debugging” framework enabled by three technologies that directly address those questions: Squash, a microservices debugger; Gloo Shot, a chaos orchestration tool; and Loop, a traffic replay tool. Squash and Gloo Shot are available as open source; Loop is not yet released.

Squash answers the question of “how do you debug microservices” by bringing the strength of modern debuggers and the convenience of their IDEs to microservices developers. Squash uses popular, powerful, and mature debuggers (gdb, dlv, Java debugging) and integrates them seamlessly with Kubernetes and containers. Once Squash is installed, it presents you with a list of all the containers running in your cluster, attaches a debugger to the process of your choice, and allows you to debug in a Kubernetes environment just as you would in your local environment.

Loop addresses concerns around debugging in a live production environment by recording production traffic and replaying it in a sandbox. Loop does this by extending the service mesh to observe your application and record full end-to-end transactions. In the case of an anomaly, Loop saves the recording, snapshots the environment state and data, then spins it up in a separate sandbox with Squash for replay and debugging.
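To make the record-and-replay idea concrete, here is a minimal sketch in plain Python. It is not Loop's implementation or API (Loop is unreleased, and it works at the service mesh level); `CartService`, `Recorder`, and `replay` are hypothetical names invented for illustration. The sketch only shows the general pattern: snapshot the state, log the requests, and on an anomaly re-run the same requests against a fresh copy in a sandbox.

```python
import copy


class CartService:
    """Toy stateful service standing in for a real microservice (hypothetical)."""

    def __init__(self, state=None):
        self.state = state if state is not None else {}

    def handle(self, request):
        op, item = request
        if op == "add":
            self.state[item] = self.state.get(item, 0) + 1
        elif op == "remove":
            # Bug: raises KeyError when the item was never added.
            self.state[item] -= 1
        return dict(self.state)


class Recorder:
    """Wraps a service, logs every request, and saves a capture on failure."""

    def __init__(self, service):
        self.service = service
        # Snapshot of the state as it was when recording started.
        self.snapshot = copy.deepcopy(service.state)
        self.requests = []
        self.captures = []

    def handle(self, request):
        self.requests.append(request)
        try:
            return self.service.handle(request)
        except Exception as exc:
            # Anomaly: keep everything needed to reproduce it later.
            self.captures.append({
                "snapshot": copy.deepcopy(self.snapshot),
                "requests": list(self.requests),
                "error": repr(exc),
            })
            raise


def replay(capture, service_factory):
    """Re-run a capture against a fresh sandbox instance.

    Returns the repr of the first error raised, or None if nothing fails.
    """
    sandbox = service_factory(copy.deepcopy(capture["snapshot"]))
    for request in capture["requests"]:
        try:
            sandbox.handle(request)
        except Exception as exc:
            return repr(exc)
    return None
```

In this sketch, a failure seen by the wrapped "production" service can be reproduced by calling `replay(recorder.captures[0], CartService)`, which is the point where you would attach a debugger in the sandbox instead of in production.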

Gloo Shot helps teams improve application resilience and harden their environments with chaos engineering experiments. Gloo Shot integrates with all major service meshes to implement advanced and realistic stress and failure tests in a controlled manner. Gloo Shot’s systematic investigation framework helps you increase your microservices’ “immunity” to issues.
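As a rough illustration of what a controlled failure experiment measures, the sketch below injects faults into a toy service call and compares the success rate with and without client retries. It is plain Python and does not use Gloo Shot or any service mesh; `FaultInjector`, `call_with_retries`, and `run_experiment` are hypothetical names, and the fault schedule (fail every Nth call) is an assumption chosen to keep the result reproducible.

```python
class FaultInjector:
    """Wraps a callable and fails every `fail_every`-th call (hypothetical)."""

    def __init__(self, target, fail_every):
        self.target = target
        self.fail_every = fail_every
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        if self.fail_every and self.calls % self.fail_every == 0:
            raise ConnectionError("injected fault")
        return self.target(*args)


def call_with_retries(fn, attempts, *args):
    """Call fn up to `attempts` times (attempts >= 1), re-raising the last error."""
    last_error = None
    for _ in range(attempts):
        try:
            return fn(*args)
        except ConnectionError as exc:
            last_error = exc
    raise last_error


def run_experiment(fail_every, attempts, requests=200):
    """Return the fraction of requests that succeed under injected faults."""
    service = FaultInjector(lambda x: x * 2, fail_every)
    succeeded = 0
    for i in range(requests):
        try:
            if call_with_retries(service, attempts, i) == i * 2:
                succeeded += 1
        except ConnectionError:
            pass  # the request failed even after retries
    return succeeded / requests


if __name__ == "__main__":
    print("no retries: ", run_experiment(fail_every=3, attempts=1))
    print("one retry:  ", run_experiment(fail_every=3, attempts=2))
```

Running the same experiment before and after a resilience change, such as adding a retry, shows whether the change actually helps under failure, which is the kind of question a chaos experiment is meant to answer.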

When used together, Squash, Loop and Gloo Shot provide a powerful set of tools for developers to build and maintain stronger, more resilient microservices. Try out the projects, give us feedback, contribute or star your favorite repo.

For more info: