
Tracking the Golden Signals for Istio’s Control Plane

One of the greatest benefits of running Istio is the detailed, insightful metrics, logs, and traces it produces for your services. Being able to add dashboards with p99 latency, failure rates, and upstream and downstream failures to your developer experience tooling is a huge time saver for your service teams. No longer do they have to toil over a custom library that produces boilerplate metrics.


The plan of Jeremy Bentham’s panopticon prison as drawn by Willey Reveley in 1791. Source


While this is all well and good for observing your microservices, Quis custodiet ipsos custodes? Who is watching the watcher? How can you ensure that the control plane is monitored accurately too?

The Four Golden Signals


An early 1918 signal lamp used by the army to send signals over long distances using a shutter. Source

The four golden signals – latency, traffic, errors, and saturation (aka the four horsemen of your site outage) – are generally considered the way for an SRE to observe and detect problems in infrastructure. Istio is no different, and out of the box there are a few things you should be observing to cover those signals.

Latency

While the Istio control plane does not route data-plane traffic through istiod, it still has a lot of critical work to do, and that work takes time: specifically, the time it takes for Istio to process configuration events and push the results out over xDS. Istio Pilot constantly receives events (pod creation, service discovery changes, etc.), and after a default debounce of 100ms (see PILOT_DEBOUNCE_AFTER) it batches those events and pushes the resulting configuration. All the affected Envoy sidecars then pick up those changes via xDS. This can be observed in total with pilot_proxy_convergence_time, which measures the time from event creation, through convergence and push, to the Envoy xDS update. The time between that push and it propagating to the rest of the cluster is the propagation delay. If this gets too large you can end up in a state where your mesh is under constant change: pushes become so latent that configuration stays in flux and your services return 5xxs. Understanding this metric, and its history, lets you tune the frequency of pushes for your service mesh.
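If you scrape istiod with Prometheus, pilot_proxy_convergence_time is exposed as a histogram, so a p99 view is one query away. Below is a minimal sketch of an alerting rule on that quantile; the 5m window, 10s threshold, and rule/alert names are assumptions you would tune for your own mesh.

# Sketch: alert when p99 xDS convergence time stays high.
# pilot_proxy_convergence_time is a histogram exposed by istiod;
# the window, threshold, and names here are illustrative assumptions.
groups:
  - name: istiod-latency
    rules:
      - alert: IstioProxyConvergenceSlow
        # p99 time from a config event to proxies receiving updated xDS config
        expr: |
          histogram_quantile(0.99,
            sum(rate(pilot_proxy_convergence_time_bucket[5m])) by (le)
          ) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: p99 xDS convergence time has been above 10s for 10 minutes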

Traffic

As noted above, the Istio control plane is always listening for events, merging them, and pushing to Envoy to keep the proxies updated. While your microservices are at a steady state, with little change in pods, deployments, or services, there is not much for the control plane to do. That is a rare state though. One of the draws of a microservice architecture is constant deployment, coupled with some of the stronger points of Kubernetes: automated recovery and scaling. In reality, your production cluster is likely undergoing constant change. These events pile up: the more services you have, and the more interconnected those services are, the more events get generated. If you have no service isolation (see the Sidecar configuration), the number of events can become overwhelming. Fortunately, you can observe this with pilot_xds_pushes. Understanding why and when this increases helps you scope out which services you need to isolate so you can continue to scale out more services over time.
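If you want this on a dashboard, a simple per-type rate of pilot_xds_pushes shows both how busy istiod is and which xDS resource types (cds, eds, lds, rds) dominate. Here is a minimal sketch of a Prometheus recording rule; the group and record names are assumptions.

# Sketch: record the per-second rate of xDS pushes, broken out by type.
# Group and record names are illustrative; adjust the window to taste.
groups:
  - name: istiod-traffic
    rules:
      - record: istiod:pilot_xds_pushes:rate1m
        expr: sum(rate(pilot_xds_pushes[1m])) by (type)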

Errors

While this may seem like the most obvious signal (just log errors from istiod!), we can do better. Logging errors is critical and should be done, but you may also want to keep track of pilot_total_xds_internal_errors, as well as periodically run istioctl analyze. Remember your old friend istioctl? The one you probably haven't touched since your early Istio installations, left to collect dust after you moved to Helm? Well, it has some really great debugging tools, not the least of which is istioctl analyze, which will surface configuration errors that won't necessarily show up as failures anywhere else (you can't report what you can't see).
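For the metric side of this, a counter that should stay flat is easy to alert on. A minimal sketch, assuming a 15-minute window and a warning severity (both arbitrary choices to adapt):

# Sketch: alert on any internal xDS errors reported by istiod.
# pilot_total_xds_internal_errors is a counter; window and labels are assumptions.
groups:
  - name: istiod-errors
    rules:
      - alert: IstioXdsInternalErrors
        expr: increase(pilot_total_xds_internal_errors[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: istiod reported internal xDS errors in the last 15 minutes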

Saturation

So we know how many events we are getting, how long they take to propagate, and how often they error, but how much can our mesh take before falling over? While there is no easy linear way to answer this question, there is a very real breaking point in how much one istiod pod can handle. On top of that, somewhat contrary to most Kubernetes deployments, horizontal scaling isn't an easy and immediate fix. In fact, it can be detrimental to immediately scale horizontally (it becomes an N+1 problem, as each additional istiod pod adds more work to keep in sync with every other istiod pod that is spun up). Your best bet here is the traditional CPU and RAM usage metrics on the actual pod. Combining that information with the metrics above (specifically pilot_proxy_convergence_time and pilot_xds_pushes) makes it easy to see when you are approaching the limit of what the istiod pod is provisioned for. Saturation shows up as istiod consuming more and more CPU to process larger and larger queues until you hit a breaking point. If you observe both traffic (pilot_xds_pushes) and latency (pilot_proxy_convergence_time) increasing unabated, you will find yourself in this state.
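One way to watch that breaking point is to compare istiod's CPU usage against its configured limit. The sketch below assumes cAdvisor and kube-state-metrics are being scraped, that istiod runs in the istio-system namespace, and that its container is named discovery; swap in your own labels and threshold.

# Sketch: warn when the istiod container runs hot against its CPU limit.
# Namespace, container name, threshold, and window are assumptions.
groups:
  - name: istiod-saturation
    rules:
      - alert: IstiodCpuNearLimit
        expr: |
          sum(rate(container_cpu_usage_seconds_total{namespace="istio-system", container="discovery"}[5m]))
            /
          sum(kube_pod_container_resource_limits{namespace="istio-system", container="discovery", resource="cpu"})
            > 0.8
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: istiod CPU usage has been above 80% of its limit for 15 minutes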

Listening to the Signs


A Morse key used for transmitting Morse code over telegraph lines. Source

So now you know what to look out for before being confronted with a problem. Ultimately, this is a great exercise in prevention. Being able to see these things coming, alerting early on outliers, and having a historical record of normal levels is paramount in staving off an outage.

What you do to control your infrastructure is up to you. For some teams, restricting scope with the Sidecar resource is prohibitive, so the cost shows up as high CPU on istiod; for others, higher latency in xDS pushes is not a concern, and they may ratchet up the number of events batched before each push (see PILOT_DEBOUNCE_AFTER above).
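If you do decide to restrict scope, a namespace-wide Sidecar resource is the usual lever: it limits each proxy's configuration to its own namespace plus the control plane, which in turn shrinks the amount of work each push costs istiod. A minimal sketch follows; my-namespace is a placeholder for one of your application namespaces.

# Sketch: a default Sidecar for one namespace, limiting egress config to
# services in the same namespace and the Istio control plane.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: my-namespace
spec:
  egress:
    - hosts:
        - "./*"            # services in this namespace
        - "istio-system/*" # istiod and other control plane services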

Everyone is going to have different solutions, but the underlying metrics that inform those decisions are the same. Having these metrics in a dashboard and tracking them over time is going to help you understand your scaling. The reality is this is a constant battle; in most cases the number of services in a mesh only goes up over time. Having a game plan and historic data to see where you may run into problems is a key factor in keeping your service mesh healthy and scaled appropriately.