In 2023, A Service Mesh Should Not Be Complex

As a developer building or consuming APIs, you probably care a lot about the language you use, your libraries, and things like your favorite HTTP client. You plug in a URL/context path and consume the data from the API. You probably don’t care how the connection gets created, routed, secured, tunneled, etc. under the covers. This is where a service mesh comes into the picture: it can offload these responsibilities to the networking infrastructure. Instead of “implementing these concerns” directly in your code, you configure these behaviors as part of the infrastructure.
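To make “implementing these concerns directly in your code” concrete, here is a minimal Python sketch of the kind of retry-with-backoff wrapper applications often end up carrying around. The function name and parameters are made up for illustration; the point is only that this is logic a mesh can take over at the network layer.

```python
import time


def call_with_retries(send, attempts=3, backoff_seconds=0.1, sleep=time.sleep):
    """Call send() up to `attempts` times, with exponential backoff between failures.

    `send` is any zero-argument callable that performs the request and raises
    ConnectionError on a network failure (a hypothetical stand-in for an HTTP client call).
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts:
                # Wait 0.1s, 0.2s, 0.4s, ... before the next try.
                sleep(backoff_seconds * 2 ** (attempt - 1))
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

Every service that talks over the network tends to need some variation of this, usually re-implemented per language and per HTTP client. With a mesh, the same behavior becomes configuration instead of code.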

Service mesh capabilities such as secure, authenticated, and authorized communication, resilient failover/fallbacks, and metrics/monitoring are powerful and help developers do their jobs, but the API used to configure these capabilities matters just as much. Low-level network APIs may appear too complex for developers to configure these behaviors with, and they are.

As powerful as networking is, it should be abstracted to the right levels. Networking can be very complex (have you pulled back the covers on what the network is doing to get an HTTP request through?!), and someone definitely needs to understand it in detail, but probably not developers.

On the other hand, platform engineers, SRE teams, or security and network teams should understand the layers in the network so they can correctly monitor, configure behaviors and policy, and more importantly, debug things when they go wrong. This is no different for service mesh technology.

Istio has gained a reputation for “being complex”. Part of that reputation was gained during the maturation process of the project (as well as the industry), when things didn’t always work, features regressed, or implementation details leaked through. Istio also has a large feature set, which can lead to uncertainty about where to start. Another reason is that, early on in the project, user experience (day 1 and day 2) didn’t get a lot of attention.

A lot of work has since gone into significantly improving the UX, feature and API stability, and generally reducing friction when using the project. This work has manifested itself in Istio being the most mature, most deployed at scale service mesh in the world.

I would argue one of the biggest contributors to this reputation was circumstantial: users would bootstrap a Kubernetes cluster, slap Istio on it, and tell developers “Go have at it”. This is a mistake. Istio’s API was not intended to be used by end-developers. Istio’s API is a powerful API that should be automated by some higher-order platform (the same is true about Kubernetes, BTW).

Further Reducing Complexity

To date, there are still some complexity concerns that follow Istio. For example, Kubernetes did not treat “sidecar containers” as first-class citizens until recently, which causes issues when container ordering is significant. Since the sidecar must trap all traffic entering and leaving the pod, complex iptables rules must be in place, and these can trip up some types of applications. Kubernetes Jobs don’t gracefully terminate when a sidecar is present. Sizing resources (limits and requests) can also be a bit tricky: the sidecar requires its own resources, and if you’re going to rely on resource consumption for autoscaling, things can get thrown off. Whenever a developer must know that the sidecar is in the picture and can trip up the application, Istio becomes a bit of a leaky abstraction, and that’s a recipe for intermittent, complex edge cases cropping up.
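The autoscaling point is easier to see with numbers. The sketch below is a simplified model, not the autoscaler's actual code: it assumes pod-level CPU utilization is computed as total usage over total requests across all containers in the pod (which is how the Horizontal Pod Autoscaler treats a plain resource metric), and it ignores details such as the tolerance band. All usage and request figures are hypothetical.

```python
import math


def cpu_utilization_percent(containers):
    """Pod-level CPU utilization: summed usage over summed requests, in percent."""
    usage = sum(c["usage_m"] for c in containers)
    requests = sum(c["request_m"] for c in containers)
    return 100 * usage / requests


def desired_replicas(current_replicas, current_percent, target_percent):
    """Simplified HPA rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_percent / target_percent)


# Hypothetical numbers, in millicores: a busy app next to a mostly idle sidecar.
app = {"name": "app", "usage_m": 450, "request_m": 500}
sidecar = {"name": "istio-proxy", "usage_m": 30, "request_m": 100}

app_only = cpu_utilization_percent([app])               # app alone is at 90%
with_sidecar = cpu_utilization_percent([app, sidecar])  # the pod as a whole reads 80%

target = 80  # hypothetical autoscaling target, in percent
print(desired_replicas(4, app_only, target))      # 5: the app alone would trigger a scale-up
print(desired_replicas(4, with_sidecar, target))  # 4: with the sidecar counted, no scale-up
```

In this made-up case the application container is running hot, but the idle sidecar's request dilutes the pod-level number enough that the autoscaler sees nothing to do. The developer now has to reason about the sidecar to understand their own scaling behavior.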

At Solo.io, we helped create Istio Ambient Mesh and are driving it to GA in the open source project. Ambient mesh eliminates the need to run a sidecar in the application’s Pod and pushes networking policy, behavior, and concerns to where they belong: in the network. Ambient mesh places control points within the request path to implement the service mesh behavior but is “ambient”, or transparent, to the applications.

We at Solo believe Istio Ambient Mesh is a major step toward removing the last pieces of complexity that end users have to think about when using Istio. To learn more about Istio Ambient Mesh, take a look at the book Lin (@linsun_unc) and I (@christianposta) wrote: Istio Ambient Mesh Explained.

We don’t stop there, however. We have our own enterprise, LTS, ARM, and FIPS-certified builds of Istio that are drop-in replacements for the community builds (i.e., we do NOT fork Istio). We leverage Istio in a prescribed way that adds a higher-level API, one that better addresses end-user and developer needs while shielding them from the lower-level Istio API.

For example, a developer will likely care about the resilience aspects of their service’s communication (e.g., timeout, retry, circuit-breaker). In the Istio API, those concerns are nested in both the VirtualService and DestinationRule APIs, which also control just about every other feature in the mesh. Misconfiguring them can cause massive issues. With Solo, we introduce a RetryTimeoutPolicy object that lets developers scope down and focus on just the pieces they care about, or apply labels to their services to automatically include them in specific platform-wide policies, and let the higher-order automation engine configure Istio’s lower-level API.
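To illustrate the idea of a small, developer-facing policy being expanded into Istio’s lower-level objects, here is a rough Python sketch. The input policy shape and the function name are hypothetical and are NOT the RetryTimeoutPolicy schema or how Gloo Mesh works internally. The generated resources use real VirtualService and DestinationRule fields (timeout, retries, outlierDetection), with outlier detection values picked arbitrarily.

```python
import json


def render_istio_resilience(service_host, namespace, policy):
    """Expand a small, hypothetical resilience policy into two Istio resources."""
    virtual_service = {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": f"{policy['name']}-routes", "namespace": namespace},
        "spec": {
            "hosts": [service_host],
            "http": [{
                "route": [{"destination": {"host": service_host}}],
                # Timeouts and retries live on the HTTP route...
                "timeout": policy["timeout"],
                "retries": {
                    "attempts": policy["retry_attempts"],
                    "perTryTimeout": policy["per_try_timeout"],
                    "retryOn": "5xx,connect-failure",
                },
            }],
        },
    }
    destination_rule = {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "DestinationRule",
        "metadata": {"name": f"{policy['name']}-outliers", "namespace": namespace},
        "spec": {
            "host": service_host,
            # ...while circuit-breaking style settings live on a separate object.
            "trafficPolicy": {
                "outlierDetection": {
                    "consecutive5xxErrors": 5,   # illustrative values only
                    "interval": "10s",
                    "baseEjectionTime": "30s",
                }
            },
        },
    }
    return [virtual_service, destination_rule]


# Hypothetical developer-facing input: only the knobs the developer cares about.
policy = {"name": "reviews", "timeout": "2s", "retry_attempts": 3, "per_try_timeout": "500ms"}
resources = render_istio_resilience("reviews.default.svc.cluster.local", "default", policy)
print(json.dumps(resources, indent=2))
```

Four fields of input become two separate Istio objects, each of which could also carry routing, load balancing, TLS, and more. That gap is what a higher-level API is meant to hide.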

Another example is the FailoverPolicy object, which allows SRE teams to specify exact behaviors for multi-cluster/zone/region failover. A final example is the Workspace object, which manages and defines tenancy within Istio and gives very fine-grained RBAC over policy changes.

In 2023, Running a Service Mesh Is Straightforward with Solo

In short, running a service mesh in 2023 should be as simple as possible, and Solo.io has made that the case. Removing the infrastructure pieces from the application Pod, no longer forcing end-developers to understand and configure Istio’s low-level API, and providing the additional tenancy, failover, and metrics functionality that large-scale service mesh deployments require are together a large step in the direction of an [ambient] mesh everywhere. Istio Ambient Mesh, along with the most full-featured release of Gloo Mesh, will be available in the upcoming 2.4 release (arriving in a few weeks!). Join our Slack or reach out to our technical teams to learn more.