Introducing Autopilot: An open-source project for adaptive service mesh

The many benefits of a service mesh come from its taking full control of the communication network within a cluster. However, this control also increases your exposure to misconfiguration and human error. There are fantastic tools to simplify service mesh configuration and improve resilience, but they all depend on continuous action from a human operator. We believe that these tasks should be automated, and we propose the notion of an adaptive mesh: a mesh that continuously senses changes within its environment and automatically adjusts to them. Today we are announcing Autopilot, an open-source project that turns your service mesh into an adaptive service mesh by building and deploying Service Mesh Operators.

The great advantages of a service mesh do not come without risk

A service mesh is an infrastructure layer that handles service-to-service communication. It abstracts the network to provide advanced capabilities, including encryption, authentication and authorization, routing, monitoring, and tracing, and it hides that complexity from your application.

The service mesh bears full responsibility for routing all traffic within your cluster, and "with great power comes great responsibility". Because the mesh is the sole controller of in-cluster network traffic, an incorrect or outdated configuration can severely degrade application performance, compromise security and expose the application to external attacks, or, in the worst case, bring the entire network down. Flawless configuration of the service mesh therefore becomes critical, which raises the crucial question:

How do you make sure your service mesh is resilient?

The community offers several approaches to enhance the resilience of the service mesh.

The Service Mesh Interface (SMI) simplifies your mesh configuration. The significance of correct configuration first appears during the initial installation of the service mesh. To establish a clear, convenient, and safe configuration process, we announced, together with leaders in the service mesh community, the Service Mesh Interface (SMI), a specification that covers the most common service mesh capabilities. Being Kubernetes-native and provider-agnostic, the SMI defines a common standard for service meshes.

The Service Mesh Hub discovers and validates your mesh configuration. In May 2019, we introduced the Service Mesh Hub (previously known as SuperGloo, launched in 2018), an open-source abstraction layer that implements the SMI and automates the installation and management of all service meshes in your cluster. The Hub installs any mesh and automatically discovers all existing meshes. Because it is aware of your entire cluster, the Service Mesh Hub can continuously validate the mesh configuration and make it easy to change that configuration safely.

A GitOps pattern automates the configuration process and reduces the chance of error. As your cluster, application, and ecosystem evolve, you may need to change your service mesh configuration to match, or you may want to add third-party extensions to your service mesh over time. The Service Mesh Hub enforces a GitOps pattern for making these changes: it treats your mesh configuration the same way you treat your code, by maintaining a repository and requiring pull requests (PRs). The Service Mesh Hub provides a single pane of glass for monitoring, installing, and managing the service mesh and its extensions. The combination of highly automated processes and the GitOps pattern is designed to minimize human errors that could put your service mesh at risk.

An adaptive service mesh automatically adjusts its configuration to the changes in the environment

A microservices environment is highly dynamic by nature. Some changes reflect the evolution of the infrastructure, deployments of new business functions, and enhancements to privacy and security; others may be adversarial, including deliberate attempts to breach the system. The successful operation of a service mesh therefore requires continuous monitoring, rapid identification of changes that require intervention, and expeditious execution of an optimal response.

Many tools are available that allow a human operator to monitor and adjust service mesh configuration. However, keeping a service mesh consistently healthy and performant requires automating these processes. A service mesh that automatically adjusts its configuration to changes in the environment, which we call an adaptive mesh, relieves the end user of the need to continuously monitor the service mesh, accelerates the response, and prevents end-user errors.

For example, an adaptive service mesh automatically identifies security vulnerabilities and isolates the compromised services to protect the rest of the environment. During a canary deployment, it automatically controls the ratio of traffic between the new and stable versions based on their performance. When code is checked in to Git, it automatically creates a route to the new service.
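The canary scenario can be made concrete with a minimal sketch. It is an illustration only, not Autopilot code: the `CanaryMetrics` shape, the `next_canary_weight` function, and the thresholds and step size are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class CanaryMetrics:
    """Metrics observed for the canary in one evaluation window (hypothetical shape)."""
    success_rate: float    # fraction of successful requests, 0.0 to 1.0
    p99_latency_ms: float  # 99th-percentile latency in milliseconds


def next_canary_weight(
    current_weight: int,
    metrics: CanaryMetrics,
    min_success_rate: float = 0.99,     # assumed threshold
    max_p99_latency_ms: float = 500.0,  # assumed threshold
    step: int = 10,                     # assumed traffic increment, in percent
) -> int:
    """Return the percentage of traffic to send to the canary next.

    A healthy canary receives `step` percent more traffic, capped at 100.
    An unhealthy canary is rolled back and receives no traffic.
    """
    healthy = (
        metrics.success_rate >= min_success_rate
        and metrics.p99_latency_ms <= max_p99_latency_ms
    )
    if not healthy:
        return 0
    return min(100, current_weight + step)
```

In an adaptive mesh, a function like this would run on every evaluation window, and the resulting weight would be written into the mesh's traffic-splitting configuration without a human in the loop.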

Implementing an adaptive service mesh using service mesh operators

Operators are a popular pattern for automating repeatable tasks in service management beyond what is provided by Kubernetes itself. The same pattern can be used to automate service mesh management and integrate service mesh capabilities with core Kubernetes features.

Building Kubernetes Operators is facilitated by available SDKs, such as the Operator Framework by Red Hat and Kubebuilder from the Kubernetes SIGs. These simplify and accelerate the development of Kubernetes Operators. However, their domain knowledge ends with vanilla Kubernetes, and they provide no out-of-the-box integration points with service meshes. The work of integrating with a mesh falls upon the developer.

Introducing Autopilot: A Service Mesh Operators Framework

For this reason, we at Solo.io built the Service Mesh Autopilot, an opinionated SDK and toolkit for developing and deploying Service Mesh Operators. By treating the service mesh as a first-class concept, Autopilot makes it easy to build Service Mesh Operators that automate and extend a service mesh in the same way Kubernetes Operators automate and extend Kubernetes. Autopilot generates scaffolding for Operators, then builds and deploys them to run against a local or remote Kubernetes cluster with a service mesh installed.

Autopilot implements a control loop composed of Watchers, which provide the input; a state machine, which provides the brain; and Workers, which perform the required actions. Watchers are service mesh-specific sensors that can follow service mesh metrics, CRDs, webhooks, and so on. End users define the possible states and the transition rules between them; transitions are triggered by events that the Watchers generate. End users also provide the workloads to be executed by the Workers. Workers that need to change the configuration of the service mesh can either apply the change directly or follow a GitOps pattern and submit it as a PR.
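The shape of such a control loop can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Autopilot's actual API: the state names, event names, worker functions, and `run_control_loop` are all hypothetical, and the list of events stands in for what Watchers would emit.

```python
from typing import Callable, Dict, List, Tuple

# Transition rules defined by the end user: (current state, event) -> next state.
# All state and event names here are hypothetical.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("Initializing", "canary_deployed"): "Evaluating",
    ("Evaluating", "metrics_healthy"): "Promoting",
    ("Evaluating", "metrics_degraded"): "RollingBack",
}


def shift_traffic_to_canary() -> str:
    # Placeholder for a worker that would update the mesh routing config.
    return "route 100% of traffic to canary"


def restore_stable_route() -> str:
    # Placeholder for a worker that would revert the mesh routing config.
    return "route 100% of traffic to stable"


# Workers run when the state machine enters the given state.
WORKERS: Dict[str, Callable[[], str]] = {
    "Promoting": shift_traffic_to_canary,
    "RollingBack": restore_stable_route,
}


def run_control_loop(initial_state: str, events: List[str]) -> Tuple[str, List[str]]:
    """Feed watcher events through the state machine and run workers.

    Returns the final state and the actions the workers performed.
    Events with no matching transition rule are ignored.
    """
    state = initial_state
    actions: List[str] = []
    for event in events:
        next_state = TRANSITIONS.get((state, event))
        if next_state is None:
            continue
        state = next_state
        worker = WORKERS.get(state)
        if worker is not None:
            actions.append(worker())
    return state, actions
```

For instance, calling `run_control_loop("Initializing", ["canary_deployed", "metrics_degraded"])` walks the machine into the rollback state and runs the rollback worker. Keeping the transition rules in a plain table separates what the user declares (states and rules) from what the framework drives (the loop itself).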

At Solo.io, we originally developed Autopilot to streamline our own development process. Because Autopilot is self-generating, we were able to cut the time needed to develop service mesh extensions from months to days.

We invite the community to try Autopilot and to join us in identifying more scenarios, more operators, and more features.

A technical introduction to Autopilot, along with demos of several use cases, can be found here. We encourage you to check out our GitHub repo and the docs, and to join our Slack channel. Happy KubeCon from Solo.io!