Autopilot: an operator framework for building workflows on top of service mesh

Autopilot is a recently announced and open-sourced project from Solo.io that gives you a framework for building opinionated operators for automated workflows on top of a service mesh. These types of workflows typically take signals or telemetry from the environment to decide what action to take next. Just as a “pilot” observes their surroundings and decides how best to guide the airplane, an autopilot automates those decisions.

At Solo.io, we believe the true power of a service mesh comes from its programmable interfaces. Autopilot allows you to automate the service mesh interface to do interesting things like canary automation, chaos experimentation, adaptive security, and more. In the past, doing so would have been brittle, hand-crafted, and bespoke. Let’s take a closer look.

Autopilot in action

Autopilot lets you define the states for your automated workflow, and generates the scaffolding for the controller that lets you plug in your business logic. With this new project, you define a new Custom Resource Definition (CRD) which will be used to configure the controller.

Autopilot init

The best way to understand Autopilot is by example. Download one of the latest releases and follow along using the ap CLI.

The first thing we need to do is initialize our new project with ap init:

ceposta@postamaclab(src) $ ap init example && cd example
INFO[0000] Creating Project Config: kind:"Example" apiVersion:"example.io/v1" operatorName:"example-operator" phases: phases: phases:
go: creating new go.mod: module example

This init step created a couple of initial resources that are used to define the state machine for our controller loop:

The key file from this list is the autopilot.yaml file, which defines the Autopilot “phases”, the set of states the control loop can be in. Let’s take a look:

apiVersion: example.io/v1
kind: Example
operatorName: example-operator
phases:
- description: Example has begun initializing
  initial: true
  name: Initializing
  outputs:
  - virtualservices
- description: Example has begun processing
  inputs:
  - metrics
  name: Processing
  outputs:
  - virtualservices
- description: Example has finished
  final: true
  name: Finished

Here we see three phases: Initializing, Processing, and Finished. If you were building a control loop for driving a canary release, you might have phases such as Initializing, Waiting, Promoting, and Rollback, with transitions between them where appropriate.

Notice that each phase definition specifies its inputs and outputs. These determine the parameters passed to the business logic behind that phase.
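To make the idea of phases and transitions concrete, here is a minimal, standard-library-only sketch of the canary phases mentioned above as a transition table. The phase names come from the text, but the specific transitions, the Phase type, and the CanTransition helper are assumptions made for illustration. They are not part of Autopilot or its generated code.

```go
package main

import "fmt"

// Phase names one state of the control loop. The type is hypothetical;
// Autopilot generates its own phase type from autopilot.yaml.
type Phase string

const (
	Initializing Phase = "Initializing"
	Waiting      Phase = "Waiting"
	Promoting    Phase = "Promoting"
	Rollback     Phase = "Rollback"
	Finished     Phase = "Finished"
)

// transitions lists, for each phase, the phases it may move to next.
// These edges are an assumed example of a canary workflow.
var transitions = map[Phase][]Phase{
	Initializing: {Waiting},
	Waiting:      {Waiting, Promoting, Rollback},
	Promoting:    {Waiting, Finished, Rollback},
	Rollback:     {Finished},
}

// CanTransition reports whether moving from one phase to another is allowed.
// A phase with no entry in the table (such as Finished) has no way out.
func CanTransition(from, to Phase) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(Waiting, Promoting))
	fmt.Println(CanTransition(Finished, Waiting))
}
```

Writing the transitions down as data keeps the allowed moves in one place, which is roughly the role autopilot.yaml plays for the generated controller.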

Autopilot generate

Once we’ve defined the phases for our control loop, we can generate the rest of the scaffolding for our project.

ceposta@postamaclab(example) $ ap generate

This should give our directory some generated code that becomes our controller:

ceposta@postamaclab(example) $ ls -l
total 152
-rw-r--r-- 1 ceposta staff 119 Nov 17 09:49 autopilot-operator.yaml
-rw-r--r-- 1 ceposta staff 376 Nov 17 09:49 autopilot.yaml
drwxr-xr-x 4 ceposta staff 128 Nov 17 10:06 build
drwxr-xr-x 3 ceposta staff 96 Nov 17 10:06 cmd
drwxr-xr-x 12 ceposta staff 384 Nov 17 10:06 deploy
-rw-r--r-- 1 ceposta staff 2133 Nov 17 09:49 go.mod
-rw-r--r-- 1 ceposta staff 64681 Nov 17 10:06 go.sum
drwxr-xr-x 4 ceposta staff 128 Nov 17 10:06 hack
drwxr-xr-x 7 ceposta staff 224 Nov 17 10:06 pkg

There are a few important concepts to know about once the code has been generated:

the Spec defining your CRD
workers for each of the defined phases
the scheduler that coordinates the workers

In the next section, we’ll take a look at these concepts.

Implementing the brains of our controller

At this point, we have a fully functioning Autopilot controller, but it doesn’t do much yet. The first thing we will want to do is define what the Custom Resource Definition should look like. For example, when building a canary automation system, you may want something that specifies how to roll out the canary deployment, including how frequently to do so, what telemetry to observe, and what success looks like.

Building the custom resource definition

Take, for example, the following custom resource, which defines some basic canary-release configuration:

apiVersion: autopilot.examples.io/v1
kind: CanaryDeployment
metadata:
  name: example
spec:
  ports:
  - 9898
  successThreshold: 100
  measurementInterval: 1m
  analysisPeriod: 10s
  deployment: example

To add this CRD to our project, we need to fill in the $BASE/pkg/apis/example/v1/spec.go file. Right now it looks like this:

package v1

type ExampleSpec struct {
	// INSERT ADDITIONAL SPEC FIELDS - desired state of cluster
}

To build the CanaryDeployment CRD, we could add something like this:

package v1

import (
	v1 "k8s.io/api/apps/v1" // assumption: the source of DeploymentSpec
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type ExampleSpec struct {
	v1.DeploymentSpec
	// ports for which traffic should be split (between primary and canary)
	Ports []int32 `json:"ports,omitempty"`
	// Over what interval should we measure the success rate?
	MeasurementInterval metav1.Duration `json:"measurementInterval"`
	// Success rate the canary must maintain for the given analysisPeriod
	SuccessThreshold float64 `json:"successThreshold,omitempty"`
	// How long should we process the canary for before promoting?
	AnalysisPeriod metav1.Duration `json:"analysisPeriod,omitempty"`
}
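As a rough illustration of how values such as 1m and 10s in the resource end up in duration-typed fields, here is a standard-library-only sketch. The Duration type below is a simplified stand-in for metav1.Duration, which also decodes a JSON string using Go's duration syntax. CanarySpec and ParseSpec are hypothetical names, and the embedded deployment spec is left out to keep the example self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Duration is a stand-in for metav1.Duration: it decodes a JSON string
// such as "1m" into a time.Duration.
type Duration struct {
	time.Duration
}

// UnmarshalJSON parses a quoted duration string like "10s".
func (d *Duration) UnmarshalJSON(b []byte) error {
	var s string
	if err := json.Unmarshal(b, &s); err != nil {
		return err
	}
	parsed, err := time.ParseDuration(s)
	if err != nil {
		return err
	}
	d.Duration = parsed
	return nil
}

// CanarySpec mirrors the fields added to the spec above (hypothetical name).
type CanarySpec struct {
	Ports               []int32  `json:"ports,omitempty"`
	MeasurementInterval Duration `json:"measurementInterval"`
	SuccessThreshold    float64  `json:"successThreshold,omitempty"`
	AnalysisPeriod      Duration `json:"analysisPeriod,omitempty"`
}

// ParseSpec decodes the JSON form of a spec section.
func ParseSpec(data []byte) (CanarySpec, error) {
	var spec CanarySpec
	err := json.Unmarshal(data, &spec)
	return spec, err
}

func main() {
	raw := `{"ports":[9898],"successThreshold":100,"measurementInterval":"1m","analysisPeriod":"10s"}`
	spec, err := ParseSpec([]byte(raw))
	if err != nil {
		panic(err)
	}
	fmt.Println(spec.Ports, spec.SuccessThreshold,
		spec.MeasurementInterval.Duration, spec.AnalysisPeriod.Duration)
}
```

A malformed value such as "ten seconds" makes ParseSpec return an error, so bad configuration is caught when the resource is decoded.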

Building the workers

Now that we’ve specified the CRD that will drive the canary automation control loop, we need to fill in the custom code for each one of the phases and the transitions between phases.

If we look in $BASE/pkg/workers, we see a package for each of the phases we defined in autopilot.yaml before generating the code for our project. The generated code includes a stub for each worker. For example, in the worker for the Initializing phase, we see:

package initializing

import (
	"context"

	"github.com/go-logr/logr"
	"github.com/solo-io/autopilot/pkg/ezkube"

	v1 "example/pkg/apis/examples/v1"
)

// EDIT THIS FILE! THIS IS SCAFFOLDING FOR YOU TO OWN!

type Worker struct {
	Client ezkube.Client
	Logger logr.Logger
}

func (w *Worker) Sync(ctx context.Context, example *v1.Example) (Outputs, v1.ExamplePhase, *v1.ExampleStatusInfo, error) {
	panic("implement me!")
}

We can fill in the details of our workers, making sure to respect the inputs and outputs that we defined in the autopilot.yaml file. When we return from a worker, we also pass back the next phase that should be triggered, or the current phase to indicate that no transition needs to take place.
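The return convention described above can be sketched with simplified types. This is not the generated worker signature: Metrics, the trimmed-down Worker, and the threshold check are assumptions chosen to show a worker either advancing to the next phase or returning its own phase to stay put.

```go
package main

import "fmt"

// Phase is a hypothetical stand-in for the generated phase type.
type Phase string

const (
	Processing Phase = "Processing"
	Finished   Phase = "Finished"
)

// Metrics is a hypothetical stand-in for the "metrics" input of the
// Processing phase.
type Metrics struct {
	SuccessRate float64 // percentage of successful requests, 0 to 100
}

// Worker is a simplified stand-in for a generated worker struct.
type Worker struct {
	SuccessThreshold float64
}

// Sync inspects the input and returns the phase that should run next.
// Returning the current phase means no transition takes place.
func (w *Worker) Sync(m Metrics) (Phase, error) {
	if m.SuccessRate < 0 || m.SuccessRate > 100 {
		return Processing, fmt.Errorf("success rate %v is out of range", m.SuccessRate)
	}
	if m.SuccessRate >= w.SuccessThreshold {
		return Finished, nil // threshold met: move to the final phase
	}
	return Processing, nil // stay in Processing and measure again later
}

func main() {
	w := &Worker{SuccessThreshold: 99}
	for _, rate := range []float64{80, 99.5} {
		next, err := w.Sync(Metrics{SuccessRate: rate})
		fmt.Println(rate, next, err)
	}
}
```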

In the accompanying video demo, we explore building the workers.

Exploring the scheduler

The scheduler implements the controller-runtime Reconcile() function which watches the CRD and runs a reconciliation against any of the changes. The scheduler determines the current phase and calls the workers. The workers return the next phase, if applicable, and the scheduler calls the next worker. This continues until the control loop reaches the final phase. Here’s an example of the auto-generated scheduler for the Initializing phase:

switch example.Status.Phase {
case "", v1.ExamplePhaseInitializing: // begin worker phase
	logger.Info("Syncing Example in phase Initializing", "name", example.Name)
	worker := &initializing.Worker{
		Client: client,
		Logger: logger,
	}
	outputs, nextPhase, statusInfo, err := worker.Sync(s.ctx, example)
	if err != nil {
		return result, fmt.Errorf("failed to run worker for phase Initializing: %v", err)
	}

	for _, out := range outputs.VirtualServices.Items {
		if err := client.Ensure(s.ctx, example, &out); err != nil {
			return result, fmt.Errorf("failed to write output VirtualService<%v.%v> for phase Initializing: %v", out.GetNamespace(), out.GetName(), err)
		}
	}

	// update the Example status with the worker's results
	example.Status.Phase = nextPhase
	if statusInfo != nil {
		logger.Info("Updating status of primary resource")
		example.Status.ExampleStatusInfo = *statusInfo
	}

Note that the scheduler is auto-generated and is not intended to be edited by hand. Its code is driven by the inputs and outputs specified for each phase in autopilot.yaml.
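The dispatch behaviour described in this section can be approximated in plain Go, independent of controller-runtime. The sketch below is not the generated scheduler: SyncFunc, Reconcile, and RunToFinal are hypothetical names, and the loop with a step limit stands in for repeated reconciliations triggered by the cluster.

```go
package main

import (
	"errors"
	"fmt"
)

// Phase is a hypothetical stand-in for the generated phase type.
type Phase string

const (
	Initializing Phase = "Initializing"
	Processing   Phase = "Processing"
	Finished     Phase = "Finished"
)

// SyncFunc is a simplified, hypothetical worker signature: it returns
// the next phase, or the current phase if nothing should change.
type SyncFunc func() (Phase, error)

// Reconcile runs the worker for the current phase once and returns the
// phase to record in the resource status. An empty phase is treated as
// the initial phase, as in the generated switch statement.
func Reconcile(current Phase, workers map[Phase]SyncFunc) (Phase, error) {
	if current == "" {
		current = Initializing
	}
	if current == Finished {
		return Finished, nil
	}
	worker, ok := workers[current]
	if !ok {
		return current, fmt.Errorf("no worker for phase %q", current)
	}
	next, err := worker()
	if err != nil {
		return current, fmt.Errorf("failed to run worker for phase %s: %w", current, err)
	}
	return next, nil
}

// RunToFinal reconciles repeatedly until the final phase is reached or
// maxSteps reconciliations have run.
func RunToFinal(start Phase, workers map[Phase]SyncFunc, maxSteps int) (Phase, error) {
	phase := start
	for i := 0; i < maxSteps && phase != Finished; i++ {
		next, err := Reconcile(phase, workers)
		if err != nil {
			return phase, err
		}
		phase = next
	}
	if phase != Finished {
		return phase, errors.New("did not reach the final phase within maxSteps")
	}
	return phase, nil
}

func main() {
	remaining := 2 // Processing needs two passes before it finishes
	workers := map[Phase]SyncFunc{
		Initializing: func() (Phase, error) { return Processing, nil },
		Processing: func() (Phase, error) {
			remaining--
			if remaining > 0 {
				return Processing, nil
			}
			return Finished, nil
		},
	}
	fmt.Println(RunToFinal("", workers, 10))
}
```

Keeping the dispatch logic separate from the workers is what allows the scheduler to be regenerated without touching the business logic.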

Once we’ve filled in the workers, we can build and deploy the project with ap build and ap deploy. See the accompanying videos to see how this works.

Follow along with demo!

In this series of videos, we walk through each of these steps and use Autopilot to build a canary automation controller for Istio:

https://www.youtube.com/watch?v=cD74L8cPeBY

https://www.youtube.com/watch?v=5DPs9zksBJg

https://www.youtube.com/watch?v=hcfSHFslo1

https://www.youtube.com/watch?v=h3HUFMP_ej8

For more:

See the following resources for more:

