Zero-downtime Upgrades with Istio 1.5 to 1.6 and Beyond
While the service mesh landscape has changed over the last few years, Istio has significantly grown in popularity. More and more organizations are running Istio in their production environments, and here at Solo.io we have been involved with many different organizations at various stages of their journey into production with Istio.
One of the main challenges of operating Istio is performing control-plane version upgrades, especially while trying to perform those upgrades with no downtime. While significant strides have been made in Istio's upgrade workflow, it remains a frequent source of frustration, even more so when working with older versions of Istio.
Istio 1.5 to 1.6 – a deep dive
Recently, we have been working with a customer currently running Istio 1.5.x in production and looking to upgrade. Given that this is production, the upgrade must be performed with no downtime. While the resulting process was a bit more manual than we would have liked, we were able to upgrade from 1.5.10 to 1.6.14 with no interruption to their customer traffic. In this article we will explore the steps we took for this zero-downtime upgrade of Istio.
Baseline 1.5.x Installation
To begin, let's set up our initial installation of Istio 1.5.10. For demonstration purposes, we will use a very minimal configuration file to give us an effective deployment of istiod and istio-ingressgateway:
# minimal-control-plane.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    accessLogFile: /dev/stdout
  addonComponents:
    prometheus:
      enabled: false
  values:
    pilot:
      autoscaleEnabled: false
    gateways:
      istio-ingressgateway:
        autoscaleEnabled: false
  components:
    pilot:
      enabled: true
      k8s:
        resources:
          requests:
            cpu: 20m
            memory: 100Mi
    ingressGateways:
    - name: istio-ingressgateway
      enabled: true
      k8s:
        resources:
          requests:
            cpu: 20m
            memory: 40Mi
% istioctl manifest generate -f minimal-control-plane.yaml | kubectl apply -f -
% kubectl get deploy -n istio-system
NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
istio-ingressgateway   1/1     1            1           13m
istiod                 1/1     1            1           13m
A crucial piece of actually achieving zero downtime is the use of custom user gateways, that is, ingress gateway deployments that are separate from the default istio-ingressgateway deployment. Luckily for our customer, their initial deployment already contained user gateways. Without this previously in place, it would be much more difficult to actually shift traffic as required.
To install the custom gateway, we will use a separate IstioOperator resource to generate a manifest specifically for the custom gateway and apply it separately:
# minimal-gateways.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: custom-gw
spec:
  profile: empty
  values:
    gateways:
      istio-ingressgateway:
        autoscaleEnabled: false
  components:
    ingressGateways:
    - name: istio-ingressgateway
      enabled: false
    - name: custom-gw
      namespace: default
      enabled: true
      k8s:
        overlays:
        - apiVersion: apps/v1
          kind: Deployment
          name: custom-gw
          patches:
          - path: spec.template.spec.containers[name:istio-proxy].lifecycle
            value:
              preStop:
                exec:
                  command: ["sh", "-c", "sleep 5"]
You will also notice an override on the istio-proxy container to add a preStop hook. This is an extra step for resiliency; see this GitHub issue for a brief discussion.
In this example, we are creating a user gateway named custom-gw in the default namespace. Let's use this config to generate the manifests for the user gateway and apply them.
% istioctl manifest generate -f minimal-gateways.yaml | kubectl apply -f -
Now we can confirm that we have created a Deployment and a Service for this user gateway.
% kubectl get deploy
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
custom-gw        1/1     1            1           19h
details-v1       1/1     1            1           7d20h
productpage-v1   1/1     1            1           7d20h
ratings-v1       1/1     1            1           7d20h
reviews-v1       1/1     1            1           7d20h
reviews-v2       1/1     1            1           7d20h
reviews-v3       1/1     1            1           7d20h
% kubectl get svc
NAME          TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)
custom-gw     LoadBalancer   10.84.26.233   34.69.221.142   15020:30476...
details       ClusterIP      10.84.19.186   <none>          9080/TCP
kubernetes    ClusterIP      10.84.16.1     <none>          443/TCP
productpage   ClusterIP      10.84.16.136   <none>          9080/TCP
ratings       ClusterIP      10.84.24.92    <none>          9080/TCP
reviews       ClusterIP      10.84.29.35    <none>          9080/TCP
The custom-gw Service is where our production traffic is flowing.
Additionally, a Gateway resource has been created that we will associate with our VirtualServices:
% kubectl get gateway -A
NAMESPACE      NAME                   AGE
default        custom-gw              5m4s
istio-system   istio-ingressgateway   22m
To test this out, we will use the standard "bookinfo" application. You can follow the standard documentation to install it.
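For reference, a minimal bookinfo install (a rough sketch, assuming you are in the Istio 1.5.10 release directory and using automatic sidecar injection in the default namespace) looks roughly like this:

% kubectl label namespace default istio-injection=enabled
% kubectl apply -f samples/bookinfo/platform/kube/bookinfo.yaml
% kubectl apply -f samples/bookinfo/networking/bookinfo-gateway.yaml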
However, since we are using a user gateway, let's change the VirtualService to reference it:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: bookinfo
spec:
  hosts:
  - "*"
  gateways:
  - custom-gw
  http:
  - match:
    - uri:
        exact: /productpage
    - uri:
        prefix: /static
    - uri:
        exact: /login
    - uri:
        exact: /logout
    - uri:
        prefix: /api/v1/products
    route:
    - destination:
        host: productpage
        port:
          number: 9080
Now you should be able to access the bookinfo product page through our user gateway, custom-gw. As we upgrade, we need to ensure that the traffic flowing to the custom-gw Service does not get impacted.
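To keep an eye on availability throughout the upgrade, one simple option (a rough sketch, using the external IP of the custom-gw Service shown above) is to poll the product page in a loop and watch the response codes:

% export GATEWAY_IP=$(kubectl get svc custom-gw -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
% while true; do
    # print only the HTTP status code; anything other than 200 indicates an interruption
    curl -s -o /dev/null -w "%{http_code}\n" "http://${GATEWAY_IP}/productpage"
    sleep 1
  done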
Canary installation of 1.6.x
We are now ready to actually perform the upgrade. At a high level, the procedure is as follows:
- Install a canary 1.6 control plane
- Copy the istio ConfigMap for the canary to the namespace containing the user gateways
- Install a second set of user gateways associated with the canary control plane
- Label the new gateway pods so they are selected by the existing user gateway Service
- Scale down the old user gateways
- Perform the rest of your data plane upgrade
- Uninstall the 1.5 control plane via manifests
First we will install the canary control plane. Note that we will be using the 1.6.14 istioctl for these commands.
% istioctl manifest generate -f ../istio-1.5.10/minimal-control-plane.yaml --revision=1-6-14 | kubectl apply -f -
We will generate and apply the Istio 1.6.14 manifests using the same IstioOperator config we used for our initial installation, but will also set the --revision flag in order to generate the various resources needed for a canary instance of the control plane. In this case the revision is 1-6-14.
% kubectl get po -n istio-system
NAME                                    READY   STATUS    RESTARTS   AGE
istio-ingressgateway-7c96949f94-6ggst   1/1     Running   0          109s
istiod-1-6-14-59c9f9fcf5-wmb2b          1/1     Running   0          106s
istiod-5b588486c8-29flb                 1/1     Running   0          67m
After the canary install, we now have a new istiod deployment that contains our revision as a suffix. You can also see that the istio-ingressgateway-7c96949f94-6ggst pod has a similar age to the new istiod pod, i.e. the istio-ingressgateway pod was bounced as part of the canary control plane installation. This is because the manifests generated for the revision do not include a separate istio-ingressgateway for the canary, which is one of several reasons why a true zero-downtime upgrade with the default ingress gateways is not easily achieved. With user-defined gateways we have greater control over the manifests.
The next step is to spin up a set of user gateways that are associated with the newly installed canary control plane.
Before doing this, there is an important detail to consider.
In order for the user gateways to connect to the correct istiod, the gateway deployment must be configured accordingly. For our 1-6-14 revision, the relevant configuration is in the istio-1-6-14 config map stored in the istio-system namespace:
% kubectl get cm -n istio-system istio-1-6-14 -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    istio.io/rev: 1-6-14
    release: istio
  name: istio-1-6-14
  namespace: istio-system
data:
  mesh: |-
    ...
    defaultConfig:
      ...
      discoveryAddress: istiod-1-6-14.istio-system.svc:15012
      ...
    ...
  ...
However, in our scenario, the user gateways are deployed to the default namespace, not the istio-system namespace. Since Deployments can't reference ConfigMaps in a different namespace, we need to copy the ConfigMap to the default namespace. Without this step, the correct configuration will not get mounted into the gateway pods, resulting in the user gateways connecting to the original 1.5 istiod, which is exactly what we are trying to avoid.
A straightforward approach is to simply get the correct config map from the istio-system namespace, clean it up, and apply it to the default namespace:
% kubectl get cm -n istio-system istio-1-6-14 -o yaml > istio-1-6-14-cm.yaml
## clean up state from the config map, such as annotations, resourceVersion, etc.
## change the namespace to 'default'
% kubectl apply -f istio-1-6-14-cm.yaml
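If you prefer not to hand-edit the file, a hypothetical one-liner along these lines (assuming jq is installed) accomplishes the same copy by dropping the cluster-specific metadata and rewriting the namespace in one pipeline:

% kubectl get cm -n istio-system istio-1-6-14 -o json \
    | jq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp, .metadata.selfLink, .metadata.annotations) | .metadata.namespace = "default"' \
    | kubectl apply -f -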
With the config map in place, we are safe to spin up the new user gateway pods. While we do this, we need to ensure the labels on the pods will be selected by the initial custom-gw Service. In other words, the original custom-gw Service will contain gateway pods synced to both the Istio 1.5 and 1.6 control planes.
The following IstioOperator resource will correctly generate the user gateway manifests we need. Of note, we have defined a label override and several patches in order to create a gateway deployment specific to the 1-6-14 revision while still allowing the pods created by this deployment to be part of the original custom-gw Service. If you want to test the uptime, this would be a good time to start generating traffic.
# minimal-gateways-1.6.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: custom-gw
spec:
  profile: empty
  values:
    gateways:
      istio-ingressgateway:
        autoscaleEnabled: false
  components:
    ingressGateways:
    - name: istio-ingressgateway
      enabled: false
    - name: custom-gw-1-6-14
      namespace: default
      enabled: true
      label:
        istio: custom-gw-1-6-14
      k8s:
        overlays:
        - apiVersion: v1
          kind: Service
          name: custom-gw-1-6-14
          patches:
          - path: spec.selector
            value:
              app: istio-ingressgateway
              istio: custom-gw
              service.istio.io/canonical-name: custom-gw-1-6-14
        - apiVersion: apps/v1
          kind: Deployment
          name: custom-gw-1-6-14
          patches:
          - path: spec.template.spec.containers.[name:istio-proxy].args.[custom-gw-1-6-14]
            value: custom-gw
          - path: spec.template.metadata.labels.istio
            value: custom-gw
          - path: spec.selector.matchLabels
            value:
              service.istio.io/canonical-name: custom-gw-1-6-14
              app: istio-ingressgateway
              istio: custom-gw
          - path: spec.template.spec.containers[name:istio-proxy].lifecycle
            value:
              preStop:
                exec:
                  command: ["sh", "-c", "sleep 5"]
% istioctl manifest generate -f ./minimal-gateways-1.6.yaml -r 1-6-14 | kubectl apply -f -
% kubectl get deploy
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
custom-gw          1/1     1            1           19h
custom-gw-1-6-14   1/1     1            1           130m
details-v1         1/1     1            1           7d20h
productpage-v1     1/1     1            1           7d20h
ratings-v1         1/1     1            1           7d20h
reviews-v1         1/1     1            1           7d20h
reviews-v2         1/1     1            1           7d20h
reviews-v3         1/1     1            1           7d20h
% kubectl get svc
NAME               TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)
custom-gw          LoadBalancer   10.84.26.233   34.69.221.142   15020:30476...
custom-gw-1-6-14   LoadBalancer   10.84.21.41    35.224.185.31   15021:32672...
details            ClusterIP      10.84.19.186   <none>          9080/TCP
kubernetes         ClusterIP      10.84.16.1     <none>          443/TCP
productpage        ClusterIP      10.84.16.136   <none>          9080/TCP
ratings            ClusterIP      10.84.24.92    <none>          9080/TCP
reviews            ClusterIP      10.84.29.35    <none>          9080/TCP
Now we have the expected custom-gw-1-6-14 resources. We can confirm that the pod for the custom-gw-1-6-14 deployment is being treated as an endpoint for the initial Service. By checking the Endpoints for the custom-gw Service, we can see that the newly created pod custom-gw-1-6-14-555579d667-wtz56 is included along with custom-gw-657bcffdc6-kvlmz.
% kubectl get endpoints custom-gw -o yaml
apiVersion: v1
kind: Endpoints
metadata:
  ...
  name: custom-gw
  namespace: default
  ...
subsets:
- addresses:
  - ip: 10.76.5.85
    ...
    targetRef:
      kind: Pod
      name: custom-gw-657bcffdc6-kvlmz
      namespace: default
      ...
  - ip: 10.76.5.88
    ...
    targetRef:
      kind: Pod
      name: custom-gw-1-6-14-555579d667-wtz56
      namespace: default
      ...
  ports:
  ...
To confirm our gateways are hooked up to the correct pilot instances, let's check the status of the proxies in our data plane:
% istioctl proxy-status
NAME                                                 ... PILOT                            VERSION
custom-gw-1-6-14-555579d667-wtz56.default            ... istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
custom-gw-657bcffdc6-kvlmz.default                   ... istiod-5b588486c8-29flb          1.5.10
details-v1-78db589446-wcljf.default                  ... istiod-5b588486c8-29flb          1.5.10
istio-ingressgateway-7c96949f94-6ggst.istio-system   ... istiod-5b588486c8-29flb          1.6.14
productpage-v1-7f4cc988c6-n2gf8.default              ... istiod-5b588486c8-29flb          1.5.10
ratings-v1-756b788d54-x62ct.default                  ... istiod-5b588486c8-29flb          1.5.10
reviews-v1-849fcdfd8b-bjd6x.default                  ... istiod-5b588486c8-29flb          1.5.10
reviews-v2-5b6fb6c4fb-pc7k8.default                  ... istiod-5b588486c8-29flb          1.5.10
reviews-v3-7d94d58566-jtqtt.default                  ... istiod-5b588486c8-29flb          1.5.10
Now that we have the new pod taking traffic, we can spin down the original user gateway deployment. Once we have done that, our production ingress traffic will have successfully shifted from a 1.5.x gateway to a 1.6.x gateway.
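Assuming the original user gateway Deployment is still named custom-gw in the default namespace, as in this walkthrough, spinning it down is a single command:

% kubectl scale deployment custom-gw -n default --replicas=0

Before doing this, it is worth double-checking (via istioctl proxy-status or the Endpoints shown above) that the new custom-gw-1-6-14 pod is healthy and receiving traffic.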
To finish the data plane upgrade, we need to inject new sidecars associated with the canary control plane. One method for this is via automatic sidecar injection and correctly labeling a given namespace with the desired control plane revision. For more info, see the official Istio 1.6 documentation for canary upgrades. It’s also important to confirm your deployments are capable of zero-downtime rollouts, whether through the use of preStop hooks, graceful shutdowns, etc.
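As a rough sketch of that approach, assuming automatic injection is in use and the bookinfo workloads live in the default namespace, you would re-label the namespace for the new revision and then restart the workloads so their sidecars are re-injected:

# switch the namespace from the legacy injection label to the revision label
% kubectl label namespace default istio-injection- istio.io/rev=1-6-14
# trigger rolling restarts so the pods pick up the 1.6.14 sidecar
% kubectl rollout restart deployment details-v1 productpage-v1 ratings-v1 reviews-v1 reviews-v2 reviews-v3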
Once the various application pods have been recreated, we should be in a state where all data plane proxies (including the user gateway) are connected solely to the canary pilot instance; all except the stubborn default istio-ingressgateway pod.
% istioctl proxy-status
NAME                                                 ... PILOT                            VERSION
custom-gw-1-6-14-555579d667-wtz56.default            ... istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
details-v1-78db589446-l8l52.default                  ... istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
istio-ingressgateway-7c96949f94-6ggst.istio-system   ... istiod-5b588486c8-29flb          1.6.14
productpage-v1-7f4cc988c6-w7ngw.default              ... istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
ratings-v1-756b788d54-kn9t2.default                  ... istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
reviews-v1-849fcdfd8b-8ksm6.default                  ... istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
reviews-v2-5b6fb6c4fb-ksk2c.default                  ... istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
reviews-v3-7d94d58566-wlg2z.default                  ... istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
As we are not sending application traffic through that gateway, it is safe to bounce the pod, resulting in a fully migrated data plane.
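Bouncing it can be as simple as restarting the deployment (assuming, as in this walkthrough, no application traffic flows through the default gateway):

% kubectl rollout restart deployment istio-ingressgateway -n istio-system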
% istioctl proxy-status
NAME                                                 ... PILOT                            VERSION
custom-gw-1-6-14-555579d667-wtz56.default            ... istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
details-v1-78db589446-l8l52.default                  ... istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
istio-ingressgateway-7c96949f94-8qnsp.istio-system   ... istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
productpage-v1-7f4cc988c6-w7ngw.default              ... istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
ratings-v1-756b788d54-kn9t2.default                  ... istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
reviews-v1-849fcdfd8b-8ksm6.default                  ... istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
reviews-v2-5b6fb6c4fb-ksk2c.default                  ... istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
reviews-v3-7d94d58566-wlg2z.default                  ... istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
Clean up
Finally, we can clean up the original control plane and user gateways, as they are no longer being used. Istio introduced an uninstall mechanism built into istioctl in version 1.7.x; since we are looking specifically at uninstalling the 1.5.x control plane, this won't help us. For this scenario, we need to use the legacy method of generating a manifest based off our original IstioOperator resource and using kubectl delete.
Unfortunately for us, the manifest contains several resources that are still used by the Istio 1.6 installation, such as the custom resource definitions (CRDs) for Istio's configuration. So in order to remove the old control plane, we need to strip the resources still being used by Istio out of the manifest to be sure not to delete them. The easiest approach is to manually prune down the manifest. The details of this are left to the reader, but one example could be:
% istioctl manifest generate -f minimal-control-plane.yaml > istio-1.5-manifest.yaml
# manually clean up the manifest to only include the components you wish to remove
% kubectl delete -f istio-1.5-pruned.yaml
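If you would rather script part of the pruning, one hypothetical approach (assuming yq v4 is available) is to filter shared resources such as the CRDs out of the generated manifest before deleting; other shared resources may need the same treatment, so review the result carefully before running the delete:

% istioctl manifest generate -f minimal-control-plane.yaml \
    | yq eval 'select(.kind != "CustomResourceDefinition")' - > istio-1.5-pruned.yaml
# review istio-1.5-pruned.yaml, then remove the old control plane
% kubectl delete -f istio-1.5-pruned.yaml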
At this point the original control plane has been removed, but there is yet another problem to solve related to Istio's validating webhook. Some care must be taken to ensure the webhook will still work, as this component is not intended to be versioned. Without doing so, you can no longer create new Istio resources such as VirtualServices; in other words, your control plane will be unusable. There are a few ways to work around it, each with their own caveats and subtleties. For more info, you can follow the GitHub issue for this bug. If you are running into issues with this problem, or with filling in the blanks in any of the other steps outlined above, please reach out to us for additional assistance.
Wrapping up
Istio has gone through a few different iterations of installation options and approaches for upgrades. Combined with the various ways customers actually leverage these options, you can imagine how complex an environment like this can look. Operating Istio successfully in production, especially across multiple clusters, zones, regions, or clouds, is no small undertaking. With products such as Gloo Mesh, a production-supported distribution of Istio (with SLAs, LTS, and specialty builds such as FIPS and ARM), we hope to automate away this minutiae and enable you to focus on delivering value.