Zero-downtime Upgrades with Istio 1.5 to 1.6 and Beyond
While the service mesh landscape has changed over the last few years, Istio has significantly grown in popularity. More and more organizations are running Istio in their production environments, and here at Solo.io we have been involved with many different organizations at various stages of their journey into production with Istio.
One of the main challenges with operating Istio is performing control-plane version upgrades, especially when those upgrades must happen with no downtime. While significant strides have been made in Istio’s upgrade workflow, it remains a frequent source of frustration, even more so when working with older versions of Istio.
Istio 1.5 to 1.6 – a deep dive
Recently, we have been working with a customer running Istio 1.5.x in production and looking to upgrade. Given that this is production, the upgrade must be performed with no downtime. While the resulting process involved more manual steps than we would have liked, we were able to upgrade from 1.5.10 to 1.6.14 with no interruption to their customer traffic. In this article we will explore the steps we took for this zero-downtime upgrade of Istio.
Baseline 1.5.x Installation
To begin, let’s set up our initial installation of Istio 1.5.10. For demonstration purposes, we will use a very minimal configuration file to give us an effective deployment of istiod and istio-ingressgateway:
# minimal-control-plane.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    accessLogFile: /dev/stdout
  addonComponents:
    prometheus:
      enabled: false
  values:
    pilot:
      autoscaleEnabled: false
    gateways:
      istio-ingressgateway:
        autoscaleEnabled: false
  components:
    pilot:
      enabled: true
      k8s:
        resources:
          requests:
            cpu: 20m
            memory: 100Mi
    ingressGateways:
    - name: istio-ingressgateway
      enabled: true
      k8s:
        resources:
          requests:
            cpu: 20m
            memory: 40Mi
% istioctl manifest generate -f minimal-control-plane.yaml | kubectl apply -f -
% kubectl get deploy -n istio-system
NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
istio-ingressgateway   1/1     1            1           13m
istiod                 1/1     1            1           13m
A crucial piece of actually achieving zero downtime is the use of custom user gateways, that is, ingress gateway deployments that are separate from the default istio-ingressgateway deployment. Luckily for our customer, their initial deployment contained user gateways. Without these already in place, it would be much more difficult to shift traffic as required.
To install the custom gateway, we will use a separate IstioOperator resource to generate a manifest specifically for the custom gateway and apply it separately:
# minimal-gateways.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: custom-gw
spec:
  profile: empty
  values:
    gateways:
      istio-ingressgateway:
        autoscaleEnabled: false
  components:
    ingressGateways:
    - name: istio-ingressgateway
      enabled: false
    - name: custom-gw
      namespace: default
      enabled: true
      k8s:
        overlays:
        - apiVersion: apps/v1
          kind: Deployment
          name: custom-gw
          patches:
          - path: spec.template.spec.containers[name:istio-proxy].lifecycle
            value:
              preStop:
                exec:
                  command: ["sh", "-c", "sleep 5"]
You will also notice an override to the istio-proxy container to add a preStop hook. This is an extra step for resiliency. See this GitHub issue for a brief discussion.
In this example, we are creating a user gateway named custom-gw in the default namespace. Let’s use this config to generate manifests for the user gateway and apply them.
% istioctl manifest generate -f minimal-gateways.yaml | kubectl apply -f -
Now we can confirm that we have created a Service for this user gateway.
% kubectl get deploy
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
custom-gw        1/1     1            1           19h
details-v1       1/1     1            1           7d20h
productpage-v1   1/1     1            1           7d20h
ratings-v1       1/1     1            1           7d20h
reviews-v1       1/1     1            1           7d20h
reviews-v2       1/1     1            1           7d20h
reviews-v3       1/1     1            1           7d20h
% kubectl get svc
NAME          TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)
custom-gw     LoadBalancer   10.84.26.233   220.127.116.11   15020:30476...
details       ClusterIP      10.84.19.186   <none>           9080/TCP
kubernetes    ClusterIP      10.84.16.1     <none>           443/TCP
productpage   ClusterIP      10.84.16.136   <none>           9080/TCP
ratings       ClusterIP      10.84.24.92    <none>           9080/TCP
reviews       ClusterIP      10.84.29.35    <none>           9080/TCP
The custom-gw Service is where our production traffic is flowing. A Gateway resource has also been created, which we will associate with our VirtualService:
% kubectl get gateway -A
NAMESPACE      NAME                   AGE
default        custom-gw              5m4s
istio-system   istio-ingressgateway   22m
To test this out, we will use the standard “bookinfo” application. You can follow the standard documentation to install.
However, since we are using a user gateway, let’s change the VirtualService to reference it:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: bookinfo
spec:
  hosts:
  - "*"
  gateways:
  - custom-gw
  http:
  - match:
    - uri:
        exact: /productpage
    - uri:
        prefix: /static
    - uri:
        exact: /login
    - uri:
        exact: /logout
    - uri:
        prefix: /api/v1/products
    route:
    - destination:
        host: productpage
        port:
          number: 9080
Now, you should be able to access the product page of bookinfo through our user gateway custom-gw. As we upgrade, we need to ensure that the traffic flowing to the Service does not get impacted.
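One way to watch for impact during the upgrade is to run a simple probe loop against the user gateway in a separate terminal. The following is a minimal sketch, not part of the original procedure: the function name probe_gateway, the one-second default delay, and the five-second timeout are assumptions chosen for illustration.

```shell
# probe_gateway URL COUNT [DELAY_SECONDS]
# Sends COUNT GET requests to URL and reports how many did not return HTTP 200.
# (Hypothetical helper; name and defaults are assumptions.)
probe_gateway() {
  url="$1"
  count="$2"
  delay="${3:-1}"
  failures=0
  i=0
  while [ "$i" -lt "$count" ]; do
    # -s: silent, -o /dev/null: discard the body, -w: print only the status code.
    # On a connection error curl reports 000, which is also counted as a failure.
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")"
    if [ "$code" != "200" ]; then
      failures=$((failures + 1))
      echo "$(date +%T) request $i failed with status $code" >&2
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "requests=$count failures=$failures"
  # Exit status is zero only if every request succeeded.
  [ "$failures" -eq 0 ]
}
```

For example, `probe_gateway "http://$GATEWAY_IP/productpage" 600` would probe for roughly ten minutes, where GATEWAY_IP stands for the external IP of the custom-gw Service.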
Canary installation of 1.6.x
We are now ready to actually perform the upgrade. At a high level the procedure is as follows:
- Install a canary 1.6 control plane
- Copy the ConfigMap for the canary to the namespace containing the user gateways
- Install a second set of user gateways associated with the canary control plane
- Label the new gateway pods so they are selected by the existing user gateway
- Scale down the old user gateways
- Perform the rest of your data plane upgrade
- Uninstall the 1.5 control plane via manifests
First we will install the canary control plane. Note that we will be using the 1.6.14 istioctl for these commands.
% istioctl manifest generate -f ../istio-1.5.10/minimal-control-plane.yaml --revision=1-6-14 | kubectl apply -f -
We generate and apply the Istio 1.6.14 manifests using the same IstioOperator config we used for our initial installation, but also set the --revision flag in order to generate the various resources needed for a canary instance of the control plane. In this case the revision is 1-6-14:
% kubectl get po -n istio-system
NAME                                    READY   STATUS    RESTARTS   AGE
istio-ingressgateway-7c96949f94-6ggst   1/1     Running   0          109s
istiod-1-6-14-59c9f9fcf5-wmb2b          1/1     Running   0          106s
istiod-5b588486c8-29flb                 1/1     Running   0          67m
After the canary install, we now have a new istiod deployment that contains our revision as a suffix. You can also see that the istio-ingressgateway-7c96949f94-6ggst pod has a similar age to the new istiod pod, i.e. the istio-ingressgateway pod was bounced as part of the canary control plane installation. This is because the manifests generated for the revision do not include a separate istio-ingressgateway for the canary revision, one of several reasons why a true zero-downtime upgrade with the default ingress gateways is not easily achieved. With user-defined gateways we have greater control over the manifests.
The next step is to spin up a set of user gateways that are associated with the newly installed canary control plane.
Before doing this, there is an important detail to consider.
In order for the user gateways to connect to the correct istiod, the gateway deployment must be configured accordingly. For our 1-6-14 revision, the relevant configuration is in the istio-1-6-14 config map that is stored in the istio-system namespace:
% kubectl get cm -n istio-system istio-1-6-14 -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    istio.io/rev: 1-6-14
    release: istio
  name: istio-1-6-14
  namespace: istio-system
data:
  mesh: |-
    ...
    defaultConfig:
      ...
      discoveryAddress: istiod-1-6-14.istio-system.svc:15012
      ...
    ...
  ...
However, in our scenario, the user gateways are deployed to the default namespace, not the istio-system namespace. Since Deployments can’t reference ConfigMaps in a different namespace, we need to copy the ConfigMap to the default namespace. Without this step, the correct configuration will not get mounted into the gateway pods, resulting in the user gateways connecting to the original 1.5 istiod, exactly what we are trying to avoid.
A straightforward approach is to simply get the correct config map from the istio-system namespace, clean it up, and apply it to the default namespace:
% kubectl get cm -n istio-system istio-1-6-14 -o yaml > istio-1-6-14-cm.yaml
## clean up state from config map such as annotations, resourceVersion etc.
## change the namespace to 'default' namespace
% kubectl apply -f istio-1-6-14-cm.yaml
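The manual clean-up step could also be scripted. The sketch below is one possible approach, not the method used in the original upgrade: clean_configmap is a hypothetical helper that relies on the two-space layout kubectl uses for its YAML output, and it is a line filter rather than a real YAML parser, so the result is worth reviewing before applying it.

```shell
# clean_configmap TARGET_NAMESPACE
# Reads a ConfigMap (as printed by `kubectl get cm -o yaml`) on stdin, drops
# server-side state from metadata and rewrites metadata.namespace.
# (Hypothetical helper; assumes kubectl's default two-space YAML layout.)
clean_configmap() {
  awk -v ns="$1" '
    # Track whether we are inside the top-level metadata block.
    /^[^ ]/ { in_meta = ($0 ~ /^metadata:/); skipping = 0 }
    # While skipping a removed key, also drop its nested children.
    skipping && /^(    |  - )/ { next }
    { skipping = 0 }
    in_meta && /^  (annotations|managedFields):/ { skipping = 1; next }
    in_meta && /^  (resourceVersion|uid|creationTimestamp|selfLink):/ { next }
    in_meta && /^  namespace:/ { print "  namespace: " ns; next }
    { print }
  '
}
```

A possible usage is `kubectl get cm -n istio-system istio-1-6-14 -o yaml | clean_configmap default > istio-1-6-14-cm.yaml`, followed by the same kubectl apply as above. The data section, including the discoveryAddress, passes through unchanged.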
With the config map in place, we are safe to spin up the new user gateway pods. While we do this, we need to ensure the labels on the new pods will be selected by the initial Service. In other words, the original Service will contain gateway pods synced to both the Istio 1.5 and 1.6 control planes. The following IstioOperator resource will generate the user gateway manifests we need. Of note, we have defined a label override and several patches in order to create a gateway deployment specific to the 1-6-14 revision while still allowing the pods created by this deployment to be a part of the original Service. If you want to test the uptime, this would be a good time to generate traffic.
# minimal-gateways-1.6.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: custom-gw
spec:
  profile: empty
  values:
    gateways:
      istio-ingressgateway:
        autoscaleEnabled: false
  components:
    ingressGateways:
    - name: istio-ingressgateway
      enabled: false
    - name: custom-gw-1-6-14
      namespace: default
      enabled: true
      label:
        istio: custom-gw-1-6-14
      k8s:
        overlays:
        - apiVersion: v1
          kind: Service
          name: custom-gw-1-6-14
          patches:
          - path: spec.selector
            value:
              app: istio-ingressgateway
              istio: custom-gw
              service.istio.io/canonical-name: custom-gw-1-6-14
        - apiVersion: apps/v1
          kind: Deployment
          name: custom-gw-1-6-14
          patches:
          - path: spec.template.spec.containers.[name:istio-proxy].args.[custom-gw-1-6-14]
            value: custom-gw
          - path: spec.template.metadata.labels.istio
            value: custom-gw
          - path: spec.selector.matchLabels
            value:
              service.istio.io/canonical-name: custom-gw-1-6-14
              app: istio-ingressgateway
              istio: custom-gw
          - path: spec.template.spec.containers[name:istio-proxy].lifecycle
            value:
              preStop:
                exec:
                  command: ["sh", "-c", "sleep 5"]
% istioctl manifest generate -f ./minimal-gateways-1.6.yaml -r 1-6-14 | kubectl apply -f -
% kubectl get deploy
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
custom-gw          1/1     1            1           19h
custom-gw-1-6-14   1/1     1            1           130m
details-v1         1/1     1            1           7d20h
productpage-v1     1/1     1            1           7d20h
ratings-v1         1/1     1            1           7d20h
reviews-v1         1/1     1            1           7d20h
reviews-v2         1/1     1            1           7d20h
reviews-v3         1/1     1            1           7d20h
% kubectl get svc
NAME               TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)
custom-gw          LoadBalancer   10.84.26.233   18.104.22.168    15020:30476...
custom-gw-1-6-14   LoadBalancer   10.84.21.41    22.214.171.124   15021:32672...
details            ClusterIP      10.84.19.186   <none>           9080/TCP
kubernetes         ClusterIP      10.84.16.1     <none>           443/TCP
productpage        ClusterIP      10.84.16.136   <none>           9080/TCP
ratings            ClusterIP      10.84.24.92    <none>           9080/TCP
reviews            ClusterIP      10.84.29.35    <none>           9080/TCP
Now we have the expected custom-gw-1-6-14 resources. We can confirm that the pod for custom-gw-1-6-14 is being treated as an endpoint for the initial Service. By checking the Endpoints for the Service, we can see that the newly created pod custom-gw-1-6-14-555579d667-wtz56 is included along with custom-gw-657bcffdc6-kvlmz:
% kubectl get endpoints custom-gw -o yaml
apiVersion: v1
kind: Endpoints
metadata:
  ...
  name: custom-gw
  namespace: default
  ...
subsets:
- addresses:
  - ip: 10.76.5.85
    ...
    targetRef:
      kind: Pod
      name: custom-gw-657bcffdc6-kvlmz
      namespace: default
      ...
  - ip: 10.76.5.88
    ...
    targetRef:
      kind: Pod
      name: custom-gw-1-6-14-555579d667-wtz56
      namespace: default
      ...
  ports:
  ...
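If reading the full Endpoints YAML is inconvenient, the pod names behind the Service can be extracted with a JSONPath query. This is a small sketch; service_backends is a hypothetical helper name, and it lists only ready addresses.

```shell
# service_backends SERVICE NAMESPACE
# Prints the name of each pod currently listed as a ready endpoint of SERVICE.
# (Hypothetical helper wrapping a standard kubectl JSONPath query.)
service_backends() {
  kubectl get endpoints "$1" -n "$2" \
    -o jsonpath='{range .subsets[*].addresses[*]}{.targetRef.name}{"\n"}{end}'
}
```

Running `service_backends custom-gw default` at this point should list both the original custom-gw pod and the new custom-gw-1-6-14 pod.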
To confirm our gateways are hooked up to the correct pilot instances, let’s check the status of the proxies in our data plane:
% istioctl proxy-status
NAME                                                 ...   PILOT                            VERSION
custom-gw-1-6-14-555579d667-wtz56.default            ...   istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
custom-gw-657bcffdc6-kvlmz.default                   ...   istiod-5b588486c8-29flb          1.5.10
details-v1-78db589446-wcljf.default                  ...   istiod-5b588486c8-29flb          1.5.10
istio-ingressgateway-7c96949f94-6ggst.istio-system   ...   istiod-5b588486c8-29flb          1.6.14
productpage-v1-7f4cc988c6-n2gf8.default              ...   istiod-5b588486c8-29flb          1.5.10
ratings-v1-756b788d54-x62ct.default                  ...   istiod-5b588486c8-29flb          1.5.10
reviews-v1-849fcdfd8b-bjd6x.default                  ...   istiod-5b588486c8-29flb          1.5.10
reviews-v2-5b6fb6c4fb-pc7k8.default                  ...   istiod-5b588486c8-29flb          1.5.10
reviews-v3-7d94d58566-jtqtt.default                  ...   istiod-5b588486c8-29flb          1.5.10
Now that we have the new pod taking traffic, we can spin down the original user gateway deployment (custom-gw). Once we have done that, our production ingress traffic will have successfully shifted from a 1.5.x gateway to a 1.6.x gateway.
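The scale-down could be guarded so that the old gateway is only removed once the new one has finished rolling out. The following is a sketch under that assumption; shift_gateway_traffic and the 120-second timeout are illustrative choices, not part of the original procedure.

```shell
# shift_gateway_traffic NAMESPACE OLD_DEPLOYMENT NEW_DEPLOYMENT
# Scales the old gateway to zero only after the new gateway rollout is complete.
# (Hypothetical helper; the timeout value is an assumption.)
shift_gateway_traffic() {
  ns="$1"
  old="$2"
  new="$3"
  # Block until every replica of the new gateway is available, or give up.
  if ! kubectl rollout status "deployment/$new" -n "$ns" --timeout=120s; then
    echo "new gateway $new is not ready; leaving $old untouched" >&2
    return 1
  fi
  # The preStop hook on the old pods gives in-flight requests time to drain.
  kubectl scale "deployment/$old" -n "$ns" --replicas=0
}
```

In this walkthrough the call would be `shift_gateway_traffic default custom-gw custom-gw-1-6-14`.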
To finish the data plane upgrade, we need to inject new sidecars associated with the canary control plane. One method for this is via automatic sidecar injection and correctly labeling a given namespace with the desired control plane revision. For more info, see the official Istio 1.6 documentation for canary upgrades. It’s also important to confirm your deployments are capable of zero-downtime rollouts, whether through the use of preStop hooks, graceful shutdowns, etc.
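As a rough sketch of the namespace-label approach, the helper below swaps the injection label for the revision label and then restarts the named deployments one at a time so their pods are re-injected. The name migrate_sidecars and the 300-second timeout are assumptions; the deployments are listed explicitly so that gateway deployments living in the same namespace are not restarted by accident.

```shell
# migrate_sidecars NAMESPACE REVISION DEPLOYMENT...
# Points sidecar injection for NAMESPACE at the given control plane revision,
# then restarts each listed deployment and waits for it to become ready.
# (Hypothetical helper; the timeout value is an assumption.)
migrate_sidecars() {
  ns="$1"
  rev="$2"
  shift 2
  # Remove the istio-injection label (if present) and set the revision label.
  kubectl label namespace "$ns" istio-injection- "istio.io/rev=$rev" --overwrite || return 1
  for d in "$@"; do
    kubectl rollout restart "deployment/$d" -n "$ns" || return 1
    kubectl rollout status "deployment/$d" -n "$ns" --timeout=300s || return 1
  done
}
```

For bookinfo this might look like `migrate_sidecars default 1-6-14 details-v1 productpage-v1 ratings-v1 reviews-v1 reviews-v2 reviews-v3`.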
Once the various application pods have been recreated, we should be in a state where all data plane proxies (including the user gateway) are solely connected to the canary pilot instance; all except the stubborn default istio-ingressgateway:
% istioctl proxy-status
NAME                                                 ...   PILOT                            VERSION
custom-gw-1-6-14-555579d667-wtz56.default            ...   istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
details-v1-78db589446-l8l52.default                  ...   istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
istio-ingressgateway-7c96949f94-6ggst.istio-system   ...   istiod-5b588486c8-29flb          1.6.14
productpage-v1-7f4cc988c6-w7ngw.default              ...   istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
ratings-v1-756b788d54-kn9t2.default                  ...   istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
reviews-v1-849fcdfd8b-8ksm6.default                  ...   istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
reviews-v2-5b6fb6c4fb-ksk2c.default                  ...   istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
reviews-v3-7d94d58566-wlg2z.default                  ...   istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
As we are not sending application traffic through that gateway, it is safe to bounce the pod, resulting in a fully migrated data plane.
% istioctl proxy-status
NAME                                                 ...   PILOT                            VERSION
custom-gw-1-6-14-555579d667-wtz56.default            ...   istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
details-v1-78db589446-l8l52.default                  ...   istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
istio-ingressgateway-7c96949f94-8qnsp.istio-system   ...   istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
productpage-v1-7f4cc988c6-w7ngw.default              ...   istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
ratings-v1-756b788d54-kn9t2.default                  ...   istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
reviews-v1-849fcdfd8b-8ksm6.default                  ...   istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
reviews-v2-5b6fb6c4fb-ksk2c.default                  ...   istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
reviews-v3-7d94d58566-wlg2z.default                  ...   istiod-1-6-14-59c9f9fcf5-wmb2b   1.6.14
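On a larger mesh, scanning this output by eye gets tedious. A small filter can list the proxies that are still attached to a different istiod. This sketch assumes the layout shown above, where PILOT is the second-to-last column and the canary pods are named istiod-<revision>-...; proxies_not_on_revision is a hypothetical helper name.

```shell
# proxies_not_on_revision REVISION
# Reads `istioctl proxy-status` output on stdin and prints the name of every
# proxy whose PILOT column does not point at the istiod for REVISION.
# (Hypothetical helper; assumes PILOT is the second-to-last column.)
proxies_not_on_revision() {
  awk -v rev="$1" '
    # Skip the header row; compare the PILOT column against "istiod-<rev>-".
    NR > 1 && NF >= 3 && $(NF - 1) !~ ("^istiod-" rev "-") { print $1 }
  '
}
```

For example, `istioctl proxy-status | proxies_not_on_revision 1-6-14` prints nothing once the data plane is fully migrated, which makes it easy to use as a gate before removing the old control plane.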
Finally, we can clean up the original control plane and user gateways as they are no longer being used. Istio introduced an uninstall mechanism built into istioctl with version 1.7.x. Since we are looking specifically at uninstalling the 1.5.x control plane, this won’t help us. For this scenario, we need to use the legacy method of generating a manifest based on our original IstioOperator resource and using kubectl delete.
Unfortunately for us, the manifest contains several resources that are still used by the Istio 1.6 installation, such as the custom resource definitions (CRDs) for Istio’s configuration. So in order to remove the old control plane, we need to identify the resources still being used by Istio and be sure not to remove them. The easiest approach is to manually prune down the manifest. The details of this are left to the reader’s imagination, but one example could be:
% istioctl manifest generate -f minimal-control-plane.yaml > istio-1.5-manifest.yaml
# manually clean up the manifests to only include the components you wish to remove
% kubectl delete -f istio-1.5-pruned.yaml
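A first pass at the pruning could be automated by dropping whole documents of a given kind from the multi-document manifest. The sketch below only illustrates the idea: drop_kinds is a hypothetical helper, and which kinds are shared with the 1.6 installation is an assumption you need to verify for your own cluster (CRDs are the one case named above). The output should still be reviewed by hand before it is passed to kubectl delete.

```shell
# drop_kinds KIND_REGEX
# Reads a multi-document YAML manifest on stdin and writes it back out without
# the documents whose top-level `kind:` matches KIND_REGEX.
# (Hypothetical helper; it is a line filter, not a YAML parser.)
drop_kinds() {
  awk -v kinds="$1" '
    # Emit the buffered document unless it was marked for removal.
    function flush() {
      if (doc != "" && !drop) printf "---\n%s", doc
      doc = ""
      drop = 0
    }
    /^---[ \t]*$/ { flush(); next }
    {
      if ($0 ~ ("^kind:[ \t]*(" kinds ")[ \t]*$")) drop = 1
      doc = doc $0 "\n"
    }
    END { flush() }
  '
}
```

For example, `drop_kinds 'CustomResourceDefinition' < istio-1.5-manifest.yaml > istio-1.5-pruned.yaml` keeps the CRDs out of the delete list; several kinds can be combined with `|` in the pattern.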
At this point the original control plane has been removed, but there is yet another problem to solve related to Istio’s validating webhook. Some care must be taken to ensure the webhook will still work, as this component is not intended to be versioned. If it stops working, you can no longer create new Istio resources such as VirtualServices; in other words, your control plane will be unusable. There are a few ways to work around it, each with its own caveats and subtleties. For more info, you can follow the GitHub issue for this bug. If you are running into this problem or need help filling in the blanks for any of the other steps outlined above, please reach out to us for additional assistance.
Istio has gone through a few different iterations of installation options and approaches for upgrades. Combined with the various ways customers actually use these options, you can imagine how complex an environment like this could look. Operating Istio in production successfully is challenging, especially across multiple clusters, zones, and regions or clouds. With products such as Gloo Mesh, which is a production-supported distribution of Istio (with SLAs, LTS, and specialty builds such as FIPS, ARM, etc.), we hope to automate away these minutiae and enable you to focus on delivering value.