Zero-downtime Upgrades from Istio 1.5 to 1.6 and Beyond

While the service mesh landscape has changed over the last few years, Istio has significantly grown in popularity. More and more organizations are running Istio in their production environments, and here at Solo.io we have been involved with many different organizations at various stages of their journey into production with Istio.

One of the main challenges of operating Istio is performing control-plane version upgrades, especially when those upgrades must happen with no downtime. While significant strides have been made in Istio’s upgrade workflow, it remains a frequent source of frustration, even more so when working with older versions of Istio.

Istio 1.5 to 1.6 – a deep dive

Recently, we have been working with a customer running Istio 1.5.x in production and looking to upgrade. Given that this is production, the upgrade must be performed with no downtime. While the resulting process was a bit more manual than we would have liked, we were able to upgrade from 1.5.10 to 1.6.14 with no interruption to their customer traffic. In this article we will explore the steps we took for this zero-downtime upgrade of Istio.

Baseline 1.5.x Installation

To begin, let’s set up our initial installation of Istio 1.5.10. For demonstration purposes, we will use a very minimal configuration file that gives us an effective deployment of istiod and istio-ingressgateway:

# minimal-control-plane.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    accessLogFile: /dev/stdout
  addonComponents:
    prometheus:
      enabled: false
  values:
    pilot:
      autoscaleEnabled: false
    gateways:
      istio-ingressgateway:
        autoscaleEnabled: false
  components:
    pilot:
      enabled: true
      k8s:
        resources:
          requests:
            cpu: 20m
            memory: 100Mi
    ingressGateways:
    - name: istio-ingressgateway
      enabled: true
      k8s:
        resources:
          requests:
            cpu: 20m
            memory: 40Mi

% istioctl manifest generate -f minimal-control-plane.yaml | kubectl apply -f -
% kubectl get deploy -n istio-system
NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
istio-ingressgateway   1/1     1            1           13m
istiod                 1/1     1            1           13m

A crucial piece of actually achieving zero downtime is the use of custom user gateways, that is, ingress gateway deployments that are separate from the default istio-ingressgateway deployment. Luckily, our customer’s initial deployment already contained user gateways. Had these not already been in place, shifting traffic as required would have been much more difficult.

To install the custom gateway, we will use a separate IstioOperator resource to generate a manifest specifically for the custom gateway and apply it separately:

# minimal-gateways.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: custom-gw
spec:
  profile: empty
  values:
    gateways:
      istio-ingressgateway:
        autoscaleEnabled: false
  components:
    ingressGateways:
    - name: istio-ingressgateway
      enabled: false
    - name: custom-gw
      namespace: default
      enabled: true
      k8s:
        overlays:
        - apiVersion: apps/v1
          kind: Deployment
          name: custom-gw
          patches:
          - path: spec.template.spec.containers[name:istio-proxy].lifecycle
            value:
               preStop:
                 exec:
                   command: ["sh", "-c", "sleep 5"]

You will also notice an overlay on the istio-proxy container that adds a preStop hook. This is an extra step for resiliency during pod termination; see this GitHub issue for a brief discussion.

In this example, we are creating a user gateway named custom-gw in the default namespace. Let’s use this config to generate the manifests for the user gateway and apply them.

% istioctl manifest generate -f minimal-gateways.yaml | kubectl apply -f -

Now we can confirm that we have created a Deployment and Service for this user gateway.

% kubectl get deploy
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
custom-gw          1/1     1            1           19h
details-v1         1/1     1            1           7d20h
productpage-v1     1/1     1            1           7d20h
ratings-v1         1/1     1            1           7d20h
reviews-v1         1/1     1            1           7d20h
reviews-v2         1/1     1            1           7d20h
reviews-v3         1/1     1            1           7d20h
% kubectl get svc
NAME               TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)
custom-gw          LoadBalancer   10.84.26.233   34.69.221.142   15020:30476...
details            ClusterIP      10.84.19.186                   9080/TCP
kubernetes         ClusterIP      10.84.16.1                     443/TCP
productpage        ClusterIP      10.84.16.136                   9080/TCP
ratings            ClusterIP      10.84.24.92                    9080/TCP
reviews            ClusterIP      10.84.29.35                    9080/TCP

The custom-gw Service is where our production traffic is flowing.

Additionally, a Gateway resource has been created that we will associate with our VirtualServices:

% kubectl get gateway -A
NAMESPACE      NAME                   AGE
default        custom-gw              5m4s
istio-system   istio-ingressgateway   22m
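The Gateway resource is what binds our VirtualServices to the gateway pods: its selector must match the labels on the custom-gw pods, and its servers block defines the ports and hosts the gateway accepts. If you want to see exactly what was generated for custom-gw, inspect it directly:

% kubectl get gateway custom-gw -n default -o yaml

The selector should line up with the istio: custom-gw label on the gateway pods (the same label we will deliberately carry over to the 1.6 gateway later), but it is worth verifying against your own output.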

To test this out, we will use the standard “bookinfo” sample application; you can follow the standard documentation to install it.
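If you need to install bookinfo from scratch, the flow from the Istio documentation looks roughly like the following (paths are relative to the extracted 1.5.10 release archive):

## enable automatic sidecar injection for the default namespace
% kubectl label namespace default istio-injection=enabled
## deploy the bookinfo sample application
% kubectl apply -f samples/bookinfo/platform/kube/bookinfo.yaml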

However, since we are using a user gateway, let’s change the VirtualService to reference it:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: bookinfo
spec:
  hosts:
  - "*"
  gateways:
  - custom-gw
  http:
  - match:
    - uri:
        exact: /productpage
    - uri:
        prefix: /static
    - uri:
        exact: /login
    - uri:
        exact: /logout
    - uri:
        prefix: /api/v1/products
    route:
    - destination:
        host: productpage
        port:
          number: 9080

Now you should be able to access the bookinfo product page through our user gateway, custom-gw. As we upgrade, we need to ensure that traffic flowing to the custom-gw Service is not impacted.
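A quick way to verify this, and to keep verifying it throughout the upgrade, is to curl the product page through the external IP of the custom-gw Service (this assumes the gateway is serving plain HTTP on port 80, as in the default configuration):

## grab the external IP of the user gateway Service
% export GATEWAY_IP=$(kubectl get svc custom-gw -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
## a single request should return a 200
% curl -s -o /dev/null -w "%{http_code}\n" http://${GATEWAY_IP}/productpage
## or run it in a loop to watch for non-200 responses while we shift traffic later on
% while true; do curl -s -o /dev/null -w "%{http_code}\n" http://${GATEWAY_IP}/productpage; sleep 1; done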

Canary installation of 1.6.x

We are now ready to actually perform the upgrade. At a high level, the procedure is as follows:

  1. Install a canary 1.6 control plane
  2. Copy the istio ConfigMap for the canary to the namespace containing the user gateways
  3. Install a second set of user gateways associated with the canary control plane
  4. Label the new gateway pods so they are selected by the existing user gateway Service
  5. Scale down the old user gateways
  6. Perform the rest of your data plane upgrade
  7. Uninstall the 1.5 control plane via manifests

First we will install the canary control plane. Note that we will be using the 1.6.14 istioctl for these commands.

% istioctl manifest generate -f ../istio-1.5.10/minimal-control-plane.yaml --revision=1-6-14 | kubectl apply -f -

We generate and apply the Istio 1.6.14 manifests using the same IstioOperator config as our initial installation, but we also set the --revision flag in order to generate the various resources needed for a canary instance of the control plane. In this case, the revision is 1-6-14.

% kubectl get po -n istio-system
NAME                                    READY   STATUS    RESTARTS   AGE
istio-ingressgateway-7c96949f94-6ggst   1/1     Running   0          109s
istiod-1-6-14-59c9f9fcf5-wmb2b          1/1     Running   0          106s
istiod-5b588486c8-29flb                 1/1     Running   0          67m

After the canary install, we have a new istiod deployment with our revision as a suffix. You can also see that the istio-ingressgateway-7c96949f94-6ggst pod is roughly the same age as the new istiod pod; in other words, the istio-ingressgateway pod was bounced as part of the canary control plane installation. This is because the manifests generated for the revision do not include a separate istio-ingressgateway for the canary revision, which is one of several reasons why a true zero-downtime upgrade is not easily achieved with the default ingress gateways. With user-defined gateways, we have far greater control over the manifests.

The next step is to spin up a set of user gateways associated with the newly installed canary control plane. Before doing this, there is an important detail to consider.

In order for the user gateways to connect to the correct istiod, the gateway deployment must be configured accordingly. For our 1-6-14 revision, the relevant configuration lives in the istio-1-6-14 ConfigMap stored in the istio-system namespace:

% kubectl get cm -n istio-system istio-1-6-14 -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    istio.io/rev: 1-6-14
    release: istio
  name: istio-1-6-14
  namespace: istio-system
data:
  mesh: |-
    ...
    defaultConfig:
      ...
      discoveryAddress: istiod-1-6-14.istio-system.svc:15012
      ...
    ...
  ...

However, in our scenario, the user gateways are deployed to the default namespace, not the istio-system namespace. Since Deployments cannot reference ConfigMaps in a different namespace, we need to copy the ConfigMap to the default namespace. Without this step, the correct configuration will not be mounted into the gateway pods, and the user gateways will connect to the original 1.5 istiod, which is exactly what we are trying to avoid.

A straightforward approach is to get the ConfigMap from the istio-system namespace, clean it up, and apply it to the default namespace:

% kubectl get cm -n istio-system istio-1-6-14 -o yaml > istio-1-6-14-cm.yaml

## clean up state from config map such as annotations, resourceVersion etc.
## change the namespace to 'default' namespace

% kubectl apply -f istio-1-6-14-cm.yaml
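If you would rather script that step, something along these lines works as a rough sketch: strip the namespaced metadata, rewrite the namespace, and re-apply. Double-check the resulting YAML before applying it, since the exact fields present will vary by cluster version.

% kubectl get cm -n istio-system istio-1-6-14 -o yaml \
    | sed -e '/resourceVersion:/d' -e '/uid:/d' -e '/creationTimestamp:/d' -e '/selfLink:/d' \
    | sed -e 's/namespace: istio-system/namespace: default/' \
    | kubectl apply -f -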

With the ConfigMap in place, we can safely spin up the new user gateway pods. While we do this, we need to ensure the labels on the new pods will be selected by the original custom-gw Service. In other words, the original custom-gw Service will contain gateway pods synced to both the Istio 1.5 and 1.6 control planes.

The following IstioOperator resource will correctly generate the user gateway manifests we need. Of note, we have defined a label override and several patches in order to create a gateway deployment specific to the 1-6-14 revision while still allowing the pods created by this deployment to be part of the original custom-gw Service. If you want to verify uptime, this is a good time to start generating traffic (for example, with the curl loop shown earlier).

# minimal-gateways-1.6.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: custom-gw
spec:
  profile: empty
  values:
    gateways:
      istio-ingressgateway:
        autoscaleEnabled: false
  components:
    ingressGateways:
    - name: istio-ingressgateway
      enabled: false
    - name: custom-gw-1-6-14
      namespace: default
      enabled: true
      label:
        istio: custom-gw-1-6-14
      k8s:
        overlays:
        - apiVersion: v1
          kind: Service
          name: custom-gw-1-6-14
          patches:
          - path: spec.selector
            value:
              app: istio-ingressgateway
              istio: custom-gw
              service.istio.io/canonical-name: custom-gw-1-6-14
        - apiVersion: apps/v1
          kind: Deployment
          name: custom-gw-1-6-14
          patches:
          - path: spec.template.spec.containers.[name:istio-proxy].args.[custom-gw-1-6-14]
            value: custom-gw
          - path: spec.template.metadata.labels.istio
            value: custom-gw
          - path: spec.selector.matchLabels
            value:
              service.istio.io/canonical-name: custom-gw-1-6-14
              app: istio-ingressgateway
              istio: custom-gw
          - path: spec.template.spec.containers[name:istio-proxy].lifecycle
            value:
               preStop:
                 exec:
                   command: ["sh", "-c", "sleep 5"]

% istioctl manifest generate -f ./minimal-gateways-1.6.yaml -r 1-6-14 | kubectl apply -f -
% kubectl get deploy
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
custom-gw          1/1     1            1           19h
custom-gw-1-6-14   1/1     1            1           130m
details-v1         1/1     1            1           7d20h
productpage-v1     1/1     1            1           7d20h
ratings-v1         1/1     1            1           7d20h
reviews-v1         1/1     1            1           7d20h
reviews-v2         1/1     1            1           7d20h
reviews-v3         1/1     1            1           7d20h
% kubectl get svc
NAME               TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)
custom-gw          LoadBalancer   10.84.26.233   34.69.221.142   15020:30476...
custom-gw-1-6-14   LoadBalancer   10.84.21.41    35.224.185.31   15021:32672...
details            ClusterIP      10.84.19.186                   9080/TCP
kubernetes         ClusterIP      10.84.16.1                     443/TCP
productpage        ClusterIP      10.84.16.136                   9080/TCP
ratings            ClusterIP      10.84.24.92                    9080/TCP
reviews            ClusterIP      10.84.29.35                    9080/TCP

Now we have the expected custom-gw-1-6-14 resources. We can also confirm that the new custom-gw-1-6-14 pod is being treated as an endpoint of the original Service: checking the Endpoints for the custom-gw Service, we see the newly created pod custom-gw-1-6-14-555579d667-wtz56 included alongside custom-gw-657bcffdc6-kvlmz.

% kubectl get endpoints custom-gw -o yaml
apiVersion: v1
kind: Endpoints
metadata:
  ...
  name: custom-gw
  namespace: default
  ...
subsets:
- addresses:
  - ip: 10.76.5.85
    ...
    targetRef:
      kind: Pod
      name: custom-gw-657bcffdc6-kvlmz
      namespace: default
      ...
  - ip: 10.76.5.88
    ...
    targetRef:
      kind: Pod
      name: custom-gw-1-6-14-555579d667-wtz56
      namespace: default
      ...
  ports:
  ...

To confirm our gateways are hooked up to the correct pilot instances, let’s check the status of the proxies in our data plane:

% istioctl proxy-status
NAME                                                   ...     PILOT                              VERSION
custom-gw-1-6-14-555579d667-wtz56.default              ...     istiod-1-6-14-59c9f9fcf5-wmb2b     1.6.14
custom-gw-657bcffdc6-kvlmz.default                     ...     istiod-5b588486c8-29flb            1.5.10
details-v1-78db589446-wcljf.default                    ...     istiod-5b588486c8-29flb            1.5.10
istio-ingressgateway-7c96949f94-6ggst.istio-system     ...     istiod-5b588486c8-29flb            1.6.14
productpage-v1-7f4cc988c6-n2gf8.default                ...     istiod-5b588486c8-29flb            1.5.10
ratings-v1-756b788d54-x62ct.default                    ...     istiod-5b588486c8-29flb            1.5.10
reviews-v1-849fcdfd8b-bjd6x.default                    ...     istiod-5b588486c8-29flb            1.5.10
reviews-v2-5b6fb6c4fb-pc7k8.default                    ...     istiod-5b588486c8-29flb            1.5.10
reviews-v3-7d94d58566-jtqtt.default                    ...     istiod-5b588486c8-29flb            1.5.10

Now that the new pod is taking traffic, we can spin down the original custom-gw deployment. Once we have done that, our production ingress traffic will have successfully shifted from a 1.5.x gateway to a 1.6.x gateway.
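Scaling the original deployment down to zero is enough here; thanks to the preStop hook we added earlier, the old pods get a few seconds to drain before terminating.

% kubectl scale deployment custom-gw --replicas=0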

To finish the data plane upgrade, we need to inject new sidecars associated with the canary control plane. One method is automatic sidecar injection combined with labeling a given namespace with the desired control plane revision; for more info, see the official Istio 1.6 documentation for canary upgrades. It is also important to confirm that your deployments are capable of zero-downtime rollouts, whether through preStop hooks, graceful shutdown, or similar measures.
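With revision-based injection in Istio 1.6, that boils down to something like the following for our default namespace (the deployment names shown are the bookinfo ones; adapt them to your own workloads):

## swap the legacy injection label for the 1-6-14 revision label
% kubectl label namespace default istio-injection-
% kubectl label namespace default istio.io/rev=1-6-14
## restart the workloads so they are re-injected with sidecars pointing at the canary control plane
% kubectl rollout restart deployment details-v1 productpage-v1 ratings-v1 reviews-v1 reviews-v2 reviews-v3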

Once the various application pods have been recreated, all data plane proxies (including the user gateway) should be connected solely to the canary pilot instance; all except the stubborn default istio-ingressgateway pod.

% istioctl proxy-status
NAME                                                   ...     PILOT                              VERSION
custom-gw-1-6-14-555579d667-wtz56.default              ...     istiod-1-6-14-59c9f9fcf5-wmb2b     1.6.14
details-v1-78db589446-l8l52.default                    ...     istiod-1-6-14-59c9f9fcf5-wmb2b     1.6.14
istio-ingressgateway-7c96949f94-6ggst.istio-system     ...     istiod-5b588486c8-29flb            1.6.14
productpage-v1-7f4cc988c6-w7ngw.default                ...     istiod-1-6-14-59c9f9fcf5-wmb2b     1.6.14
ratings-v1-756b788d54-kn9t2.default                    ...     istiod-1-6-14-59c9f9fcf5-wmb2b     1.6.14
reviews-v1-849fcdfd8b-8ksm6.default                    ...     istiod-1-6-14-59c9f9fcf5-wmb2b     1.6.14
reviews-v2-5b6fb6c4fb-ksk2c.default                    ...     istiod-1-6-14-59c9f9fcf5-wmb2b     1.6.14
reviews-v3-7d94d58566-wlg2z.default                    ...     istiod-1-6-14-59c9f9fcf5-wmb2b     1.6.14

As we are not sending application traffic through that gateway, it is safe to bounce the pod, resulting in a fully migrated data plane.
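A rolling restart of the deployment (or simply deleting the pod) does the trick; afterwards, proxy-status shows everything connected to the canary control plane.

% kubectl rollout restart deployment istio-ingressgateway -n istio-system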

% istioctl proxy-status
NAME                                                   ...     PILOT                              VERSION
custom-gw-1-6-14-555579d667-wtz56.default              ...     istiod-1-6-14-59c9f9fcf5-wmb2b     1.6.14
details-v1-78db589446-l8l52.default                    ...     istiod-1-6-14-59c9f9fcf5-wmb2b     1.6.14
istio-ingressgateway-7c96949f94-8qnsp.istio-system     ...     istiod-1-6-14-59c9f9fcf5-wmb2b     1.6.14
productpage-v1-7f4cc988c6-w7ngw.default                ...     istiod-1-6-14-59c9f9fcf5-wmb2b     1.6.14
ratings-v1-756b788d54-kn9t2.default                    ...     istiod-1-6-14-59c9f9fcf5-wmb2b     1.6.14
reviews-v1-849fcdfd8b-8ksm6.default                    ...     istiod-1-6-14-59c9f9fcf5-wmb2b     1.6.14
reviews-v2-5b6fb6c4fb-ksk2c.default                    ...     istiod-1-6-14-59c9f9fcf5-wmb2b     1.6.14
reviews-v3-7d94d58566-wlg2z.default                    ...     istiod-1-6-14-59c9f9fcf5-wmb2b     1.6.14

Clean up

Finally, we can clean up the original control plane and user gateways, as they are no longer being used. Istio introduced an uninstall mechanism built into istioctl with version 1.7.x, but since we are uninstalling a 1.5.x control plane, it won’t help us here. For this scenario, we need to use the legacy method of generating a manifest from our original IstioOperator resource and running kubectl delete against it.
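As an aside for future upgrades: once you are running 1.7 or later, removing an old revision should look roughly like the following (check istioctl x uninstall --help for the exact flags in your version).

% istioctl x uninstall --revision=<old-revision>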

Unfortunately for us, the manifest contains several resources that are still used by the Istio 1.6 installation, such as the custom resource definitions (CRDs) for Istio’s configuration. In order to remove the old control plane, we need to identify the resources still in use by Istio and be sure not to remove them. The easiest approach is to manually prune down the manifest. The details are left as an exercise for the reader, but one example could be:

% istioctl manifest generate -f minimal-control-plane.yaml > istio-1.5-manifest.yaml

# manually clean up the manifests to only include the components you wish to remove

% kubectl delete -f istio-1.5-pruned.yaml
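To help decide what is safe to prune, one option is to diff the resource kinds and names produced by the two istioctl versions; this gives you a starting point for seeing what only exists in the 1.5 manifest versus what the 1.6 installation also generates. The binary names below are illustrative, assuming you keep both versions of istioctl around.

% istioctl-1.5.10 manifest generate -f minimal-control-plane.yaml > istio-1.5-manifest.yaml
% istioctl-1.6.14 manifest generate -f minimal-control-plane.yaml --revision=1-6-14 > istio-1.6-manifest.yaml
% diff <(grep -E '^kind:|^  name:' istio-1.5-manifest.yaml) \
       <(grep -E '^kind:|^  name:' istio-1.6-manifest.yaml)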

At this point the original control plane has been removed, but there is yet another problem to solve related to Istio’s validating webhook. Some care must be taken to ensure the webhook still works, as this component is not intended to be versioned. Without this, you can no longer create new Istio resources such as VirtualServices; in other words, your control plane will be unusable. There are a few ways to work around it, each with its own caveats and subtleties; for more info, you can follow the GitHub issue for this bug. If you are running into this problem, or need help filling in the blanks on any of the other steps outlined above, please reach out to us for additional assistance.

Wrapping up

Istio has gone through a few different iterations of installation options and approaches to upgrades. Combined with the various ways customers actually leverage these options, you can imagine how complex an environment like this can look, and operating Istio successfully in production is a significant undertaking, especially across multiple clusters, zones, regions, or clouds. With products such as Gloo Mesh, a production-supported distribution of Istio (with SLAs, LTS, and specialty builds such as FIPS and ARM), we hope to automate away these minutiae and enable you to focus on delivering value.