Upgrading Istio without Downtime

This blog originally appeared on The New Stack.

As of writing this blog, Istio is about to unveil its 1.11 release, so it’s a good time to talk about a method for upgrading Istio without downtime. If you have been working with Istio for some time and have it deployed to your environments, you might be running a significantly older version. In just two weeks, on Aug 18th, 2021, Istio 1.9 will no longer be supported. This means that older versions are no longer receiving the critical patches and updates that help keep you secure. Solo.io does support older versions of Istio via Gloo Mesh, which provides Long Term Support (LTS) for the current release and the previous four releases.

Nevertheless, upgrading is a good idea, so let’s pay down some of that technical debt and upgrade your Istio deployment so that your applications can take advantage of the latest features and stay secure. This blog explains an architecture and process that will help you get your Istio deployment back into compliance and set you up for easier upgrades in the future.

Upgrading one version at a time

[Image: Istio.io upgrade warning]

Istio recommends that you upgrade one minor version at a time, with one exception: from 1.8 you can skip 1.9 and upgrade directly to 1.10. This means that if you are still on Istio 1.6, you are expected to upgrade three times to get to 1.10 (1.6→1.7→1.8→1.10). With the architecture laid out below, you may be able to skip versions because you can test them side by side. This can save you a number of hours and allow you to catch up safely and more efficiently.

The two failure modes for Istio

First, let’s talk about the two main ways an Istio outage can impact your workloads.

The first failure mode to discuss is the loss of configuration propagation to the Istio sidecars. If your istio-agent sidecars lose the ability to communicate with Istiod, or are incompatible with the configuration being sent, your workloads will not be able to join or communicate with the mesh. This can even impact existing workloads, since endpoint discovery will not be up to date and you may try to reach workloads that no longer exist. New workloads will not be able to join at all and will remain down until the issue is resolved. Because of this type of outage, it is recommended that istio-agents match and retain the same version as the control plane (Istiod). It also makes sense that during an upgrade the existing control plane deployment remain in place rather than being upgraded directly; a blue/green deployment of the control plane is a key step toward upgrading Istio without downtime.
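
To spot this failure mode early, it helps to confirm which control plane version each sidecar is actually connected to. Below is a minimal check using standard istioctl commands, assuming istioctl is installed and pointed at your cluster.

# Show control plane and data plane versions to spot version skew
istioctl version

# Show which Istiod instance each proxy is connected to and whether its config is SYNCED
istioctl proxy-status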

[Image: Blue / Green Istio Control Plane Deployment]

The second failure mode, and often the more critical one, is loss of traffic flow through the ingress gateway. Unlike a loss of the control plane, an outage in the ingress gateway has an immediate impact on your end users. Since this is a critical path for the flow of traffic, extra care should be taken when upgrading Istio without downtime, including being able to fall back to the existing gateway if the upgrade fails. That is why it is also recommended to blue/green the ingress gateway deployments. Shown below is an example upgrade using a separately managed Kubernetes LoadBalancer service that can select either the “Blue” or “Green” ingress gateway for traffic flow.

[Image: Blue / Green Gateway Upgrade]

An architecture for upgrading Istio without downtime

Extending the mitigations for these two failure domains, we can show how some of the newer Istio features help with deploying and upgrading Istio without downtime. This solution relies heavily on the Istio canary deployment feature. Introduced in 1.6, it allows us to deploy multiple revisions of the Istio control plane side by side and migrate workloads between them. We can use the same mechanism to canary deploy ingress gateways, fronted by our own managed LoadBalancer service.
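
Once two revisions are installed, you can see them running side by side. A rough sketch, assuming revision names like 1-9-7 and 1-10-3 from the examples below:

# List the Istiod deployments and their revisions
kubectl -n istio-system get deployments -L istio.io/rev

# List the per-revision sidecar injection webhooks
kubectl get mutatingwebhookconfigurations -L istio.io/rev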

[Image: Zero downtime Istio deployment architecture]

The current best approach is to use the Istio operator to deploy the Istio components via IstioOperator configurations. Because of operator compatibility problems between versions, we have to deploy a new operator for every version upgrade. Given that constraint, it is probably just as easy to deploy Istio via Helm, and we may recommend that in the future; it is not recommended today because of the convenience the IstioOperator CRD offers over the traditional Helm values file. Once we have the operators deployed, we can deploy our multiple IstioOperator configurations (one for Istiod and one for each gateway). Example configurations are shown below.
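
One way to deploy a version-specific operator per revision is to run the istioctl binary that matches each release. This is only a sketch; the binary names and the 1-10-3 revision are placeholders for your own versions.

# Deploy a revisioned operator for each Istio version (binary names are placeholders)
istioctl-1.9.7 operator init --revision 1-9-7
istioctl-1.10.3 operator init --revision 1-10-3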

Example Istiod deployment with a revision label. It would be deployed by the 1-9-7 Istio operator.

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: 1-9-7
  namespace: istio-system
spec:
  profile: minimal
  tag: 1.9.7
  revision: 1-9-7

  # Traffic management feature
  components:
    # Istio Gateway feature
    # Disable gateways deployments because they will be in separate IstioOperator configs
    ingressGateways:
    - name: istio-ingressgateway
      enabled: false
    - name: istio-eastwestgateway
      enabled: false
    egressGateways:
    - name: istio-egressgateway
      enabled: false
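
Once the operator reconciles this configuration, you should see a revisioned Istiod running alongside any existing control plane. A quick check, using the 1-9-7 revision from the example above:

# Confirm the revisioned control plane is up
kubectl -n istio-system get pods -l app=istiod -L istio.io/rev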

Example gateway deployment with a custom LoadBalancer service. We can use the service selector to blue/green future versions of the gateway deployment. Note: we must configure the Istio-created ingress gateway service as a ClusterIP service so it will not create its own load balancer.

apiVersion: v1
kind: Service
metadata:
  name: istio-ingressgateway
  namespace: istio-gateways
spec:
  type: LoadBalancer
  selector:
    istio: ingressgateway
    # select the 1-9-7 revision
    version: 1-9-7
  ports:
  - name: status-port
    port: 15021
    targetPort: 15021
  - name: http2
    port: 80
    targetPort: 8080
  - name: https
    port: 443
    targetPort: 8443
  - name: tcp
    port: 31400
    targetPort: 31400
  - name: tls
    port: 15443
    targetPort: 15443
---
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: ingress-gateway-1-9-7
  namespace: istio-gateways
spec:
  profile: empty
  tag: 1.9.7
  revision: 1-9-7
  components:
    ingressGateways:
      - name: istio-ingressgateway-1-9-7
        namespace: istio-gateways
        enabled: true
        label:
          istio: ingressgateway
          version: 1-9-7
          app: istio-ingressgateway
        k8s:
          service:
            # Since we created our own LoadBalanced service, tell istio to create a ClusterIP service for this gateway
            type: ClusterIP
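
With this in place, cutting traffic over to a newer gateway (for example, a hypothetical 1-10-3 revision) is just a selector change on the LoadBalancer service, and falling back is a matter of reverting it:

# Point the external LoadBalancer service at the new (green) gateway deployment
kubectl -n istio-gateways patch service istio-ingressgateway --type merge \
  -p '{"spec":{"selector":{"istio":"ingressgateway","version":"1-10-3"}}}'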

How to migrate your sidecars

Once your new Istio control plane has been deployed, you can migrate your application workloads to the new Istiod deployment. If you are already using revisions and the revision label, it should be as simple as updating the namespace label istio.io/rev=<new_revision>. Then you will need to recreate your pods to pick up the updated proxy sidecars.
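
For example, to point a namespace at a new revision (the nginx namespace and 1-10-3 revision are placeholders):

# Switch sidecar injection for the namespace to the new revision
kubectl label namespace nginx istio.io/rev=1-10-3 --overwrite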

Example rolling restart command to update the sidecar version.

kubectl rollout restart deployment/nginx -n nginx
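
Once the pods have rolled, you can verify that the new sidecar version was injected. A quick check, using the nginx namespace from the example above:

# Show the istio-proxy image running in each pod
kubectl -n nginx get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[?(@.name=="istio-proxy")].image}{"\n"}{end}'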

Migrating to revisions

If you are not currently using revisions or canary deployments, it is still easy, and recommended, to migrate to them. The pattern for the migration is strikingly similar to upgrading between versions. We recommend deploying the same Istio version next to your existing deployment, but with a revision label added. Then you can migrate your application sidecars at your leisure by removing the istio-injection=enabled label and adding the new revision label istio.io/rev=<revision>, as sketched below.
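
A sketch of that relabel, using nginx as a placeholder namespace and 1-9-7 as the revision:

# Remove the legacy injection label and switch the namespace to revision-based injection
kubectl label namespace nginx istio-injection- istio.io/rev=1-9-7

# Recreate the pods so they pick up sidecars from the revisioned control plane
kubectl rollout restart deployment -n nginx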

Migrating the gateways may prove to be more difficult, however. If your current Istio deployment owns the LoadBalancer service, you will have to take extra care when removing the existing infrastructure. In some cases it may be easier to migrate to a new LoadBalancer service.

Try out our “upgrading Istio without downtime” demo!

If you are interested in trying this for yourself, we have created a lab to test it. We deploy two versions of Istio and migrate the Bookinfo applications while sending traffic to them. Once complete, we take a look at that traffic to make sure everything worked without issue.

Istio Upgrade Demo

[Image: Istio upgrade without downtime graph]

Of course, you don’t have to go it alone!

Solo offers enterprise production long-term support (LTS) for Istio and fixes CVEs quickly with patches and backports.