Istio Multi-Cluster on Red Hat OpenShift
Many of you have chosen Red Hat OpenShift to orchestrate Kubernetes containers on-premises. At the same time, most of you are also adopting (or about to adopt) a service mesh to connect your containerized applications, and Istio is becoming the de facto industry standard for service mesh management. Open source Istio has many useful features, including service-to-service communications with mutual Transport Layer Security (mTLS), enabling canary deployments of new software builds, and providing telemetry data for observability. As your reliance on Istio within OpenShift increases, you’ll very quickly realize that to avoid unplanned interruptions you should run Istio on multiple clusters and set up cross-cluster communication and failover. These features increase reliability and reduce your risk.
Gloo Mesh is a management plane that simplifies operations and workflows of service mesh installations across multiple clusters and deployment footprints, building on the strengths of Istio. With Gloo Mesh, you can install, discover, and operate a service-mesh deployment across your enterprise, whether deployed on premises or in the cloud, even across heterogeneous service-mesh implementations. In this blog, I’ll explain how to deploy Istio (1.9) on multiple OpenShift (4.6.22) clusters on IBM Cloud and how to leverage Gloo Mesh for:
- mTLS between pods running on different clusters
- Locality-based failover
- Global access control
- Global observability
Preparation
First of all, we need a few OpenShift clusters, three in fact. We’ll deploy the management plane, Gloo Mesh, on one of these clusters and Istio on the other two clusters. Be aware that Gloo Mesh could also be deployed on one of these Istio clusters, but here we’ll deploy it on a separate cluster to show that it doesn’t depend on Istio.
Let’s deploy these three OpenShift clusters using IBM Cloud. It’s easy, and our cost is covered by free credits from IBM for this exercise, though the approach works the same anywhere. For our example, three worker nodes (b3c.4x16 – 4 vCPUs, 16 GB RAM) per cluster is more than enough. When the OpenShift clusters are ready, let’s rename the Kubernetes contexts to `mgmt`, `cluster1`, and `cluster2`.
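The context names IBM Cloud generates are cluster-specific, so a quick way to get the short names used throughout this post is `kubectl config rename-context`. A minimal sketch, with the existing context names left as placeholders you need to look up yourself:

```bash
# List the contexts created for the three clusters (names will differ in your account)
kubectl config get-contexts -o name

# Rename them to the short names used in the rest of this post
kubectl config rename-context <ibm-cloud-context-for-mgmt> mgmt
kubectl config rename-context <ibm-cloud-context-for-cluster1> cluster1
kubectl config rename-context <ibm-cloud-context-for-cluster2> cluster2
```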
How to deploy Istio on OpenShift
There are a few specific things we have to do to deploy Istio on OpenShift, but they are well documented here. By default, OpenShift doesn’t allow containers to run with user ID 0. Let’s enable UID 0 for Istio’s service accounts by running the commands below:
```bash
oc --context cluster1 adm policy add-scc-to-group anyuid system:serviceaccounts:istio-system
oc --context cluster1 adm policy add-scc-to-group anyuid system:serviceaccounts:istio-operator
oc --context cluster2 adm policy add-scc-to-group anyuid system:serviceaccounts:istio-system
oc --context cluster2 adm policy add-scc-to-group anyuid system:serviceaccounts:istio-operator
```
Note that the `istio-operator` commands aren’t in the Istio documentation, but they are needed to deploy Istio using the Operator approach. We can deploy Istio on `cluster1` using the following yaml:
```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istiocontrolplane-default
  namespace: istio-system
spec:
  profile: openshift
  meshConfig:
    accessLogFile: /dev/stdout
    enableAutoMtls: true
    defaultConfig:
      envoyMetricsService:
        address: enterprise-agent.gloo-mesh:9977
      envoyAccessLogService:
        address: enterprise-agent.gloo-mesh:9977
      proxyMetadata:
        ISTIO_META_DNS_CAPTURE: "true"
        ISTIO_META_DNS_AUTO_ALLOCATE: "true"
        GLOO_MESH_CLUSTER_NAME: cluster1
  values:
    global:
      meshID: mesh1
      multiCluster:
        clusterName: cluster1
      trustDomain: cluster1
      network: network1
      meshNetworks:
        network1:
          endpoints:
          - fromRegistry: cluster1
          gateways:
          - registryServiceName: istio-ingressgateway.istio-system.svc.cluster.local
            port: 443
        vm-network:
  components:
    ingressGateways:
    - name: istio-ingressgateway
      label:
        topology.istio.io/network: network1
      enabled: true
      k8s:
        env:
          # sni-dnat adds the clusters required for AUTO_PASSTHROUGH mode
          - name: ISTIO_META_ROUTER_MODE
            value: "sni-dnat"
          # traffic through this gateway should be routed inside the network
          - name: ISTIO_META_REQUESTED_NETWORK_VIEW
            value: network1
        service:
          ports:
            - name: http2
              port: 80
              targetPort: 8080
            - name: https
              port: 443
              targetPort: 8443
            - name: tcp-status-port
              port: 15021
              targetPort: 15021
            - name: tls
              port: 15443
              targetPort: 15443
            - name: tcp-istiod
              port: 15012
              targetPort: 15012
            - name: tcp-webhook
              port: 15017
              targetPort: 15017
    pilot:
      k8s:
        env:
          - name: PILOT_SKIP_VALIDATE_TRUST_DOMAIN
            value: "true"
```
Some of the values aren’t mandatory, but we will use the same yaml to demonstrate other features, like virtual machine (VM) integration (stay tuned, we’ll probably write another blog on that topic soon!)
Notice a few values:
- the `profile` value is set to `openshift`.
- the `envoyMetricsService` and `envoyAccessLogService` values, which allow Gloo Mesh to consolidate metrics and access logs globally.
- the `trustDomain` value, which ensures a globally unique identity for each pod (as long as each workload uses its own Kubernetes Service Account).
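The post doesn’t prescribe how to apply this manifest; one way to do it, sketched here assuming `istioctl` 1.9 is on your path and the manifest above is saved as `istio-cluster1.yaml` (a hypothetical filename), is the operator route:

```bash
# Install the Istio operator on cluster1 (creates the istio-operator namespace)
istioctl --context cluster1 operator init

# Create the control-plane namespace and hand the IstioOperator resource to the operator
kubectl --context cluster1 create namespace istio-system
kubectl --context cluster1 apply -f istio-cluster1.yaml

# Watch istiod and the ingress gateway come up
kubectl --context cluster1 -n istio-system get pods -w
```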
Let’s use the same yaml to deploy Istio on `cluster2`, replacing `cluster1` with `cluster2` and `network1` with `network2`. Note that we use different network values because we don’t have a flat network (the pods from different clusters can’t communicate directly). Everything described in this blog would also work with a flat network.
After installation is complete, we need to expose an OpenShift route for the ingress gateway on each cluster:
```bash
oc --context cluster1 -n istio-system expose svc/istio-ingressgateway --port=http2
oc --context cluster2 -n istio-system expose svc/istio-ingressgateway --port=http2
```
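To double-check the routes, you can print the generated hostnames; a small sketch, assuming `oc expose` created routes named `istio-ingressgateway` (the default when exposing that service):

```bash
# Print the external hostname assigned to each ingress gateway route
oc --context cluster1 -n istio-system get route istio-ingressgateway -o jsonpath='{.spec.host}{"\n"}'
oc --context cluster2 -n istio-system get route istio-ingressgateway -o jsonpath='{.spec.host}{"\n"}'
```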
Gloo Mesh deployment
Let’s install Gloo Mesh using the Helm chart and the following options:
```bash
helm install gloo-mesh-enterprise gloo-mesh-enterprise/gloo-mesh-enterprise \
  --namespace gloo-mesh --kube-context mgmt \
  --set licenseKey=${GLOO_MESH_LICENSE_KEY} \
  --set gloo-mesh-ui.GlooMeshDashboard.apiserver.floatingUserId=true
```
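If the chart repository and namespace aren’t set up yet, those steps come first; the repo URL below is my assumption based on the Gloo Mesh Enterprise docs, so verify it against the version you’re installing. A quick pod check afterwards confirms the management plane is up:

```bash
# Prerequisites (adjust the repo URL to the one in your Gloo Mesh Enterprise docs)
helm repo add gloo-mesh-enterprise https://storage.googleapis.com/gloo-mesh-enterprise/gloo-mesh-enterprise
helm repo update
kubectl --context mgmt create namespace gloo-mesh

# After the install, make sure the management plane components are running
kubectl --context mgmt -n gloo-mesh get pods
```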
By default, the `kubernetes-admin` user is granted the Gloo Mesh admin role, but when deploying OpenShift on IBM Cloud, the user has a different name. Let’s use the following snippet to update the role binding:
```bash
cat > rolebinding-patch.yaml <<EOF
spec:
  roleRef:
    name: admin-role
    namespace: gloo-mesh
  subjects:
    - kind: User
      name: $(kubectl --context mgmt get user -o jsonpath='{.items[0].metadata.name}')
EOF

kubectl --context mgmt -n gloo-mesh patch rolebindings.rbac.enterprise.mesh.gloo.solo.io admin-role-binding --type=merge --patch "$(cat rolebinding-patch.yaml)"
```
Istio cluster registration
To register the Istio clusters, we need to find the external IP of the Gloo Mesh service:
```bash
SVC=$(kubectl --context mgmt -n gloo-mesh get svc enterprise-networking -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
```
Now, we can register the Istio clusters using `meshctl`:
```bash
meshctl cluster register --mgmt-context=mgmt --remote-context=cluster1 --relay-server-address=$SVC:9900 enterprise cluster1 --cluster-domain cluster.local
meshctl cluster register --mgmt-context=mgmt --remote-context=cluster2 --relay-server-address=$SVC:9900 enterprise cluster2 --cluster-domain cluster.local
```
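To confirm both registrations succeeded, you can check that the relay agent is running on each Istio cluster and that the management cluster knows about them; a sketch, assuming the registered clusters show up as `KubernetesCluster` resources in the `gloo-mesh` namespace:

```bash
# The enterprise-agent deployed during registration should be running on each cluster
kubectl --context cluster1 -n gloo-mesh get pods
kubectl --context cluster2 -n gloo-mesh get pods

# The management plane should now list both registered clusters
kubectl --context mgmt -n gloo-mesh get kubernetesclusters
```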
Bookinfo application deployment
The Istio sidecar injected into each application pod runs with user ID 1337, which is not allowed by default in OpenShift. To allow this user ID to be used, execute the following commands:
```bash
oc --context cluster1 adm policy add-scc-to-group privileged system:serviceaccounts:default
oc --context cluster1 adm policy add-scc-to-group anyuid system:serviceaccounts:default
```
CNI on OpenShift is managed by Multus, which requires a `NetworkAttachmentDefinition` to be present in the application namespace in order to invoke the `istio-cni` plugin. We create it by executing the following command:
```bash
cat <<EOF | oc --context cluster1 -n default create -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: istio-cni
EOF
```
Now, we can deploy the `bookinfo` application on `cluster1`:
```bash
kubectl --context cluster1 label namespace default istio-injection=enabled

# deploy bookinfo application components for all versions less than v3
kubectl --context cluster1 apply -f https://raw.githubusercontent.com/istio/istio/1.8.2/samples/bookinfo/platform/kube/bookinfo.yaml -l 'app,version notin (v3)'

# deploy all bookinfo service accounts
kubectl --context cluster1 apply -f https://raw.githubusercontent.com/istio/istio/1.8.2/samples/bookinfo/platform/kube/bookinfo.yaml -l 'account'

# configure ingress gateway to access bookinfo
kubectl --context cluster1 apply -f https://raw.githubusercontent.com/istio/istio/1.8.2/samples/bookinfo/networking/bookinfo-gateway.yaml
```
As you can see, we deployed everything except version `v3` of the `reviews` service. We can follow the same steps to deploy the `bookinfo` application on the second cluster, this time including version `v3` (a sketch of the cluster2 commands follows).
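For completeness, here is a sketch of those cluster2 steps, assuming you repeat the same SCC grants and `NetworkAttachmentDefinition` before deploying bookinfo, this time without the label filter that excluded `v3`:

```bash
# Same SCC grants and istio-cni NetworkAttachmentDefinition as on cluster1
oc --context cluster2 adm policy add-scc-to-group privileged system:serviceaccounts:default
oc --context cluster2 adm policy add-scc-to-group anyuid system:serviceaccounts:default
cat <<EOF | oc --context cluster2 -n default create -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: istio-cni
EOF

# Deploy bookinfo on cluster2, this time including reviews v3
kubectl --context cluster2 label namespace default istio-injection=enabled
kubectl --context cluster2 apply -f https://raw.githubusercontent.com/istio/istio/1.8.2/samples/bookinfo/platform/kube/bookinfo.yaml
kubectl --context cluster2 apply -f https://raw.githubusercontent.com/istio/istio/1.8.2/samples/bookinfo/networking/bookinfo-gateway.yaml
```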
Here is a diagram of the current situation:
Mesh federation
Gloo Mesh makes it very easy to federate the different Istio clusters. We simply need to create a Virtual Mesh using the following yaml:
```yaml
apiVersion: networking.mesh.gloo.solo.io/v1
kind: VirtualMesh
metadata:
  name: virtual-mesh
  namespace: gloo-mesh
spec:
  mtlsConfig:
    autoRestartPods: true
    shared:
      rootCertificateAuthority:
        generated: {}
  federation: {}
  globalAccessPolicy: ENABLED
  meshes:
  - name: istiod-istio-system-cluster1
    namespace: gloo-mesh
  - name: istiod-istio-system-cluster2
    namespace: gloo-mesh
```
It triggers the creation of a root certificate (we could also have provided our own) and the generation of intermediate CA certificates for the different Istio clusters. Basically, it automates what you would otherwise have to do manually by following this Istio documentation. The second thing you get when you create a Virtual Mesh is workload discovery. Gloo Mesh discovers all the services running on one cluster and makes the other clusters aware of them using `ServiceEntries`. For example, here are the `ServiceEntries` that have been created on `cluster1`:
```
NAMESPACE      NAME                                                    HOSTS                                                       LOCATION        RESOLUTION   AGE
istio-system   details.default.svc.cluster2.global                     [details.default.svc.cluster2.global]                       MESH_INTERNAL   DNS          8m12s
istio-system   istio-ingressgateway.istio-system.svc.cluster2.global   [istio-ingressgateway.istio-system.svc.cluster2.global]     MESH_INTERNAL   DNS          8m12s
istio-system   productpage.default.svc.cluster2.global                 [productpage.default.svc.cluster2.global]                   MESH_INTERNAL   DNS          8m12s
istio-system   ratings.default.svc.cluster2.global                     [ratings.default.svc.cluster2.global]                       MESH_INTERNAL   DNS          8m12s
istio-system   reviews.default.svc.cluster2.global                     [reviews.default.svc.cluster2.global]                       MESH_INTERNAL   DNS          8m12s
```
As you can see, `cluster1` is now aware of the services running on `cluster2`. Let’s have a look at one of these `ServiceEntries`:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  labels:
    cluster.multicluster.solo.io: ""
    owner.networking.mesh.gloo.solo.io: gloo-mesh
    relay-agent: cluster1
  name: reviews.default.svc.cluster2.global
  namespace: istio-system
spec:
  addresses:
  - x.x.x.x
  endpoints:
  - address: y.y.y.y
    labels:
      cluster: cluster2
    ports:
      http: 15443
  hosts:
  - reviews.default.svc.cluster2.global
  location: MESH_INTERNAL
  ports:
  - name: http
    number: 9080
    protocol: HTTP
  resolution: DNS
```
You can see that Gloo Mesh has assigned a unique IP address and built an endpoint entry that specifies how to reach this service (through the ingress gateway of `cluster2` in this case).
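If you want to poke at these objects yourself, standard kubectl commands are enough; for example:

```bash
# List the ServiceEntries that Gloo Mesh created for remote services
kubectl --context cluster1 -n istio-system get serviceentries

# Inspect the entry for the remote reviews service in full
kubectl --context cluster1 -n istio-system get serviceentry reviews.default.svc.cluster2.global -o yaml
```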
Note that there’s a native multi-cluster discovery mechanism in Istio called Endpoint Discovery Service (EDS), but it has some limitations, namely:
- it doesn’t create `ServiceEntries`, so users have no visibility into which services have been discovered
- each Istio cluster needs to discover the services of all the other clusters, which generates more load
- each Istio cluster discovers services by communicating with the Kubernetes API servers of the other clusters, so you need to share the Kubernetes credentials of every cluster with every other cluster, which is a security concern
- if a Kubernetes API server isn’t available, `istiod` can’t start
The Gloo Mesh discovery mechanism doesn’t have any of these limitations. The services are discovered by a local agent running on each cluster which passes the information to the management plane through a secure gRPC channel.
Global Access Control
Perhaps you’ve noticed that we created the Virtual Mesh with the `globalAccessPolicy` option enabled. When doing so, Gloo Mesh creates Istio `AuthorizationPolicies` on all the Istio clusters to forbid all service-to-service communication by default. We can then create Gloo Mesh `AccessPolicies` to define which services are allowed to communicate with each other (globally). Here is an example that allows the `productpage` service on `cluster1` to communicate with the `details` and `reviews` services on any cluster:
```yaml
apiVersion: networking.mesh.gloo.solo.io/v1
kind: AccessPolicy
metadata:
  namespace: gloo-mesh
  name: productpage
spec:
  sourceSelector:
  - kubeServiceAccountRefs:
      serviceAccounts:
        - name: bookinfo-productpage
          namespace: default
          clusterName: cluster1
  destinationSelector:
  - kubeServiceMatcher:
      namespaces:
      - default
      labels:
        service: details
  - kubeServiceMatcher:
      namespaces:
      - default
      labels:
        service: reviews
```
Gloo Mesh translates this `AccessPolicy` into the following `AuthorizationPolicies` on each cluster:
```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  labels:
    cluster.multicluster.solo.io: ""
    owner.networking.mesh.gloo.solo.io: gloo-mesh
    relay-agent: cluster1
  name: details
  namespace: default
spec:
  rules:
  - from:
    - source:
        principals:
        - cluster1/ns/default/sa/bookinfo-productpage
  selector:
    matchLabels:
      app: details
```
and
```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  labels:
    cluster.multicluster.solo.io: ""
    owner.networking.mesh.gloo.solo.io: gloo-mesh
    relay-agent: cluster1
  name: reviews
  namespace: default
spec:
  rules:
  - from:
    - source:
        principals:
        - cluster1/ns/default/sa/bookinfo-productpage
  selector:
    matchLabels:
      app: reviews
```
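You can verify the translation directly on the clusters; for example, by listing the Istio `AuthorizationPolicies` that Gloo Mesh wrote into the `default` namespace:

```bash
# The AuthorizationPolicies translated from the Gloo Mesh AccessPolicy should show up here
kubectl --context cluster1 -n default get authorizationpolicies.security.istio.io
kubectl --context cluster2 -n default get authorizationpolicies.security.istio.io
```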
Gloo Mesh also shows which currently running services match the criteria of the policy:
Cross-cluster communication
To allow a service on one cluster to communicate with a service on another cluster, you would normally need to create several Istio objects (such as `VirtualServices` and `DestinationRules`). Gloo Mesh makes this easier with `TrafficPolicies`. Here is an example that defines a traffic shift between different versions of a service running in different clusters:
```yaml
apiVersion: networking.mesh.gloo.solo.io/v1
kind: TrafficPolicy
metadata:
  namespace: gloo-mesh
  name: simple
spec:
  sourceSelector:
  - kubeWorkloadMatcher:
      namespaces:
      - default
  destinationSelector:
  - kubeServiceRefs:
      services:
        - clusterName: cluster1
          name: reviews
          namespace: default
  policy:
    trafficShift:
      destinations:
        - kubeService:
            clusterName: cluster2
            name: reviews
            namespace: default
            subset:
              version: v3
          weight: 75
        - kubeService:
            clusterName: cluster1
            name: reviews
            namespace: default
            subset:
              version: v1
          weight: 15
        - kubeService:
            clusterName: cluster1
            name: reviews
            namespace: default
            subset:
              version: v2
          weight: 10
```
Very easy, no? Gloo Mesh can also be used to define locality-based failover. The example below defines a new hostname called `reviews.global`, which is available on any cluster. This allows a service to communicate with the local `reviews` service if it’s available, or to automatically fail over to a remote instance if it’s not (trying the next zone first, then the next region):
```yaml
apiVersion: networking.enterprise.mesh.gloo.solo.io/v1beta1
kind: VirtualDestination
metadata:
  name: reviews-global
  namespace: gloo-mesh
spec:
  hostname: reviews.global
  port:
    number: 9080
    protocol: http
  localized:
    outlierDetection:
      consecutiveErrors: 1
      maxEjectionPercent: 100
      interval: 5s
      baseEjectionTime: 120s
    destinationSelectors:
    - kubeServiceMatcher:
        labels:
          app: reviews
  virtualMesh:
    name: virtual-mesh
    namespace: gloo-mesh
```
A `TrafficPolicy` can then be created to use the `VirtualDestination` in a transparent way (without explicitly sending requests to the `reviews.global` hostname), as sketched below.
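Here is a hedged sketch of what that policy could look like, routing traffic aimed at the local `reviews` service to the `reviews-global` VirtualDestination. The policy name is hypothetical, and the `virtualDestination` field used as the traffic-shift target is my assumption about the API; check the TrafficPolicy reference for your Gloo Mesh version.

```yaml
apiVersion: networking.mesh.gloo.solo.io/v1
kind: TrafficPolicy
metadata:
  namespace: gloo-mesh
  name: reviews-failover   # hypothetical name
spec:
  sourceSelector:
  - kubeWorkloadMatcher:
      namespaces:
      - default
  destinationSelector:
  - kubeServiceRefs:
      services:
        - clusterName: cluster1
          name: reviews
          namespace: default
  policy:
    trafficShift:
      destinations:
        # Assumed field: a traffic-shift destination that references the VirtualDestination
        - virtualDestination:
            name: reviews-global
            namespace: gloo-mesh
          weight: 100
```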
Global observability
As explained at the beginning of the post, we deployed Istio so that all the Envoy proxies send their metrics to the local Gloo Mesh agent, which then passes this data to the Gloo Mesh management plane. We can then configure a Prometheus instance to scrape the metrics from Gloo Mesh and visualize them using Grafana or Kiali. We’ll also be adding the ability to see visualizations directly in the Gloo Mesh admin dashboard in the near future. We can also gather access logs globally, on demand.
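As a rough idea of what that scraping could look like, here is a minimal Prometheus scrape-job sketch; the target service, port, and path are assumptions based on the `enterprise-networking` component exposing the consolidated metrics, so verify them against your Gloo Mesh release:

```yaml
# prometheus.yml fragment (sketch): scrape the metrics consolidated by the Gloo Mesh management plane
scrape_configs:
  - job_name: gloo-mesh
    scrape_interval: 15s
    metrics_path: /metrics                                             # assumed default path
    static_configs:
      - targets:
          - enterprise-networking.gloo-mesh.svc.cluster.local:9091     # assumed metrics port
```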
Let’s say we have an issue with the `reviews` service on all the clusters and want to understand what’s going on. We can start gathering all the corresponding access logs by creating an `AccessLogRecord`:
```yaml
apiVersion: observability.enterprise.mesh.gloo.solo.io/v1
kind: AccessLogRecord
metadata:
  name: access-log-reviews
  namespace: gloo-mesh
spec:
  workloadSelectors:
  - kubeWorkloadMatcher:
      namespaces:
      - default
      labels:
        app: reviews
```
Then, we can send a request to the Gloo Mesh endpoint to get access logs like below:
{ "result": { "workloadRef": { "name": "reviews-v2", "namespace": "default", "clusterName": "cluster1" }, "httpAccessLog": { "commonProperties": { "downstreamRemoteAddress": { "socketAddress": { "address": "10.102.158.19", "portValue": 47198 } }, "downstreamLocalAddress": { "socketAddress": { "address": "10.102.158.25", "portValue": 9080 } }, "tlsProperties": { "tlsVersion": "TLSv1_2", "tlsCipherSuite": 49200, "tlsSniHostname": "outbound_.9080_._.reviews.default.svc.cluster.local", "localCertificateProperties": { "subjectAltName": [ { "uri": "spiffe://cluster1/ns/default/sa/bookinfo-reviews" } ] }, "peerCertificateProperties": { "subjectAltName": [ { "uri": "spiffe://cluster1/ns/default/sa/bookinfo-productpage" } ] } }, "startTime": "2021-03-21T17:33:46.182478Z", "timeToLastRxByte": "0.000062572s", "timeToFirstUpstreamTxByte": "0.000428530s", "timeToLastUpstreamTxByte": "0.000436843s", "timeToFirstUpstreamRxByte": "0.040638581s", "timeToLastUpstreamRxByte": "0.040692768s", "timeToFirstDownstreamTxByte": "0.040671495s", "timeToLastDownstreamTxByte": "0.040708877s", "upstreamRemoteAddress": { "socketAddress": { "address": "127.0.0.1", "portValue": 9080 } }, "upstreamLocalAddress": { "socketAddress": { "address": "127.0.0.1", "portValue": 43078 } }, "upstreamCluster": "inbound|9080||", "metadata": { "filterMetadata": { "istio_authn": { "request.auth.principal": "cluster1/ns/default/sa/bookinfo-productpage", "source.namespace": "default", "source.principal": "cluster1/ns/default/sa/bookinfo-productpage", "source.user": "cluster1/ns/default/sa/bookinfo-productpage" } } }, "routeName": "default", "downstreamDirectRemoteAddress": { "socketAddress": { "address": "10.102.158.19", "portValue": 47198 } } }, "protocolVersion": "HTTP11", "request": { "requestMethod": "GET", "scheme": "http", "authority": "reviews:9080", "path": "/reviews/0", "userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36", "requestId": "b0522245-d300-46a4-bfd3-727fcfa42efd", "requestHeadersBytes": "644" }, "response": { "responseCode": 200, "responseHeadersBytes": "1340", "responseBodyBytes": "379", "responseCodeDetails": "via_upstream" } } } }
As you can see, it’s far more detailed than traditional access logs. We can get information about the source and remote IP addresses, the identity of the source and remote services, performance metrics, and of course the traditional HTTP headers.
As shown here, setting up an Istio multi-cluster deployment on Red Hat OpenShift is easy with Gloo Mesh. You can run the latest Istio version, get enterprise support from Solo.io, and operate reliably everywhere. High availability is an important piece of business continuity, so make sure your applications are covered with redundancies and failover. Observability is just as important so you can monitor what is happening and respond quickly to any issues. We hope this blog helped you understand how to gain more resiliency for your applications.
You might also be interested in reading some of our documentation on multi-cluster meshes or troubleshooting. Or request a demo from our Gloo Mesh product page today!
And much more
We can’t cover all the Gloo Mesh features in this blog – there are too many! For example, Gloo Mesh comes with fine-grained role-based access control (RBAC), which allows you to define who can create what kind of policies with what kind of content. It also allows you to declaratively deploy WebAssembly filters. Keep exploring our blog for more information on other topics, and let us know if you have any questions!