Cilium Service Mesh in Action

In December 2021, I participated in the Cilium Service Mesh beta program, so I was curious to try the service mesh features that have been incorporated into Cilium 1.12. It’s interesting to watch these technologies develop, and even better to dive into them to see what capabilities, features, benefits, and challenges they bring at each stage of their development. In this blog, you can follow along as I walk through installing Cilium 1.12 on GKE with the ingress feature enabled, create an ingress object to expose one service, create a second ingress object to expose another service, and then review Layer 7 traffic management to see how this works.

Cilium ingress

We’ll walk through trying this out, and you are welcome to follow along. You can find the documentation for this feature here.

Install Cilium on GKE with the ingress feature enabled

Let’s start by deploying a new GKE Kubernetes cluster to test it out.

gcloud container clusters create cilium \
    --node-taints node.cilium.io/agent-not-ready=true:NoExecute \
    --zone europe-west1-d
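
The taint keeps application pods off the nodes until the Cilium agent is ready to manage them. Once the cluster is up, you can check that the taint is in place with something like this:

# List each node and the keys of its taints (expect node.cilium.io/agent-not-ready)
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'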

Then, I can deploy Cilium with the ingress feature enabled on GKE using the cilium CLI.

cilium install \
    --kube-proxy-replacement=strict \
    --helm-set ingressController.enabled=true

Here is the output:

🔮 Auto-detected Kubernetes kind: GKE
ℹ️  Using Cilium version 1.12.0
🔮 Auto-detected cluster name: gke-solo-test-236622-europe-west1-d-cilium
🔮 Auto-detected datapath mode: gke
✅ Detected GKE native routing CIDR: 10.16.0.0/14
ℹ️  helm template --namespace kube-system cilium cilium/cilium --version 1.12.0 --set cluster.id=0,cluster.name=gke-solo-test-236622-europe-west1-d-cilium,cni.binPath=/home/kubernetes/bin,encryption.nodeEncryption=false,gke.disableDefaultSnat=true,gke.enabled=true,ingressController.enabled=true,ipam.mode=kubernetes,ipv4NativeRoutingCIDR=10.16.0.0/14,kubeProxyReplacement=strict,nodeinit.enabled=true,nodeinit.reconfigureKubelet=true,nodeinit.removeCbrBridge=true,operator.replicas=1,serviceAccounts.cilium.name=cilium,serviceAccounts.operator.name=cilium-operator
ℹ️  Storing helm values file in kube-system/cilium-cli-helm-values Secret
🚀 Creating Resource quotas...
🔑 Created CA in secret cilium-ca
🔑 Generating certificates for Hubble...
🚀 Creating Service accounts...
🚀 Creating Cluster roles...
🚀 Creating ConfigMap for Cilium version 1.12.0...
🚀 Creating GKE Node Init DaemonSet...
🚀 Creating Agent DaemonSet...
🚀 Creating Operator Deployment...
⌛ Waiting for Cilium to be installed and ready...
✅ Cilium was successfully installed! Run 'cilium status' to view installation health
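
As the last line suggests, you can wait for the installation to settle and confirm its health with the cilium CLI:

# Block until Cilium reports that all components are ready
cilium status --wait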

Let’s have a look at the different pods running in my cluster:
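
For example, you can list them across every namespace with:

# List all pods in every namespace
kubectl get pods -A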

NAMESPACE     NAME                                               READY   STATUS    RESTARTS   AGE
kube-system   cilium-dlcxr                                       1/1     Running   0          2m43s
kube-system   cilium-lb5hw                                       1/1     Running   0          2m43s
kube-system   cilium-node-init-jqp5x                             1/1     Running   0          2m43s
kube-system   cilium-node-init-r2vn8                             1/1     Running   0          2m43s
kube-system   cilium-node-init-z799q                             1/1     Running   0          2m43s
kube-system   cilium-operator-598c495f5f-7w8k9                   1/1     Running   0          2m43s
kube-system   cilium-vnw5c                                       1/1     Running   0          2m43s
kube-system   event-exporter-gke-5479fd58c8-zglht                2/2     Running   0          3m45s
kube-system   fluentbit-gke-fg7nz                                2/2     Running   0          3m4s
kube-system   fluentbit-gke-m5hrx                                2/2     Running   0          3m4s
kube-system   fluentbit-gke-msjch                                2/2     Running   0          3m5s
kube-system   gke-metrics-agent-9gq9b                            1/1     Running   0          3m4s
kube-system   gke-metrics-agent-lwkqr                            1/1     Running   0          3m4s
kube-system   gke-metrics-agent-qbp7m                            1/1     Running   0          3m5s
kube-system   konnectivity-agent-78df777b57-hz9rv                1/1     Running   0          3m39s
kube-system   konnectivity-agent-78df777b57-m268w                1/1     Running   0          106s
kube-system   konnectivity-agent-78df777b57-zbx6b                1/1     Running   0          106s
kube-system   konnectivity-agent-autoscaler-555f599d94-hjvwl     1/1     Running   0          3m37s
kube-system   kube-dns-56494768b7-nnxvf                          4/4     Running   0          102s
kube-system   kube-dns-56494768b7-p54dr                          4/4     Running   0          3m50s
kube-system   kube-dns-autoscaler-f4d55555-qkcg9                 1/1     Running   0          3m50s
kube-system   kube-proxy-gke-cilium-default-pool-21a8e3bd-711h   1/1     Running   0          2m30s
kube-system   kube-proxy-gke-cilium-default-pool-21a8e3bd-fznt   1/1     Running   0          2m24s
kube-system   kube-proxy-gke-cilium-default-pool-21a8e3bd-gz19   1/1     Running   0          2m25s
kube-system   l7-default-backend-69fb9fd9f9-j9b9f                1/1     Running   0          3m35s
kube-system   metrics-server-v0.4.5-bbb794dcc-27s88              2/2     Running   0          87s
kube-system   pdcsi-node-2h6q8                                   2/2     Running   0          3m4s
kube-system   pdcsi-node-985tl                                   2/2     Running   0          3m4s
kube-system   pdcsi-node-bsdh9                                   2/2     Running   0          3m5s

There’s no difference in terms of what’s being deployed when the ingress feature is enabled.

Next, let’s have a look at the Kubernetes services:
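
For example, listing services across every namespace:

# List services in every namespace
kubectl get svc -A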

NAMESPACE     NAME                   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)         AGE
kube-system   default-http-backend   NodePort    10.118.85.221   <none>        80:32731/TCP    4m37s
kube-system   kube-dns               ClusterIP   10.118.80.10    <none>        53/UDP,53/TCP   4m53s
kube-system   metrics-server         ClusterIP   10.118.89.169   <none>        443/TCP         4m14s

Again, nothing is different from a standard Cilium deployment. I was expecting a Kubernetes service for my ingresses, so let’s see what happens when an ingress object is created. But before that, let’s deploy the Istio bookinfo demo application.

kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.14/samples/bookinfo/platform/kube/bookinfo.yaml
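
It can take a minute for the bookinfo pods to come up, so it is worth waiting until they are ready before creating any ingress, for example:

# Wait for all pods in the default namespace to become ready
kubectl wait --for=condition=Ready pod --all -n default --timeout=120s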

Create an ingress object to expose one service

I’ll create my first ingress object using the cilium class to expose the details service.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: details-ingress
  namespace: default
spec:
  ingressClassName: cilium
  rules:
    - http:
        paths:
          - backend:
              service:
                name: details
                port:
                  number: 9080
            path: /details
            pathType: Prefix
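
Assuming the manifest is saved as details-ingress.yaml (the filename is just an example), apply it and check the resulting ingress:

kubectl apply -f details-ingress.yaml
kubectl get ingress details-ingress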

I can see that a new Kubernetes service has been created:

NAMESPACE     NAME                             TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)         AGE
default       cilium-ingress-details-ingress   LoadBalancer   10.118.93.239   <pending>     80:30463/TCP    30s

It’s a LoadBalancer service, which means that it will trigger the creation of a Google Cloud network load balancer. If I wait a little bit more, I can see that an external IP is now assigned to this service:

NAMESPACE     NAME                             TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)         AGE
default       cilium-ingress-details-ingress   LoadBalancer   10.118.93.239   34.140.121.201   80:30463/TCP    5m19s

I should now be able to access the details service through this external IP:

export EXTERNAL_IP=$(kubectl get svc cilium-ingress-details-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

curl "http://${EXTERNAL_IP}/details/1"

So what happened here behind the scenes? First of all, let’s have a look at the Cilium logs:

kubectl -n kube-system logs -l k8s-app=cilium | grep cilium

Here is the output:

level=info msg="[lds: add/update listener 'cilium-ingress-default-details-ingress'" subsys=envoy-upstream threadID=81
level=info msg="Adding new proxy port rules for cilium-ingress-default-details-ingress:13252" proxy port name=cilium-ingress-default-details-ingress subsys=proxy
level=info msg="Adding new proxy port rules for cilium-ingress-default-details-ingress:14878" proxy port name=cilium-ingress-default-details-ingress subsys=proxy
level=info msg="Adding new proxy port rules for cilium-ingress-default-details-ingress:10390" proxy port name=cilium-ingress-default-details-ingress subsys=proxy

It has triggered the creation of a CiliumEnvoyConfig Kubernetes object:

apiVersion: cilium.io/v2
kind: CiliumEnvoyConfig
metadata:
  creationTimestamp: "2022-07-29T12:45:28Z"
  generation: 1
  name: cilium-ingress-default-details-ingress
  namespace: default
  ownerReferences:
  - apiVersion: networking.k8s.io/v1
    kind: Ingress
    name: details-ingress
    uid: 33b25cfa-7ab7-491f-bc25-d5c161d156fc
  resourceVersion: "3426"
  uid: 0e887959-cb8c-45cd-8e99-9486bc50ab6e
spec:
  backendServices:
  - name: details
    namespace: default
    number:
    - "9080"
  resources:
  - '@type': type.googleapis.com/envoy.config.listener.v3.Listener
    filterChains:
    - filterChainMatch:
        transportProtocol: raw_buffer
      filters:
      - name: envoy.filters.network.http_connection_manager
        typedConfig:
          '@type': type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          httpFilters:
          - name: envoy.filters.http.router
          rds:
            routeConfigName: cilium-ingress-default-details-ingress_route
          statPrefix: cilium-ingress-default-details-ingress
    listenerFilters:
    - name: envoy.filters.listener.tls_inspector
    name: cilium-ingress-default-details-ingress
    socketOptions:
    - description: Enable TCP keep-alive, annotation io.cilium/tcp-keep-alive. (default
        to enabled)
      intValue: "1"
      level: "1"
      name: "9"
      state: STATE_LISTENING
    - description: TCP keep-alive idle time (in seconds). Annotation io.cilium/tcp-keep-alive-idle
        (defaults to 10s)
      intValue: "10"
      level: "6"
      name: "4"
      state: STATE_LISTENING
    - description: TCP keep-alive probe intervals (in seconds). Annotation io.cilium/tcp-keep-alive-probe-interval
        (defaults to 5s)
      intValue: "5"
      level: "6"
      name: "5"
      state: STATE_LISTENING
    - description: TCP keep-alive probe max failures. Annotation io.cilium/tcp-keep-alive-probe-max-failures
        (defaults to 10)
      intValue: "10"
      level: "6"
      name: "6"
      state: STATE_LISTENING
  - '@type': type.googleapis.com/envoy.config.route.v3.RouteConfiguration
    name: cilium-ingress-default-details-ingress_route
    virtualHosts:
    - domains:
      - '*'
      name: '*'
      routes:
      - match:
          safeRegex:
            googleRe2: {}
            regex: /details(/.*)?$
        route:
          cluster: default/details:9080
          maxStreamDuration:
            maxStreamDuration: 0s
  - '@type': type.googleapis.com/envoy.config.cluster.v3.Cluster
    connectTimeout: 5s
    name: default/details:9080
    outlierDetection:
      consecutiveLocalOriginFailure: 2
      splitExternalLocalOriginErrors: true
    type: EDS
    typedExtensionProtocolOptions:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        '@type': type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        useDownstreamProtocolConfig:
          http2ProtocolOptions: {}
  services:
  - listener: cilium-ingress-default-details-ingress
    name: cilium-ingress-details-ingress
    namespace: default

This contains a raw Envoy configuration which is then passed to the Envoy process running in the Cilium pods.

We can take a look at the Envoy config dump using the following commands:

cilium=$(kubectl -n kube-system get pods -l k8s-app=cilium -o jsonpath='{.items[0].metadata.name}')
kubectl -n kube-system exec -q $cilium -- apt update
kubectl -n kube-system exec -q $cilium -- apt -y install curl
kubectl -n kube-system exec -q $cilium -- curl -s --unix-socket /var/run/cilium/envoy-admin.sock http://localhost/config_dump

It contains the configuration we’ve seen in the CiliumEnvoyConfig Kubernetes object.
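
The full dump is quite verbose. Assuming it follows the standard Envoy admin /config_dump layout, you can pull out just the dynamic listener names by piping the output to jq on your workstation, for example:

# Requires jq locally; prints only the dynamic listener names from the config dump
kubectl -n kube-system exec -q $cilium -- \
    curl -s --unix-socket /var/run/cilium/envoy-admin.sock http://localhost/config_dump \
    | jq -r '.configs[].dynamic_listeners[]?.name'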

Create a second ingress object to expose another service

Now, let’s create another ingress object to expose the reviews service.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: reviews-ingress
  namespace: default
spec:
  ingressClassName: cilium
  rules:
    - http:
        paths:
          - backend:
              service:
                name: reviews
                port:
                  number: 9080
            path: /reviews
            pathType: Prefix

I can see that a new Kubernetes service has been created for this ingress:

NAMESPACE     NAME                             TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)         AGE
default       cilium-ingress-details-ingress   LoadBalancer   10.118.93.239   34.140.121.201   80:30463/TCP    53m
default       cilium-ingress-reviews-ingress   LoadBalancer   10.118.90.144   35.187.118.187   80:30680/TCP    93s

I should now be able to access the reviews service through this external IP:

export EXTERNAL_IP=$(kubectl get svc cilium-ingress-reviews-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

curl "http://${EXTERNAL_IP}/reviews/1"

As you can see, a new Kubernetes service (and therefore a new Cloud load balancer) is created for each ingress resource, which may not be what you’d expect.

L7 traffic management

Another functionality that is part of the Cilium service mesh features is L7 traffic management. It is quite straightforward to explain, because the feature mainly consists of letting a user create CiliumEnvoyConfig objects to apply raw Envoy configuration. Here, what I want to achieve is sending 90% of the requests to the reviews-v1 pods and 10% to the reviews-v2 pods when any pod sends a request to the reviews Kubernetes service.

 

In order to accomplish this, I took a look at the example in this documentation to figure out how to reach my goal.

I started with the following Kubernetes object:

apiVersion: cilium.io/v2
kind: CiliumEnvoyConfig
metadata:
  name: envoy-lb-listener
spec:
  services:
    - name: reviews
      namespace: default
  resources:
    - "@type": type.googleapis.com/envoy.config.listener.v3.Listener
      name: envoy-lb-listener
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: envoy-lb-listener
                rds:
                  route_config_name: lb_route
                http_filters:
                  - name: envoy.filters.http.router
    - "@type": type.googleapis.com/envoy.config.route.v3.RouteConfiguration
      name: lb_route
      virtual_hosts:
        - name: "lb_route"
          domains: [ "*" ]
          routes:
            - match:
                prefix: "/"
              route:
                weighted_clusters:
                  clusters:
                    - name: "default/reviews-v1"
                      weight: 90
                    - name: "default/reviews-v2"
                      weight: 10
                retry_policy:
                  retry_on: 5xx
                  num_retries: 3
                  per_try_timeout: 1s
    - "@type": type.googleapis.com/envoy.config.cluster.v3.Cluster
      name: "default/reviews-v1"
      connect_timeout: 5s
      lb_policy: ROUND_ROBIN
      type: EDS
      outlier_detection:
        split_external_local_origin_errors: true
        consecutive_local_origin_failure: 2
    - "@type": type.googleapis.com/envoy.config.cluster.v3.Cluster
      name: "default/reviews-v2"
      connect_timeout: 3s
      lb_policy: ROUND_ROBIN
      type: EDS
      outlier_detection:
        split_external_local_origin_errors: true
        consecutive_local_origin_failure: 2

Next, let’s check whether it works by sending a request from the ratings pod:

kubectl exec -it deploy/ratings-v1 -- curl reviews:9080/reviews/0

I got the following error:

no healthy upstream

To learn more, let’s have a look at the Envoy clusters. I didn’t find information in the Cilium documentation about how to display the Envoy clusters, but I found a way to get them using the following commands.

pod=$(kubectl -n kube-system get pods -l k8s-app=cilium -o jsonpath='{.items[0].metadata.name}')
kubectl -n kube-system exec -q ${pod} -- curl -s --unix-socket /var/run/cilium/envoy-admin.sock http://localhost/clusters | grep reviews

Here is the output:

default/reviews-v2::observability_name::default/reviews-v2
default/reviews-v2::outlier::success_rate_average::-1
default/reviews-v2::outlier::success_rate_ejection_threshold::-1
default/reviews-v2::outlier::local_origin_success_rate_average::-1
default/reviews-v2::outlier::local_origin_success_rate_ejection_threshold::-1
default/reviews-v2::default_priority::max_connections::1024
default/reviews-v2::default_priority::max_pending_requests::1024
default/reviews-v2::default_priority::max_requests::1024
default/reviews-v2::default_priority::max_retries::3
default/reviews-v2::high_priority::max_connections::1024
default/reviews-v2::high_priority::max_pending_requests::1024
default/reviews-v2::high_priority::max_requests::1024
default/reviews-v2::high_priority::max_retries::3
default/reviews-v2::added_via_api::true
default/reviews-v1::observability_name::default/reviews-v1
default/reviews-v1::outlier::success_rate_average::-1
default/reviews-v1::outlier::success_rate_ejection_threshold::-1
default/reviews-v1::outlier::local_origin_success_rate_average::-1
default/reviews-v1::outlier::local_origin_success_rate_ejection_threshold::-1
default/reviews-v1::default_priority::max_connections::1024
default/reviews-v1::default_priority::max_pending_requests::1024
default/reviews-v1::default_priority::max_requests::1024
default/reviews-v1::default_priority::max_retries::3
default/reviews-v1::high_priority::max_connections::1024
default/reviews-v1::high_priority::max_pending_requests::1024
default/reviews-v1::high_priority::max_requests::1024
default/reviews-v1::high_priority::max_retries::3
default/reviews-v1::added_via_api::true
default/reviews:9080::observability_name::default/reviews:9080
default/reviews:9080::outlier::success_rate_average::-1
default/reviews:9080::outlier::success_rate_ejection_threshold::-1
default/reviews:9080::outlier::local_origin_success_rate_average::-1
default/reviews:9080::outlier::local_origin_success_rate_ejection_threshold::-1
default/reviews:9080::default_priority::max_connections::1024
default/reviews:9080::default_priority::max_pending_requests::1024
default/reviews:9080::default_priority::max_requests::1024
default/reviews:9080::default_priority::max_retries::3
default/reviews:9080::high_priority::max_connections::1024
default/reviews:9080::high_priority::max_pending_requests::1024
default/reviews:9080::high_priority::max_requests::1024
default/reviews:9080::high_priority::max_retries::3
default/reviews:9080::added_via_api::true
default/reviews:9080::10.16.2.100:9080::cx_active::0
default/reviews:9080::10.16.2.100:9080::cx_connect_fail::0
default/reviews:9080::10.16.2.100:9080::cx_total::0
default/reviews:9080::10.16.2.100:9080::rq_active::0
default/reviews:9080::10.16.2.100:9080::rq_error::0
default/reviews:9080::10.16.2.100:9080::rq_success::0
default/reviews:9080::10.16.2.100:9080::rq_timeout::0
default/reviews:9080::10.16.2.100:9080::rq_total::0
default/reviews:9080::10.16.2.100:9080::hostname::
default/reviews:9080::10.16.2.100:9080::health_flags::healthy
default/reviews:9080::10.16.2.100:9080::weight::1
default/reviews:9080::10.16.2.100:9080::region::
default/reviews:9080::10.16.2.100:9080::zone::
default/reviews:9080::10.16.2.100:9080::sub_zone::
default/reviews:9080::10.16.2.100:9080::canary::false
default/reviews:9080::10.16.2.100:9080::priority::0
default/reviews:9080::10.16.2.100:9080::success_rate::-1.0
default/reviews:9080::10.16.2.100:9080::local_origin_success_rate::-1.0
default/reviews:9080::10.16.0.199:9080::cx_active::1
default/reviews:9080::10.16.0.199:9080::cx_connect_fail::0
default/reviews:9080::10.16.0.199:9080::cx_total::1
default/reviews:9080::10.16.0.199:9080::rq_active::0
default/reviews:9080::10.16.0.199:9080::rq_error::0
default/reviews:9080::10.16.0.199:9080::rq_success::1
default/reviews:9080::10.16.0.199:9080::rq_timeout::0
default/reviews:9080::10.16.0.199:9080::rq_total::1
default/reviews:9080::10.16.0.199:9080::hostname::
default/reviews:9080::10.16.0.199:9080::health_flags::healthy
default/reviews:9080::10.16.0.199:9080::weight::1
default/reviews:9080::10.16.0.199:9080::region::
default/reviews:9080::10.16.0.199:9080::zone::
default/reviews:9080::10.16.0.199:9080::sub_zone::
default/reviews:9080::10.16.0.199:9080::canary::false
default/reviews:9080::10.16.0.199:9080::priority::0
default/reviews:9080::10.16.0.199:9080::success_rate::-1.0
default/reviews:9080::10.16.0.199:9080::local_origin_success_rate::-1.0
default/reviews:9080::10.16.0.229:9080::cx_active::0
default/reviews:9080::10.16.0.229:9080::cx_connect_fail::0
default/reviews:9080::10.16.0.229:9080::cx_total::0
default/reviews:9080::10.16.0.229:9080::rq_active::0
default/reviews:9080::10.16.0.229:9080::rq_error::0
default/reviews:9080::10.16.0.229:9080::rq_success::0
default/reviews:9080::10.16.0.229:9080::rq_timeout::0
default/reviews:9080::10.16.0.229:9080::rq_total::0
default/reviews:9080::10.16.0.229:9080::hostname::
default/reviews:9080::10.16.0.229:9080::health_flags::healthy
default/reviews:9080::10.16.0.229:9080::weight::1
default/reviews:9080::10.16.0.229:9080::region::
default/reviews:9080::10.16.0.229:9080::zone::
default/reviews:9080::10.16.0.229:9080::sub_zone::
default/reviews:9080::10.16.0.229:9080::canary::false
default/reviews:9080::10.16.0.229:9080::priority::0
default/reviews:9080::10.16.0.229:9080::success_rate::-1.0
default/reviews:9080::10.16.0.229:9080::local_origin_success_rate::-1.0

From this, I can see that Cilium hasn’t associated the reviews-v1 and reviews-v2 Envoy clusters with any endpoint.

Then I found the following statement in the troubleshooting guide:

The Envoy Endpoint Discovery Service (EDS) has a name that follows the convention <namespace>/<service-name>:<port>.

I was expecting it to follow the convention <namespace>/<deployment-name>:<port> instead.

So, the only option I have is to define a different Kubernetes service for each version:

apiVersion: v1
kind: Service
metadata:
  name: reviews-v1
  labels:
    app: reviews
    service: reviews
    version: v1
spec:
  ports:
  - port: 9080
    name: http
  selector:
    app: reviews
    version: v1
---
apiVersion: v1
kind: Service
metadata:
  name: reviews-v2
  labels:
    app: reviews
    service: reviews
    version: v2
spec:
  ports:
  - port: 9080
    name: http
  selector:
    app: reviews
    version: v2
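
After applying these manifests, the Kubernetes side should be fine: both new services should get endpoints, which you can check as follows.

# Verify that the per-version services select the reviews pods
kubectl get endpoints reviews-v1 reviews-v2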

However, I still got the same error, and on the Envoy side the reviews-v1 and reviews-v2 clusters still had no endpoints associated with these new Kubernetes services.

After that, I realized that the CiliumEnvoyConfig object created for the Ingress object was referencing backendServices.

So, I took a look at the CRD definition and found the purpose of this option:

BackendServices specifies Kubernetes services whose backends are automatically synced to Envoy using EDS. Traffic for these services is not forwarded to an Envoy listener. This allows an Envoy listener to load balance traffic to these backends while normal Cilium service load balancing takes care of balancing traffic for these services at the same time.

Great, this looks like exactly what I need!

Let’s update the CiliumEnvoyConfig object as follows:

apiVersion: cilium.io/v2
kind: CiliumEnvoyConfig
metadata:
  name: envoy-lb-listener
spec:
  services:
    - name: reviews
      namespace: default
  backendServices:
    - name: reviews-v1
      namespace: default
    - name: reviews-v2
      namespace: default
  resources:
    - "@type": type.googleapis.com/envoy.config.listener.v3.Listener
      name: envoy-lb-listener
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: envoy-lb-listener
                rds:
                  route_config_name: lb_route
                http_filters:
                  - name: envoy.filters.http.router
    - "@type": type.googleapis.com/envoy.config.route.v3.RouteConfiguration
      name: lb_route
      virtual_hosts:
        - name: "lb_route"
          domains: [ "*" ]
          routes:
            - match:
                prefix: "/"
              route:
                weighted_clusters:
                  clusters:
                    - name: "default/reviews-v1"
                      weight: 90
                    - name: "default/reviews-v2"
                      weight: 10
                retry_policy:
                  retry_on: 5xx
                  num_retries: 3
                  per_try_timeout: 1s
    - "@type": type.googleapis.com/envoy.config.cluster.v3.Cluster
      name: "default/reviews-v1"
      connect_timeout: 5s
      lb_policy: ROUND_ROBIN
      type: EDS
      outlier_detection:
        split_external_local_origin_errors: true
        consecutive_local_origin_failure: 2
    - "@type": type.googleapis.com/envoy.config.cluster.v3.Cluster
      name: "default/reviews-v2"
      connect_timeout: 3s
      lb_policy: ROUND_ROBIN
      type: EDS
      outlier_detection:
        split_external_local_origin_errors: true
        consecutive_local_origin_failure: 2

And now it works! Ninety percent of the time I get the following output:

{"id": "0","podname": "reviews-v1-55b668fc65-wggz8","clustername": "null","reviews": [{  "reviewer": "Reviewer1",  "text": "An extremely entertaining play by Shakespeare. The slapstick humour is refreshing!"},{  "reviewer": "Reviewer2",  "text": "Absolutely fun and entertaining. The play lacks thematic depth when compared to other plays by Shakespeare."}]}

It wasn’t easy, but I’ve finally found the right syntax to achieve my goal!
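
If you want to check the split for yourself, a quick way is to send a batch of requests and count which version answers, for example:

# Send 50 requests to the reviews service and tally which version responded
for i in $(seq 1 50); do
  kubectl exec deploy/ratings-v1 -- curl -s reviews:9080/reviews/0 \
    | grep -o 'reviews-v[12]'
done | sort | uniq -c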

A couple of other issues I ran into:

  • Many components in the Envoy configuration must have unique names across all CiliumEnvoyConfig objects (the listener names and the route names, for example). Otherwise, strange behaviors occur, such as a route from one CiliumEnvoyConfig being used by another CiliumEnvoyConfig.
  • When I submitted a new CiliumEnvoyConfig, I didn’t know whether it had been accepted or not. I had to check the Cilium Operator logs to see if the object had been rejected (see the example below).
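
For that second point, something along these lines can surface rejections, though the exact log wording may differ:

# Search the operator logs for CiliumEnvoyConfig-related messages (pattern is a guess)
kubectl -n kube-system logs deploy/cilium-operator | grep -i envoyconfig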

Summary

Let’s start with the ingress controller.

Having a separate Kubernetes service (and therefore a new Cloud load balancer) created for each ingress means that multiple application teams creating their own ingress objects would end up with many Cloud load balancers. That would cost a lot of money, and it would become very complicated for those teams to share the same domain name. Also, the ingress controller currently doesn’t really support any Kubernetes annotations, so you can’t do much with it.

All the features you find in other Kubernetes ingress controllers could be implemented in Cilium. However, it took other companies and communities years to get this right. There are many options out there, including powerful Cloud Native API gateways.

What about the L7 traffic management features?

I firmly believe that Envoy isn’t supposed to be configured directly by humans, but by a control plane. Even though I have five years of experience working with Envoy and have built the Envoy UI tool to help people understand Envoy configurations, it took me a lot of time and effort to find the right syntax to achieve my fairly simple goal. So, I think asking users to provide raw Envoy configuration isn’t a good idea. It’s hard to get the syntax right, and having several users submit their own configurations will quickly generate conflicts that are nearly impossible to troubleshoot.

I am also concerned about many applications sharing the same Envoy instance for L7 traffic management (noisy neighbors, scalability, and so on). While a new control plane could be built from scratch, it takes years to get one right.

I love Cilium as a CNI, and all the eBPF-based performance optimizations it provides for L3/L4 capabilities.

Learn more about Gloo Network for Cilium.