Could network cache-based identity be mistaken?

A few days ago, I published the Exploring Cilium Layer 7 Capabilities Compared to Istio blog, where I mentioned that network cache-based identity may fail when a pod dies and a new pod with a different identity is created and gets the old pod’s IP. Thank you everyone for sending me feedback about the blog! In this blog, I would like to demonstrate how identity can be mistaken with network cache-based identity: identities are generated from network information such as Kubernetes pod IPs, and the IP-to-identity mappings are cached, so they can become outdated and wrong. You’ll see a client that should NOT be able to call a server per the network policy succeed anyway, because the wrong identity is assigned to the client, allowing it to bypass network policy enforcement.

In this experiment, you’ll set up a Kubernetes kind cluster and deploy v1 and v2 of the client application (sleep) and v1 and v2 of the server application (helloworld), along with a v1 network policy that allows ONLY the v1 client to call the v1 server and a v2 network policy that allows ONLY the v2 client to call the v2 server. You’ll first observe the network policies enforced as expected. Then you’ll trigger an error scenario while scaling client pods up and down, and call the v1 server successfully from the v2 client because the v2 client has the v1 client’s identity. Let us get started!

Setting up the environment

To run the test, you can create a Kubernetes kind cluster with 3 workers, disabling the default CNI per Cilium’s documentation:

kindConfig="
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
networking:
  disableDefaultCNI: true
"
kind create cluster --config=- <<<"${kindConfig}"

Download the Cilium image and load it into the kind cluster:

docker pull quay.io/cilium/cilium:v1.12.0
kind load docker-image quay.io/cilium/cilium:v1.12.0

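If you haven’t added the Cilium Helm repository to your machine yet, add it first (this assumes the standard Cilium chart location):

helm repo add cilium https://helm.cilium.io/
helm repo update
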
Install Cilium v1.12 with pod-to-pod encryption enabled using WireGuard:

helm install cilium cilium/cilium --version 1.12.0 \
  --namespace kube-system \
  --set socketLB.enabled=false \
  --set externalIPs.enabled=true \
  --set bpf.masquerade=false \
  --set image.pullPolicy=IfNotPresent \
  --set ipam.mode=kubernetes \
  --set l7Proxy=false \
  --set encryption.enabled=true \
  --set encryption.type=wireguard

Check the Cilium pods to ensure they have all reached the Running status:

kubectl get pod -A | grep cilium
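
Alternatively, you can wait for the Cilium DaemonSet to finish rolling out; a minimal sketch, assuming the default DaemonSet name cilium:

kubectl -n kube-system rollout status daemonset/cilium --timeout=120s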

Deploy the applications and network policies

We have two applications, sleep (client) and helloworld (server). Both sleep and helloworld have two versions, v1 and v2. We also have two simple L4 network policies that allow sleep-v1 to call helloworld-v1 and sleep-v2 to call helloworld-v2. All other calls to helloworld-v1 or helloworld-v2 should be denied.


Clone the repo, then deploy the sleep-v1 and sleep-v2 deployments along with the helloworld-v1 and helloworld-v2 deployments. The sleep-v1 deployment has 15 replicas, while the sleep-v2, helloworld-v1, and helloworld-v2 deployments each have 1 replica.

kubectl apply -f yamls/helloworld.yaml
kubectl apply -f yamls/sleep.yaml
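
Before moving on, you can verify the replica counts using the same labels the manifests apply (app=sleep and app=helloworld):

kubectl get pods -l app=sleep
kubectl get pods -l app=helloworld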

Deploy the v1 Cilium L4 network policy. The v1 network policy allows ONLY sleep-v1 to call helloworld-v1; sleep-v2 should NOT be able to call helloworld-v1.

kubectl apply -f - <<EOF
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "v1"
spec:
  endpointSelector:
    matchLabels:
      io.cilium.k8s.policy.serviceaccount: helloworld-v1
  ingress:
  - fromEndpoints:
    - matchLabels:
        io.cilium.k8s.policy.serviceaccount: sleep-v1
    toPorts:
    - ports:
      - port: "5000"
        protocol: TCP
EOF

Deploy the v2 Cilium L4 network policy. The v2 network policy allows ONLY sleep-v2 to call helloworld-v2; sleep-v1 should NOT be able to call helloworld-v2.

kubectl apply -f - <<EOF
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "v2"
spec:
  endpointSelector:
    matchLabels:
      io.cilium.k8s.policy.serviceaccount: helloworld-v2
  ingress:
  - fromEndpoints:
    - matchLabels:
        io.cilium.k8s.policy.serviceaccount: sleep-v2
    toPorts:
    - ports:
      - port: "5000"
        protocol: TCP
EOF
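
Confirm that both policies were created:

kubectl get ciliumnetworkpolicies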

Can sleep-v2 call helloworld-v1 successfully when it should not be allowed?

With the above applications and network policies deployed, in most cases the network policies will be effective, so sleep-v2 will not be able to call helloworld-v1. Assuming all of your sleep and helloworld pods are up and running, you can call helloworld-v1 from the sleep-v1 pod and helloworld-v2 from the sleep-v2 pod:

# Show that v1 talks only to v1 and v2 only to v2.
for i in 1 2; do
   for j in 1 2; do
       echo Trying to connect from deploy/sleep-v$i to helloworld-v$j
        (kubectl exec deploy/sleep-v$i -- curl -s http://helloworld-v$j:5000/hello --max-time 2 && echo "Connection success.") || echo "Connection Failed."
       echo
   done
done

You’ll get output showing that only sleep-v1 can call helloworld-v1 and only sleep-v2 can call helloworld-v2, and nothing else; when sleep-v2 calls helloworld-v1, the connection fails.

Use the commands below to display Cilium’s IP cache from the agent on the node where helloworld-v1 runs:

NODEV1=$(kubectl get pod -l app=helloworld,version=v1 -o=jsonpath='{.items[0].spec.nodeName}')
AGENT_POD="$(kubectl get pods -n kube-system --field-selector spec.nodeName="$NODEV1" -l k8s-app=cilium -o custom-columns=:.metadata.name --no-headers)"
kubectl exec -n kube-system $AGENT_POD -c cilium-agent -- cilium map get cilium_ipcache

You’ll see Cilium’s IP cache, which maps each pod IP to a numeric security identity.
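
To zero in on a single entry, you can filter the cache by a pod’s IP; a small convenience sketch that reuses the AGENT_POD variable from above:

# Look up the IP cache entry for one sleep-v1 pod
SLEEPV1_IP=$(kubectl get pod -l app=sleep,version=v1 -o=jsonpath='{.items[0].status.podIP}')
kubectl exec -n kube-system $AGENT_POD -c cilium-agent -- cilium map get cilium_ipcache | grep $SLEEPV1_IP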

Display sleep-v1’s Cilium identity:

# get the identity of sleep-v1:
SLEEPV1_ID=$(kubectl get ciliumendpoints.cilium.io -l app=sleep,version=v1 -o jsonpath='{.items[0].status.identity.id}')
echo "Security identity for sleep-v1 is $SLEEPV1_ID. CiliumIdentity:"
kubectl get ciliumidentities.cilium.io $SLEEPV1_ID -o yaml

The sample output shows that 13174 is sleep-v1’s Cilium identity; the identity number will differ in your environment.

Prepare the environment to simulate a failure

In a perfect world where everything works, you’ll observe the network policies enforced as shown above. However, in certain scenarios sleep-v2 could call helloworld-v1 successfully, for various reasons such as the ones below:

  • A network outage
  • The Cilium pod being down, e.g. during an upgrade
  • Resource constraints that cause the agent to be slow
  • Resource constraints that cause the API server to be slow in sending pod events to the Cilium pod
  • A programming bug that causes the agent to crash
  • Etc.

While these events are not likely, they are not outside the realm of reality, and when they happen it is possible for sleep-v2 to call helloworld-v1 successfully.

To demonstrate the eventual-consistency issue that can cause a wrong identity for a pod, let us simulate an “outage” by preventing the Cilium pod on the node where helloworld-v1 runs from reaching the Kubernetes API server.

# bye bye api server
API_SERVER=$(kubectl get service -n default kubernetes -o=jsonpath='{.spec.clusterIP}')
API_SERVEREP=$(kubectl get endpoints -n default kubernetes -o jsonpath='{.subsets[0].addresses[0].ip}')
docker exec $NODEV1 iptables -t mangle -I INPUT -p tcp -s $API_SERVER -j DROP
docker exec $NODEV1 iptables -t mangle -I INPUT -p tcp -s $API_SERVEREP -j DROP
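
When you are done experimenting, you can lift the simulated outage by deleting the same rules; a minimal sketch, assuming the variables above are still set:

# Restore API server connectivity by removing the DROP rules
docker exec $NODEV1 iptables -t mangle -D INPUT -p tcp -s $API_SERVER -j DROP
docker exec $NODEV1 iptables -t mangle -D INPUT -p tcp -s $API_SERVEREP -j DROP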

Review the test script

Let us review the run-test.sh script together before we run the test. First, we record the IPs used by all of the sleep-v1 pods. Then we scale the sleep-v1 deployment to 0 and the sleep-v2 deployment to 15 to simulate an environment where pods go up and down rapidly.

# Find all the IPs of sleep-v1:
SLEEPV1_IPs=$(kubectl get pod -l app=sleep,version=v1 -o json | jq -r '.items[]|.status.podIP')
# scale sleepv1 to 0
kubectl scale deploy sleep-v1 --replicas=0
# scale sleepv2 to 15
kubectl scale deploy sleep-v2 --replicas=15

Then we keep rotating the sleep-v2 pods until one of them is assigned an IP that one of the sleep-v1 pods used before the scale-down.

# rotate sleep-v2 pods until we get a sleep-v2 pod with ip of sleep-v1
while true; do
   # check if we got a hit
   for ip in $SLEEPV1_IPs; do
       SLEEPV2POD=$(kubectl get pod -l app=sleep,version=v2 -o json|jq -r '.items|map(select(.status.podIP == "'$ip'"))|map(.metadata.name)|.[0]' )
       if [ -z "$SLEEPV2POD" ] || [ "$SLEEPV2POD" == "null" ]; then
           SLEEPV2POD=""
       else
           echo ""
           echo "Found sleep-v2 pod $SLEEPV2POD ip $ip"
           break
       fi
   done

   if [ -z "$SLEEPV2POD" ]; then
       echo "Matching sleep-v2 not found yet - retrying rollout"
       kubectl rollout restart deployment/sleep-v2
       # rollout status is really verbose, so send output to /dev/null
       kubectl rollout status deployment/sleep-v2 > /dev/null
   else
       break
   fi
done

Now that you have identified your sleep-v2 pod, call helloworld-v1 from it. Because the pod reuses an earlier sleep-v1 pod’s IP address, it carries that pod’s mistaken identity. The call should fail per the v1 network policy; however, it succeeds in the test.

# Try curl from the sleep-v2 pod we found to the helloworld-v1 deployment. This should fail according to policy
# but will succeed because the ip-cache is not up to date.
echo ""
echo "Trying to curl from sleep-v2 to helloworld-v1. running:"
echo kubectl exec $SLEEPV2POD -- curl -s http://helloworld-v1:5000/hello --max-time 2
(kubectl exec $SLEEPV2POD -- curl -s http://helloworld-v1:5000/hello --max-time 2 && echo "Connection success.") || echo "Connection Failed."
echo ""

Check the Cilium IP Cache to obtain the sleep-v2 pod’s identity from its IP address.

# we can't see the ip-cache with kubectl now because of the "outage" we triggered. So we use docker/crictl instead:
echo "Current ip cache:"
docker exec $NODEV1 /bin/bash -c 'crictl exec $(crictl ps --name cilium-agent -q) cilium map get cilium_ipcache'
echo ""
echo "Specifically, note the identity of the sleep-v2 pod's IP:"
docker exec $NODEV1 /bin/bash -c 'crictl exec $(crictl ps --name cilium-agent -q) cilium map get cilium_ipcache' | grep $ip
echo "and compare to above ip cache and identity."

Run the test!

Simply issue run-test.sh to run the test. You’ll get output showing the sleep-v2 pod (IP address 10.244.2.65 in my run) calling helloworld-v1 successfully after a few rotations of the sleep-v2 pods.

From Cilium’s IP cache map output, IP address 10.244.2.65 is associated with identity 13174. If you recall, before you ran the test, 13174 was sleep-v1’s Cilium identity.

Display the identity using the kubectl get ciliumidentity 13174 -o yaml command. The output shows that the sleep-v2 pod with IP address 10.244.2.65 has sleep-v1’s identity 13174. This explains why you could curl helloworld-v1 from the sleep-v2 pod successfully earlier, even though the v1 network policy ONLY allows sleep-v1 pods to call helloworld-v1.
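
For reference, with the identity number from this run (yours will differ):

kubectl get ciliumidentity 13174 -o yaml

In the output, the security-labels section includes io.cilium.k8s.policy.serviceaccount: sleep-v1, confirming the identity belongs to sleep-v1.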


Take a look at the short video to watch me run the above steps in my test environment.

Wrapping Up

As demonstrated above, network cache-based identity can be mistaken in certain scenarios, even with WireGuard encryption enabled. Without a coherent cache (regardless of what caused the incoherence), the identity can be wrong, allowing the network policy to be bypassed. Honestly, this is simply a limitation of network-based identity, and I would not consider it a bug. While Cilium is used in our testing, the same applies to any CNI that doesn’t use cryptographic primitives for identity. To achieve defense in depth, you should consider security policies from a service mesh that provides cryptographic identity in addition to L3/L4 network policies.
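
For illustration, here is a sketch of what a cryptographic-identity policy could look like with Istio; I’m assuming the same service accounts in the default namespace and a mesh that issues mTLS certificates to workloads. The source principal comes from the client’s certificate rather than its IP, so a reused IP cannot impersonate sleep-v1:

kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: helloworld-v1
spec:
  selector:
    matchLabels:
      app: helloworld
      version: v1
  action: ALLOW
  rules:
  - from:
    - source:
        # Identity is derived from the client certificate, not the pod IP
        principals: ["cluster.local/ns/default/sa/sleep-v1"]
EOF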