I noticed a disturbance while deploying applications to my cluster that use a PVC to provision volumes: some volumes could not be created or mounted. The hcloud-csi-driver container in the hcloud-csi-node-xxxx pods logged the following error:

Get https://api.hetzner.cloud/v1/volumes/5127341: dial tcp: i/o timeout"

Why would a connection to api.hetzner.cloud time out? I investigated both nodes, and not every node showed that timeout. Why would they behave differently while being configured exactly the same? Further down the rabbit hole I found this kind of error message:

Get https://api.hetzner.cloud/v1/volumes/5127341: dial tcp: lookup api.hetzner.cloud on 10.96.0.10:53: read udp 10.244.2.70:55772->10.96.0.10:53: i/o timeout

It looks like this pod cannot connect to the CoreDNS service in the cluster, so it cannot resolve api.hetzner.cloud and therefore cannot call the Hetzner API to create and attach the volume. While restarting those pods and even restarting nodes (kubectl drain <node> first), I noticed that when only one node is up, this error doesn't occur as often. And after temporarily removing the master toleration for CoreDNS, so that everything in the above call chain ran on worker-1, the error was gone completely. This led to the conclusion that the network doesn't work as expected.
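To confirm that in-cluster DNS is the broken link, it helps to test resolution from inside a pod directly. A quick way to do that is a throwaway busybox pod; the pod name and image tag below are illustrative, and 10.96.0.10 is the CoreDNS service IP taken from the error message above:

```shell
# Resolve api.hetzner.cloud via the cluster's default DNS config:
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.31 -- \
  nslookup api.hetzner.cloud

# Query the CoreDNS service IP explicitly, to separate
# "wrong resolv.conf" from "can't reach CoreDNS at all":
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.31 -- \
  nslookup api.hetzner.cloud 10.96.0.10
```

If the second command times out from pods on one node but works from pods on the node where CoreDNS runs, the problem is cross-node pod networking, not DNS itself.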

The first sentence of https://github.com/coreos/flannel/blob/master/Documentation/troubleshooting.md:

In Docker v1.13 and later, the default iptables forwarding policy was changed to DROP. This problem manifests itself as connectivity problems between containers running on different hosts. To resolve it upgrade to the latest version of flannel.

caught my eye. But even after upgrading flannel to the latest available version, it didn't help. And the flannel issue tracker shows that I'm not alone with this problem: https://github.com/coreos/flannel/issues?q=is%3Aissue+is%3Aopen+FORWARD
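You can check whether this Docker behavior is biting a node by inspecting the default policy of the iptables FORWARD chain there (these commands need root on the node):

```shell
# Show the default FORWARD policy; with Docker >= 1.13 this is
# often "-P FORWARD DROP", which silently discards cross-node pod traffic.
iptables -S FORWARD | head -n 1

# Temporary workaround to confirm the diagnosis; this is lost
# on reboot and is not a real fix:
iptables -P FORWARD ACCEPT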

Considering the MetalLB network addon compatibility list, I had to find a network addon that "just works". With Calico, Cilium and Weave as great alternatives that don't rely on flannel, there is plenty of choice.

I went with Cilium, because it has an active community and full compatibility with MetalLB. I could just as well have chosen Calico or Weave, but I had to go with something.

Replacing a CNI plugin/addon will cause downtime for all pods in your cluster.

# delete the flannel network from the cluster
kubectl delete -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

# on each node and master
rm /etc/cni/net.d/10-flannel.conf

# now restart
reboot

# now deploy Cilium
kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/v1.7.1/install/kubernetes/quick-install.yaml

# After waiting a few seconds (up to a minute or two),
# the network and all pods come up again.
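To verify the migration worked, check that the Cilium agents are running on every node and repeat the DNS test from earlier (pod name and busybox image are again just examples):

```shell
# One cilium pod per node, all Running:
kubectl -n kube-system get pods -l k8s-app=cilium -o wide

# Re-check DNS from inside a pod to confirm the original symptom is gone:
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.31 -- \
  nslookup api.hetzner.cloud
```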

I tested with both Cilium and WeaveNet, and both addons work as far as I can tell right now.