I noticed a disturbance while deploying applications to my cluster, as I was using PVCs to provision volumes. Some volumes could not be created or mounted. The hcloud-csi-driver container in the hcloud-csi-node-xxxx pod logged the following errors:
Get https://api.hetzner.cloud/v1/volumes/5127341: dial tcp: i/o timeout"
Why would a connection to api.hetzner.cloud time out? I investigated both nodes, and not every node showed that timeout. Why do they behave differently when they are configured exactly the same? Further down the rabbit hole I found this kind of error message:
Get https://api.hetzner.cloud/v1/volumes/5127341: dial tcp: lookup api.hetzner.cloud on 10.96.0.10:53: read udp 10.244.2.70:55772->10.96.0.10:53: i/o timeout
It looks like this pod cannot connect to the coredns service in the cluster, so it cannot resolve api.hetzner.cloud and therefore cannot call the Hetzner API to create and attach the volume.
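One way to double-check this reading of the log is to pull the resolver address out of the error, compare it with the cluster DNS service, and then try the same lookup from a throwaway pod. This is only a rough sketch: it assumes a kubeadm-style cluster where the DNS service is named kube-dns in kube-system, and the function names, the pod name dns-probe and the busybox image tag are my own picks for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names come from hcloud-csi-driver or kubectl.

# Print the resolver (ip:port) named in a Go "lookup <host> on <ip>:<port>" error line.
# Prints nothing if the line contains no such lookup.
resolver_from_error() {
  printf '%s\n' "$1" | sed -n 's/.*lookup [^ ]* on \([0-9.]*:[0-9]*\):.*/\1/p'
}

# Print the ClusterIP of the cluster DNS service (assumed name: kube-dns).
cluster_dns_ip() {
  kubectl -n kube-system get svc kube-dns -o jsonpath='{.spec.clusterIP}'
}

# Run a single lookup from a temporary pod. The scheduler decides which node
# the pod lands on, so repeat it a few times to hit different nodes.
dns_probe() {
  kubectl run dns-probe --rm -i --restart=Never --image=busybox:1.28 \
    -- nslookup "${1:-api.hetzner.cloud}"
}
```

Fed with the second error message from above, resolver_from_error prints 10.96.0.10:53. If cluster_dns_ip prints the same address (without the port), the pod really was talking to the cluster DNS service, and dns_probe should then time out in the same way whenever it runs on an affected node.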
While restarting those pods and even restarting nodes (with kubectl drain <node> first), I noticed that the error occurs less often when only one node is up. And after temporarily removing the master toleration for coredns, so that everything in the above call chain runs on worker-1, the error is gone completely.
This led to the conclusion that the network between the nodes doesn't work as expected.
This passage on https://github.com/coreos/flannel/blob/master/Documentation/troubleshooting.md caught my eye:
In Docker v1.13 and later, the default iptables forwarding policy was changed to DROP. This problem manifests itself as connectivity problems between containers running on different hosts. To resolve it upgrade to the latest version of flannel.
But upgrading flannel to the latest available version didn't help. And the issue tracker for flannel shows that I'm not alone with that problem: https://github.com/coreos/flannel/issues?q=is%3Aissue+is%3Aopen+FORWARD
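To see whether a node is actually affected by that forwarding policy, the first line of iptables -S FORWARD shows the policy of the chain. A minimal sketch, assuming the nodes use iptables and you have root access; the function names are hypothetical, and I did not record this output during the original investigation:

```shell
#!/bin/sh
# Hypothetical helper: read `iptables -S FORWARD` output on stdin
# and print the policy of the FORWARD chain (ACCEPT or DROP).
forward_policy() {
  awk '$1 == "-P" && $2 == "FORWARD" { print $3; exit }'
}

# Check the local node (needs root). Defined only, not called automatically.
check_forward_policy() {
  policy=$(iptables -S FORWARD | forward_policy)
  if [ "$policy" = "DROP" ]; then
    echo "FORWARD policy is DROP: cross-host pod traffic depends on explicit ACCEPT rules"
  else
    echo "FORWARD policy is ${policy:-unknown}"
  fi
}
```

Running check_forward_policy on every node makes it easy to compare nodes that are supposed to be configured identically.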
Considering the MetalLB Network Addon Compatibility list, I had to find a network addon that “just” works. Calico, Cilium and Weave are all good alternatives that don't rely on flannel, so there is plenty of choice.
I went with Cilium because it has an active community and is fully compatible with MetalLB. Calico or Weave would have been just as reasonable; I simply had to pick one.
Replacing a CNI plugin/addon will cause downtime for all pods in your cluster.
# delete the flannel network from the cluster
kubectl delete -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
# on each node and master
rm /etc/cni/net.d/10-flannel.conf
# now restart
reboot
# now deploy Cilium
kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/v1.6.5/install/kubernetes/quick-install.yaml
# after waiting a few seconds (up to a minute or two) the network and all pods come up again.
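Instead of just waiting, the state of the Cilium agents can be checked explicitly. The following is a sketch under assumptions: I assume the quick-install manifest labels the agent pods k8s-app=cilium in the kube-system namespace, and the function names are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS ...) and print the names of pods that are not
# Running or whose READY column is not n/n.
not_ready_pods() {
  awk '{ split($2, r, "/"); if ($3 != "Running" || r[1] != r[2]) print $1 }'
}

# List Cilium agent pods that are not ready yet.
# Assumption: pods carry the label k8s-app=cilium in kube-system.
cilium_not_ready() {
  kubectl -n kube-system get pods -l k8s-app=cilium --no-headers | not_ready_pods
}
```

Once cilium_not_ready prints nothing, every agent pod reports Running with all containers ready, and the remaining pods in the cluster should recover shortly after.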
I tested with both Cilium and WeaveNet and both addons work as far as I can tell right now.