Kubernetes Cluster Doomsday
Kubernetes has a ticking time bomb: the cluster's certificates. Kubernetes clusters run on top of TLS and rely on PKI certificates for authentication.
The Kubernetes cluster certificates have a lifespan of one year. If they expire on the Kubernetes master, the kubelet service fails, and issuing a kubectl command such as kubectl get pods or kubectl exec -it container_name bash results in a message similar to Unable to connect to the server: x509: certificate has expired or is not yet valid.
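If you want to know when that bomb goes off ahead of time, one way (a sketch, assuming the default kubeadm layout under /etc/kubernetes/pki on the master) is to ask openssl for the expiry date of the API server certificate:
# Default kubeadm path; adjust if your cluster stores certificates elsewhere
$ sudo openssl x509 -enddate -noout -in /etc/kubernetes/pki/apiserver.crt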
I already knew about that, and I renewed our cluster certificates on Mar 16, 2020 04:20 UTC.
A. Nightmare is coming
But on 2020-08-09 at 18:26 (+0700) our cluster went down; every service was down. I tried issuing a kubectl command to check the cluster:
$ kubectl get node -o wide
Unable to connect to the server: x509: certificate has expired or is not yet valid
Then I checked the kubelet log:
$ sudo journalctl -fu kubelet
août 09 13:46:38 KubeMaster kubelet[16214]: E0809 13:46:38.476674 16214 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://192.168.140.66:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dkubemaster&limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
août 09 13:46:38 KubeMaster kubelet[16214]: E0809 13:46:38.532150 16214 kubelet.go:2248] node "kubemaster" not found
Hey, what happened? Why are our cluster certificates expired? I already renewed them in March 2020. Let's check our cluster certificates:
sumar@KubeMaster:~$ sudo kubeadm alpha certs check-expiration
[sudo] Mot de passe de sumar :
CERTIFICATE EXPIRES RESIDUAL TIME EXTERNALLY MANAGED
admin.conf Mar 04, 2021 04:20 UTC 218d no
apiserver Mar 04, 2021 04:20 UTC 218d no
apiserver-etcd-client Mar 04, 2021 04:20 UTC 218d no
apiserver-kubelet-client Mar 04, 2021 04:20 UTC 218d no
controller-manager.conf Mar 04, 2021 04:20 UTC 218d no
etcd-healthcheck-client Mar 04, 2021 04:20 UTC 218d no
etcd-peer Mar 04, 2021 04:20 UTC 218d no
etcd-server Mar 04, 2021 04:20 UTC 218d no
front-proxy-client Mar 04, 2021 04:20 UTC 218d no
scheduler.conf Mar 04, 2021 04:20 UTC 218d no
They are not expired yet. Shit, what is going on?
B. Root cause
Debugging Kubernetes clusters is a pain in the ass, but at least I found the real issue. The root of the whole problem is that the kube-apiserver container did not pick up our new certificates. Hell yeah!
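One way to see that mismatch for yourself (a sketch, assuming the default kubeadm certificate paths and that the API server address 192.168.140.66:6443 from the log above is reachable) is to compare the certificate on disk with the one the running kube-apiserver actually serves:
# Expiry of the renewed certificate sitting on disk
$ sudo openssl x509 -enddate -noout -in /etc/kubernetes/pki/apiserver.crt
# Expiry of the certificate the running API server presents on the wire
$ echo | openssl s_client -connect 192.168.140.66:6443 2>/dev/null | openssl x509 -enddate -noout
If the two dates differ, the kube-apiserver static pod is still serving the old certificate it loaded at startup.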
C. Wake up from a nightmare
The solution that saved my ass (the full sequence is summarized in a sketch after these steps):
Regenerate all cluster certificates
$ kubeadm alpha certs renew all
certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healtcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed
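To double-check that the renewal landed on disk, you can re-run the same expiration check from earlier; the running API server will not pick the new certificates up until it is restarted, which is what the next steps are for:
$ sudo kubeadm alpha certs check-expiration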
Stop the kubelet service on all nodes
$ sudo systemctl stop kubelet
Stop the docker service on all nodes
$ sudo systemctl stop docker
Start the docker service on the master node
$ sudo systemctl start docker
Start the kubelet service on the master node
$ sudo systemctl start kubelet
Start the docker service on every worker node
$ sudo systemctl start docker
Start the kubelet service on every worker node
$ sudo systemctl start kubelet
Monitor the kubelet log on all nodes
$ sudo journalctl -fu kubelet
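Put together, the whole sequence looks roughly like this (a sketch, assuming a single master, SSH access to the workers, and placeholder hostnames worker1 and worker2):
# On every node (master and workers): stop kubelet, then docker
$ sudo systemctl stop kubelet
$ sudo systemctl stop docker
# On the master first, so the control plane static pods (kube-apiserver, etcd, ...)
# come back up with the renewed certificates
$ sudo systemctl start docker
$ sudo systemctl start kubelet
# Then on each worker (placeholder hostnames)
$ ssh worker1 'sudo systemctl start docker && sudo systemctl start kubelet'
$ ssh worker2 'sudo systemctl start docker && sudo systemctl start kubelet'
# Watch the kubelet log on every node until the x509 errors are gone
$ sudo journalctl -fu kubelet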
Why not use restart instead of a full stop and start? I tried the restart command, but it did not work: the kubelet still did not pick up the new certificates.
And now, wow, the logs went back to normal. All the error logs related to certificates were gone. But wait, not so fast. A new error appeared: our Calico pods were unable to start, so pods could not communicate with each other. The solution is to delete the pods so Kubernetes recreates them.
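For example (assuming Calico is deployed the standard way, as a DaemonSet in kube-system whose pods carry the k8s-app=calico-node label; adjust for your install):
# Find the stuck Calico pods
$ kubectl -n kube-system get pods -l k8s-app=calico-node
# Delete them; the DaemonSet controller recreates them immediately
$ kubectl -n kube-system delete pods -l k8s-app=calico-node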
The total downtime caused by this issue was about 1 hour 10 minutes.