Controller/Operations center pods stuck in terminating state

Article ID: 360059046072

Issue

When a worker node crashes, controller/Operations center pods are not moved to a different worker node. Instead, the controller/Operations center pods get stuck in the Terminating or Unknown state.

 kubectl get pods -n $NAMESPACE -o wide | grep -i Terminating
ControllerTest1-0   1/1   Terminating   0   7d11h   10.42.17.34   somenode1   <none>   <none>
ControllerTest2-0   1/1   Terminating   0   7d11h   10.42.17.35   somenode1   <none>   <none>
ControllerTest3-0   1/1   Terminating   0   7d10h   10.42.17.36   somenode1   <none>   <none>

Resolution

This happens when a Kubernetes worker node loses connectivity to the API server. Kubernetes (version 1.5 or newer) does not delete Pods just because a Node is unreachable. Pods running on an unreachable Node enter the Terminating or Unknown state after a timeout. Pods may also enter these states when the user attempts a graceful deletion of a Pod on an unreachable Node. In this case the Pod object still remains in the API server, and a replacement Pod is not scheduled because a StatefulSet requires each Pod to keep a unique identity within the cluster.
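Before removing anything, it helps to confirm that the Node hosting the stuck Pods is actually unreachable. The commands below are a minimal check, assuming the node name somenode1 from the sample output above; an unreachable node normally reports STATUS NotReady and carries the node.kubernetes.io/unreachable taint added by the node controller.

 # An unreachable node typically shows STATUS NotReady
 kubectl get nodes
 # Check the Conditions and Taints sections (node.kubernetes.io/unreachable) for details
 kubectl describe node somenode1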

Workaround

A Pod stuck in the Terminating state is removed by any of the following actions. Once its record is removed from the API server, a replacement Pod can be scheduled.

  • The Node object is deleted (either by you or by the node controller).

  • The kubelet on the unresponsive Node starts responding again, kills the Pod, and removes the entry from the API server.

  • The user force deletes the Pod (kubectl delete pods <pod> --grace-period=0 --force). This should only be used as a last resort (see the example after this list).
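As an illustration, assuming the failed worker is somenode1 and $NAMESPACE holds the controller's namespace (both taken from the sample output above), the first and last options would look like this:

 # Delete the Node object so its Pods are garbage-collected and rescheduled
 kubectl delete node somenode1

 # Last resort: force delete a stuck Pod, bypassing the grace period
 kubectl delete pod ControllerTest1-0 -n $NAMESPACE --grace-period=0 --force

Force deletion removes the Pod record from the API server immediately, without confirming that the kubelet has actually stopped the container, so only use it once the node is known to be down.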