Issue
When a worker node crashes, controller/Operations center pods are not moved to a different worker node. Instead, the controller/Operations center pods get stuck in the Terminating or Unknown state.
kubectl get pods -n $NAMESPACE -o wide | grep -i Terminating

ControlerTest1-0   1/1   Terminating   0   7d11h   10.42.17.34   somenode1   <none>   <none>
ControlerTest2-0   1/1   Terminating   0   7d11h   10.42.17.35   somenode1   <none>   <none>
ControlerTest3-0   1/1   Terminating   0   7d10h   10.42.17.36   somenode1   <none>   <none>
Resolution
This happens when a Kubernetes worker node loses connectivity to the API server. Kubernetes (version 1.5 or newer) does not delete Pods just because a Node is unreachable; the Pods running on an unreachable Node enter the Terminating or Unknown state after a timeout. Pods may also enter these states when the user attempts graceful deletion of a Pod on an unreachable Node. In this case the Pod record still remains in the API server, so a replacement Pod is not scheduled: a StatefulSet requires each Pod to maintain a unique identity within the cluster.
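To confirm this situation, checking the Node and Pod status is usually enough. The commands below are a minimal diagnostic sketch; the node name somenode1 and the $NAMESPACE variable are taken from the example output above and should be replaced with your own values.

# List nodes and look for a NotReady status on the affected worker
kubectl get nodes

# Inspect the node's conditions and the last heartbeat reported by its kubelet
kubectl describe node somenode1

# List the Pods scheduled on that node together with their current state
kubectl get pods -n $NAMESPACE -o wide --field-selector spec.nodeName=somenode1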
Workaround
A Pod stuck in the Terminating state is removed by any of the actions below. Once its record is removed from the API server, the Pod can be rescheduled.
- The Node object is deleted, either by you or by the Node Controller (see the example after this list).
- The kubelet on the unresponsive Node starts responding, kills the Pod and removes the entry from the API server.
- Force deletion of the Pod by the user (kubectl delete pods <pod> --grace-period=0 --force). This should be used as a last resort.
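As an illustration of the first and last options, the following sketch assumes the unreachable node is somenode1 and the stuck Pod is ControlerTest1-0 in $NAMESPACE, as in the output above; adjust the names to your environment.

# Option 1: delete the unreachable Node object so its Pods are garbage-collected
kubectl delete node somenode1

# Last resort: force-delete the stuck Pod from the API server
kubectl delete pods ControlerTest1-0 -n $NAMESPACE --grace-period=0 --force

# Verify that a replacement Pod has been scheduled on another node
kubectl get pods -n $NAMESPACE -o wide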