Issue
- A large number of pods are waiting in the scheduling queue, and pods are going into Failed and Pending status, with messages such as the following:
Pod randomPodName marked as unschedulable can be scheduled on ip-XXX-XX-XX-XX.ec2.internal. Ignoring in scale up."
Explanation
This is caused by an issue in Kubernetes and the Cluster Autoscaler in versions prior to 1.11.7.
Resolution
Kubernetes versions affected
- As a temporary workaround, taint the node reported in the message so that Kubernetes no longer schedules pods on it. The taint prevents new pods from being scheduled there, and after a few seconds the autoscaler should start scaling up properly.
- As a long-term resolution, upgrade to Kubernetes 1.11.7 and a supported Cluster Autoscaler version, following the guidelines listed in the related issue: Autoscaler fails to scale up nodes with pending pods
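The temporary workaround can be sketched as a small helper that reads the node name out of the message shown in the Issue section and builds the matching `kubectl taint` command. This is only an illustration: the function names and the taint key and value (`autoscaler-workaround=true`) are hypothetical and not part of the original article, so substitute whatever your cluster conventions require. The script prints the commands rather than running them.

```python
import re
import shlex

# Matches the message quoted in the Issue section and captures the node name.
MESSAGE_RE = re.compile(
    r"Pod (?P<pod>\S+) marked as unschedulable can be scheduled on "
    r"(?P<node>\S+)\. Ignoring in scale up"
)


def node_from_message(line):
    """Return the node named in the message, or None if the line does not match."""
    match = MESSAGE_RE.search(line)
    return match.group("node") if match else None


def taint_command(node, key="autoscaler-workaround", value="true", remove=False):
    """Build a kubectl command that adds (or removes) a NoSchedule taint.

    The key and value defaults are hypothetical placeholders.
    A trailing "-" on the taint tells kubectl to remove it.
    """
    taint = f"{key}={value}:NoSchedule"
    if remove:
        taint += "-"
    return ["kubectl", "taint", "nodes", node, taint]


if __name__ == "__main__":
    sample = (
        "Pod randomPodName marked as unschedulable can be scheduled on "
        "ip-XXX-XX-XX-XX.ec2.internal. Ignoring in scale up."
    )
    node = node_from_message(sample)
    if node is not None:
        # Print the commands so they can be reviewed before being run by hand.
        print(shlex.join(taint_command(node)))
        print(shlex.join(taint_command(node, remove=True)))
```

The second printed command removes the taint again, which is useful once the cluster has been upgraded and the node should accept pods as before.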