Agent provisioning fails and the Jenkins logs show something similar to the following:
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.162.0.1/api/v1/namespaces/cloudbees-core/pods. Message: Operation cannot be fulfilled on resourcequotas "<resourceQuotaName>": the object has been modified; please apply your changes to the latest version and try again.
This is more likely to occur when many pods are scheduled simultaneously. In Jenkins, this may happen when many agents need to be provisioned at once, for example when building jobs in bulk or using parallel steps with a large number of tasks.
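The error text points to an optimistic-concurrency conflict: an update is rejected because it was based on an outdated version of the ResourceQuota object. The sketch below is a simplified toy model of that mechanism, not Kubernetes code. The names `QuotaObject`, `Conflict`, and `simulate_simultaneous_creates` are hypothetical, and it assumes that every writer reads the object before any of them writes.

```python
class Conflict(Exception):
    """Raised when an update is based on a stale version of the object."""


class QuotaObject:
    """Toy stand-in for a versioned object such as a ResourceQuota."""

    def __init__(self, used_pods=0):
        self.version = 1
        self.used_pods = used_pods

    def read(self):
        # Return a snapshot: the version seen and the usage at that time.
        return self.version, self.used_pods

    def update(self, seen_version, new_used_pods):
        # Accept the write only if nobody else changed the object in between.
        if seen_version != self.version:
            raise Conflict("the object has been modified")
        self.used_pods = new_used_pods
        self.version += 1


def simulate_simultaneous_creates(writers):
    """All writers read the same snapshot first, then try to write in turn."""
    quota = QuotaObject()
    snapshots = [quota.read() for _ in range(writers)]
    succeeded, conflicted = 0, 0
    for seen_version, used in snapshots:
        try:
            quota.update(seen_version, used + 1)
            succeeded += 1
        except Conflict:
            conflicted += 1
    return succeeded, conflicted
```

In this model only the first writer succeeds and every other writer hits a conflict, which is why the chance of failure grows with the number of pods created at the same moment.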
|In GKE, ResourceQuotas are automatically applied to every namespace under certain conditions and cannot be deleted. See https://cloud.google.com/kubernetes-engine/quotas for more details.|
There are several workarounds that could help to prevent agent failures caused by this issue:
By default, the Jenkins NodeProvisioner bases its decisions on load statistics, which gives acceptable average queue waiting times while preventing over-provisioning.
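There are, however, provisioning strategies in Jenkins that aim to speed up agent provisioning in the context of Kubernetes, such as NoDelayProvisioningStrategy and, in CloudBees Core on Modern Cloud Platform, KubernetesNodeProvisionerStrategy.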
Those strategies launch agents as soon as they are needed. In a scenario where build tasks are launched in bulk, several agent pods may be scheduled almost simultaneously and trigger this Resource Quotas issue, whereas the default strategy provisions agents more gradually.
Therefore a workaround is to disable those provisioning strategies:
Disable the NoDelayProvisioningStrategy by adding the system property -Dio.jenkins.plugins.kubernetes.disableNoDelayProvisioning=true on the controller’s startup.
If using CloudBees Core on Modern Cloud Platform, disable the KubernetesNodeProvisionerStrategy by adding the system property -Dcom.cloudbees.jenkins.plugins.kube.KubernetesNodeProvisionerStrategy.enabled=false on the controller’s startup.
This requires a restart of the controller. See the article "How to add Java arguments to Jenkins?" for details.
|Disabling those strategies may result in slower provisioning overall, although the difference may be negligible depending on the workload of the controller.|
Avoid scheduling too many builds in bulk. Instead of launching hundreds of tasks at once, throttle the scheduling, for example by launching tasks in smaller chunks.
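As a minimal sketch of that throttling idea, the helper below splits a list of tasks into chunks and pauses between them. It is not tied to any Jenkins API: `trigger` is a hypothetical callback standing in for whatever actually starts a build, and the chunk size and pause length are arbitrary values to tune per environment.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def schedule_in_chunks(tasks, trigger, chunk_size=10, pause_seconds=30.0,
                       sleep=time.sleep):
    """Call `trigger(task)` for each task, pausing between chunks.

    `trigger` is a hypothetical callback that starts one build.
    Returns the number of chunks that were scheduled.
    """
    batches = list(chunked(tasks, chunk_size))
    for index, batch in enumerate(batches):
        for task in batch:
            trigger(task)
        # No need to wait after the final chunk.
        if index < len(batches) - 1:
            sleep(pause_seconds)
    return len(batches)
```

With 25 tasks and a chunk size of 10, this starts three chunks (10, 10, and 5 tasks) with two pauses in between, so fewer agent pods are requested at the same instant.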
If acceptable and possible, remove the Resource Quotas in the namespace where agents are spun up.
Note: This might not be possible in GKE; see https://cloud.google.com/kubernetes-engine/quotas.
In CloudBees Core on Modern Cloud Platform, the Kube Agent Management plugin sets up an exponential backoff period for agent provisioning failures. Jenkins waits a certain amount of time before retrying to provision a specific agent template.
When hitting this Kubernetes issue due to the Resource Quotas, the re-provisioning of the failed agent may be considerably delayed by this backoff period. In environments where this is a problem, an additional workaround is to either disable the backoff period or reduce the maximum backoff time that Jenkins waits before retrying to provision the same agent.
The maximum backoff period may be decreased by adding the system property
Since version 1.1.32 of the Kube Agent Management plugin, the backoff period may be disabled by adding the system property
|This workaround does not prevent agent failures, but it avoids delaying the re-provisioning of failed agents.|
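To illustrate why lowering the maximum backoff shortens re-provisioning delays, here is a generic sketch of a capped exponential backoff. It does not reproduce the Kube Agent Management plugin's implementation: the function name, the initial delay, the growth factor, and the cap are all hypothetical values chosen for the example.

```python
def backoff_delays(attempts, initial_seconds=1.0, factor=2.0, max_seconds=None):
    """Return the wait before each retry for a capped exponential backoff.

    All parameter values are illustrative assumptions, not plugin defaults.
    """
    delays = []
    delay = initial_seconds
    for _ in range(attempts):
        # Apply the cap, if any, without affecting the underlying growth.
        capped = delay if max_seconds is None else min(delay, max_seconds)
        delays.append(capped)
        delay *= factor
    return delays
```

For five retries, the uncapped delays are 1, 2, 4, 8, and 16 seconds (31 seconds in total), while a cap of 4 seconds gives 1, 2, 4, 4, and 4 seconds (15 seconds in total).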