Kubernetes Agent Provisioning may fail when using ResourceQuotas

Article ID: 360042998592

Issue

  • Agent provisioning fails and the Jenkins logs show something similar to the following:

io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.162.0.1/api/v1/namespaces/cloudbees-core/pods. Message: Operation cannot be fulfilled on resourcequotas "<resourceQuotaName>": the object has been modified; please apply your changes to the latest version and try again.

Explanation

This problem may occur when using ResourceQuotas. It is a known Kubernetes issue, kubernetes #67761, related to the way the kube-controller refreshes ResourceQuotas once pods are scheduled.

It is more likely to occur when scheduling many pods simultaneously. In Jenkins, this may happen when many agents need to be provisioned at once, for example when building jobs in bulk or when using parallel steps with many tasks.

In GKE, ResourceQuotas are automatically applied to every namespace under certain conditions and cannot be deleted. See https://cloud.google.com/kubernetes-engine/quotas for more details.
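
To confirm whether this applies to your environment, you can list the ResourceQuotas present in the namespace where agent pods are scheduled. A minimal sketch, assuming kubectl access and an agent namespace named cloudbees-core (adjust the namespace and quota name to your installation):

    # List the ResourceQuotas in the namespace where agent pods are created
    kubectl get resourcequota -n cloudbees-core

    # Inspect a specific quota to compare its current usage against its hard limits
    kubectl describe resourcequota <resourceQuotaName> -n cloudbees-core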

Resolution

There are several workarounds that can help prevent agent failures caused by this issue:

Workaround 1

By default, the Jenkins NodeProvisioner makes its decisions based on load statistics and gives acceptable average queue waiting times while preventing over-provisioning.

There are, however, provisioning strategies in Jenkins that aim to speed up agent provisioning in the context of Kubernetes.

Those strategies launch agents as soon as they are needed. In a scenario where build tasks are launched in bulk, several agent pods may be scheduled almost simultaneously and trigger this ResourceQuotas issue, whereas the default strategy results in more gradual provisioning.

Therefore a workaround is to disable those provisioning strategies:

  • Disable the NoDelayProvisioningStrategy by adding the system property -Dio.jenkins.plugins.kubernetes.disableNoDelayProvisioning=true at controller startup

  • If using CloudBees Core on Modern Cloud Platform, disable the KubernetesNodeProvisionerStrategy by adding the system property -Dcom.cloudbees.jenkins.plugins.kube.KubernetesNodeProvisionerStrategy.enabled=false at controller startup

This requires a restart of the controller; see How to add Java arguments to Jenkins?
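
As a sketch only, both properties can be appended to the controller's Java options, for example through a JAVA_OPTS environment variable; the exact mechanism depends on how the controller is deployed (see the article linked above):

    # Illustrative only: append the properties to the controller's JVM options, then restart it
    JAVA_OPTS="${JAVA_OPTS} \
      -Dio.jenkins.plugins.kubernetes.disableNoDelayProvisioning=true \
      -Dcom.cloudbees.jenkins.plugins.kube.KubernetesNodeProvisionerStrategy.enabled=false"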

Disabling those strategies may result in a "slower" provisioning time overall, which could be negligible depending on the workload of the controller.

Workaround 2

Avoid scheduling too many builds in bulk. Instead of launching hundreds of tasks at once, throttle the scheduling, for example by launching tasks in smaller chunks.
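
As an illustration only, the sketch below triggers builds through the Jenkins remote API in batches of 10 with a pause in between, rather than firing all requests at once. JENKINS_URL, JOB, USER, TOKEN and the batch values are placeholders to adapt to your environment:

    # Illustrative throttling of bulk build requests against the Jenkins remote API
    for i in $(seq 1 100); do
      curl -s -X POST "$JENKINS_URL/job/$JOB/buildWithParameters?INDEX=$i" --user "$USER:$TOKEN"
      # Pause after every 10th request so agent pods are not all scheduled simultaneously
      if [ $((i % 10)) -eq 0 ]; then sleep 30; fi
    done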

Workaround 3

If acceptable and possible:

  • remove the ResourceQuotas in the namespace where agent pods are spun up (see the sketch below).

Note: This might not be possible in GKE; see https://cloud.google.com/kubernetes-engine/quotas
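
A minimal sketch, assuming the quota is not managed by the platform and that cloudbees-core is the agent namespace (adjust both names):

    kubectl delete resourcequota <resourceQuotaName> -n cloudbees-core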

Reduce / Disable Backoff period (CloudBees Core on Modern Cloud Platform)

In CloudBees Core on Modern Cloud Platform, the Kube Agent Management plugin sets up an exponential backoff period for agent provisioning failures: Jenkins waits a certain amount of time before retrying to provision a specific agent template.

When hitting this Kubernetes issue due to ResourceQuotas, the re-provisioning of the failed agent may be considerably delayed by this backoff period. In environments where this is a problem, an additional workaround is to either disable the backoff period or reduce the maximum backoff time that Jenkins waits before retrying to provision the same agent.

  • The maximum backoff period may be decreased by adding the system property -Dcom.cloudbees.jenkins.plugins.kube.common.PlannedSlaveSet.maxBackOffSeconds=<valueInSeconds> (default is 300).

  • Since version 1.1.32 of the Kube Agent Management plugin, the backoff period may be disabled by adding the system property -Dcom.cloudbees.jenkins.plugins.kube.Config.enableBackoff=false (defaults to true).
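
Like the properties in Workaround 1, these are Java system properties passed at controller startup; a minimal sketch using a JAVA_OPTS environment variable (the value of 60 is illustrative):

    # Either reduce the maximum backoff time (in seconds)...
    JAVA_OPTS="${JAVA_OPTS} -Dcom.cloudbees.jenkins.plugins.kube.common.PlannedSlaveSet.maxBackOffSeconds=60"

    # ...or disable the backoff entirely (Kube Agent Management plugin 1.1.32 or later)
    JAVA_OPTS="${JAVA_OPTS} -Dcom.cloudbees.jenkins.plugins.kube.Config.enableBackoff=false"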

This workaround does not prevent agent failures, but it prevents the re-provisioning of failed agents from being delayed.