Issue
Jenkins jobs sit in the build queue and never start, even when build agents are available for the chosen label, and the logs show stack traces similar to the one below:
SEVERE hudson.triggers.SafeTimerTask#run: Timer task hudson.model.Queue$MaintainTask@XXXX failed
How do we know what is causing this queue freeze?
Environment
- CloudBees CI (CloudBees Core) on modern cloud platforms - Managed controller
- CloudBees CI (CloudBees Core) on modern cloud platforms - Operations Center
- CloudBees CI (CloudBees Core) on traditional platforms - Client controller
- CloudBees CI (CloudBees Core) on traditional platforms - Operations Center
- CloudBees Jenkins Enterprise - Managed controller
- CloudBees Jenkins Enterprise - Operations Center
Resolution
The Queue.MaintainTask is a periodic task that runs on the instance and is responsible for queue maintenance operations such as adding elements to the queue and assigning queued items to nodes or executors. If this task fails for any reason, the queue becomes unresponsive and jobs eventually stop running because they remain stuck in the queue.
To determine the cause of the problem, pay special attention to the full stack trace of the error that shows up in the logs.
Some known causes are listed below. The intent of the list is to help you understand the pattern you can follow to determine the root cause of the failure, so that you can recover the instance as quickly as possible. Whenever possible, we also include workaround or solution details.
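As a first diagnostic step, you can also inspect the live queue from Manage Jenkins > Script Console. The following Groovy sketch only reads queue state and lists each queued item with the blockage reason Jenkins reports for it; if the MaintainTask itself is dead, the reported reasons may be incomplete, so always correlate the output with the stack trace in the logs.

import jenkins.model.Jenkins

// List every item currently sitting in the build queue together with the
// reason Jenkins reports for it not starting (if any).
def queue = Jenkins.get().getQueue()
queue.getItems().each { item ->
    def cause = item.getCauseOfBlockage()
    def reason = cause != null ? cause.getShortDescription() : 'no blockage reason reported'
    println "${item.task.getFullDisplayName()} -> ${reason}"
}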
A Nodes Plus Plugin
SEVERE hudson.triggers.SafeTimerTask#run: Timer task hudson.model.Queue$MaintainTask@XXX failed
Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from XXXX/XXX:XXX
        at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1743)
        at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
        at hudson.remoting.Channel.call(Channel.java:957)
        at hudson.Launcher$RemoteLauncher.launch(Launcher.java:1059)
        at hudson.Launcher$ProcStarter.start(Launcher.java:455)
        at com.cloudbees.jenkins.plugins.nodesplus.CustomNodeProbeBuildFilterProperty.getProbeResult(CustomNodeProbeBuildFilterProperty.java:180)
In the stack trace we can clearly see the correlation between the getProbeResult method and the task failure.
A.1 Solution/Workaround
Check your nodes for any custom node probe and disable it as an initial remediation step.
Verify that you are using cloudbees-nodes-plus 1.18 or higher, as this version added extra verification that prevents faulty custom probes from causing a queue lock.
If the issue persists after upgrading the plugin, disable any custom node probes and open a support ticket.
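Both checks can be scripted from Manage Jenkins > Script Console. The sketch below is a minimal example: it matches the probe property by class name rather than importing the plugin class directly (confirm in the node configuration screen whether the probe is exposed as a node property on your installation), and it uses cloudbees-nodes-plus as the plugin short name, as referenced above.

import jenkins.model.Jenkins

def jenkins = Jenkins.get()

// Flag nodes that carry a Nodes Plus custom probe property (matched by class name).
jenkins.getNodes().each { node ->
    node.getNodeProperties().each { prop ->
        def className = prop.getClass().getName()
        if (className.contains('nodesplus') && className.contains('CustomNodeProbe')) {
            println "Node '${node.getNodeName()}' has custom probe property: ${className}"
        }
    }
}

// Report the installed cloudbees-nodes-plus version (should be 1.18 or higher).
def plugin = jenkins.getPluginManager().getPlugin('cloudbees-nodes-plus')
println(plugin != null ? "cloudbees-nodes-plus ${plugin.getVersion()}" : 'cloudbees-nodes-plus is not installed')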
B Micro Focus Plugin
SEVERE hudson.triggers.SafeTimerTask#run: Timer task hudson.model.Queue$MaintainTask@XXX failed
java.lang.NoClassDefFoundError: Could not initialize class org.apache.logging.log4j.core.impl.Log4jLogEvent
        at org.apache.logging.log4j.core.impl.DefaultLogEventFactory.createEvent(DefaultLogEventFactory.java:54)
        at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:401)
        at org.apache.logging.log4j.core.config.DefaultReliabilityStrategy.log(DefaultReliabilityStrategy.java:49)
        at org.apache.logging.log4j.core.Logger.logMessage(Logger.java:146)
        at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2116)
        at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2100)
        at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:1994)
        at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1966)
        at org.apache.logging.log4j.spi.AbstractLogger.error(AbstractLogger.java:739)
        at com.microfocus.application.automation.tools.octane.events.WorkflowListenerOctaneImpl.onNewHead(WorkflowListenerOctaneImpl.java:79)
In this case the exception thrown is different, but the effect is the same: the periodic task starts failing.
B.1 Solution/Workaround
For the Micro Focus plugin, the recommended workaround is to upgrade the plugin to version 5.6.2 or later, as previous versions of the plugin are also impacted by JENKINS-6070.
C Build Blocker Plugin
SEVERE hudson.triggers.SafeTimerTask#run: Timer task hudson.model.Queue$MaintainTask@XXX failed
java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0
*.XXX.*XXX
^
        at java.util.regex.Pattern.error(Pattern.java:1957)
        at java.util.regex.Pattern.sequence(Pattern.java:2125)
        at java.util.regex.Pattern.expr(Pattern.java:1998)
        at java.util.regex.Pattern.compile(Pattern.java:1698)
        at java.util.regex.Pattern.<init>(Pattern.java:1351)
        at java.util.regex.Pattern.compile(Pattern.java:1028)
        at java.util.regex.Pattern.matches(Pattern.java:1133)
        at java.lang.String.matches(String.java:2121)
        at hudson.plugins.buildblocker.BlockingJobsMonitor.checkForPlannedBuilds(BlockingJobsMonitor.java:162)
        at hudson.plugins.buildblocker.BlockingJobsMonitor.checkForQueueEntries(BlockingJobsMonitor.java:86)
        at hudson.plugins.buildblocker.BuildBlockerQueueTaskDispatcher.checkAccordingToProperties(BuildBlockerQueueTaskDispatcher.java:151)
Again, the stack trace allows us to determine the source of the problem affecting the queue.
C.1 Solution/Workaround
If you find yourself impacted by this issue, the remediation is either to verify that the job referenced in the stack trace uses a valid regular expression in its Build Blocker configuration, or to disable the plugin entirely. As this is an old plugin whose last release was about five years ago, we recommend the latter.
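To confirm whether a configured expression is the culprit before touching the plugin, you can pre-validate it with the same java.util.regex engine the plugin uses. A minimal Script Console sketch, assuming you paste in the blocking job expressions from the job configuration (the values below are placeholders):

import java.util.regex.Pattern
import java.util.regex.PatternSyntaxException

// Replace these sample values with the blocking job expressions from the job configuration.
def expressions = ['.*deploy.*', '*.XXX.*XXX']

expressions.each { expr ->
    try {
        Pattern.compile(expr)
        println "OK      : ${expr}"
    } catch (PatternSyntaxException e) {
        println "INVALID : ${expr} -> ${e.getMessage()}"
    }
}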
D Pipeline Graph Analysis Plugin
SEVERE hudson.triggers.SafeTimerTask#run: Timer task hudson.model.Queue$MaintainTask@XXX failed
java.lang.IndexOutOfBoundsException: Index: 0
        at java.util.Collections$EmptyList.get(Collections.java:4454)
        at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.bruteForceScanForEnclosingBlock(StandardGraphLookupView.java:150)
        at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findEnclosingBlockStart(StandardGraphLookupView.java:197)
        at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findAllEnclosingBlockStarts(StandardGraphLookupView.java:217)
D.1 Solution/Workaround
The solution for this error is to upgrade the workflow-api plugin to version 2.35 or higher. This ensures that you have the fix that prevents the edge case triggering this issue.
E Block Queued Job Plugin
SEVERE hudson.triggers.SafeTimerTask#run: Timer task hudson.model.Queue$MaintainTask@XXX failed
java.lang.NullPointerException
        at org.jenkinsci.plugins.blockqueuedjob.condition.JobResultBlockQueueCondition.isBlocked(JobResultBlockQueueCondition.java:70)
        at org.jenkinsci.plugins.blockqueuedjob.BlockItemQueueTaskDispatcher.canRun(BlockItemQueueTaskDispatcher.java:35)
        at hudson.model.Queue.getCauseOfBlockageForItem(Queue.java:1197)
E.1 Solution/Workaround
The solution for this error is to disable the plugin. It was last released about five years ago and does not have many installations, so disabling it is the most direct way to solve the problem.
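If you prefer to disable it from the Script Console rather than through the plugin manager UI, a sketch is shown below. The short name block-queued-job is an assumption; verify it under Manage Jenkins > Plugins before running, and note that the change only takes effect after a restart.

import jenkins.model.Jenkins

// Disable the Block Queued Job plugin (assumed short name 'block-queued-job').
def plugin = Jenkins.get().getPluginManager().getPlugin('block-queued-job')
if (plugin != null) {
    plugin.disable()  // takes effect after the controller is restarted
    println "Disabled ${plugin.getShortName()} ${plugin.getVersion()}"
} else {
    println 'Plugin block-queued-job not found'
}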
F Kubernetes Plugin
The queue is blocked and no builds are being processed. Shortly after, the instance goes down. After capturing a thread dump for the instance, you get a stack trace similar to the one shown below:
java.lang.Object.wait(Native Method)
hudson.remoting.Request.call(Request.java:177)
hudson.remoting.Channel.call(Channel.java:954)
org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave._terminate(KubernetesSlave.java:236)
hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:67)
hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:59)
hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:43)
hudson.slaves.SlaveComputer$4.run(SlaveComputer.java:843)
hudson.model.Queue._withLock(Queue.java:1380)
The queue is locked due to the KubernetesSlave._terminate() call.
F.1 Solution/Workaround
This is a known issue in the Kubernetes plugin, reported in JENKINS-54988.
The fix for this issue was released in Kubernetes plugin 1.21.1.