Issue
Jenkins jobs sit in the build queue and never start, even when build agents are available for the chosen label, and the logs show stack traces similar to the one below:
SEVERE hudson.triggers.SafeTimerTask#run: Timer task hudson.model.Queue$MaintainTask@XXXX failed
How do we know what is causing this queue freeze?
Environment
- CloudBees CI (CloudBees Core) on modern cloud platforms - Managed controller
- CloudBees CI (CloudBees Core) on modern cloud platforms - Operations Center
- CloudBees CI (CloudBees Core) on traditional platforms - Client controller
- CloudBees CI (CloudBees Core) on traditional platforms - Operations Center
- CloudBees Jenkins Enterprise - Managed controller
- CloudBees Jenkins Enterprise - Operations Center
Resolution
The Queue.MaintainTask is a periodic task that runs on the instance and is responsible for queue maintenance operations such as adding elements to the queue and assigning queued items to nodes or executors. If this task fails for any reason, the queue becomes unresponsive and jobs eventually stop running because they remain stuck in the queue.
To determine the cause of the problem, pay special attention to the full stack trace of the error that shows up in the logs.
Some known causes are listed below. The intent of the list is to help you understand the pattern you can follow to determine the root cause of the failure, so that you can recover the instance as quickly as possible. Whenever possible, we also include workaround or solution details.
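As a first diagnostic step, you can also inspect the live queue from Manage Jenkins > Script Console. The following Groovy sketch only reads queue state and lists each queued item with the blockage reason Jenkins reports for it; if the MaintainTask itself is dead, the reported reasons may be incomplete, so always correlate the output with the stack trace in the logs.

import jenkins.model.Jenkins

// List every item currently sitting in the build queue together with the
// reason Jenkins reports for it not starting (if any).
def queue = Jenkins.get().getQueue()
queue.getItems().each { item ->
    def cause = item.getCauseOfBlockage()
    def reason = cause != null ? cause.getShortDescription() : 'no blockage reason reported'
    println "${item.task.getFullDisplayName()} -> ${reason}"
}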
A Nodes Plus Plugin
SEVERE hudson.triggers.SafeTimerTask#run: Timer task hudson.model.Queue$MaintainTask@XXX failed
Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from XXXX/XXX:XXX
        at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1743)
        at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
        at hudson.remoting.Channel.call(Channel.java:957)
        at hudson.Launcher$RemoteLauncher.launch(Launcher.java:1059)
        at hudson.Launcher$ProcStarter.start(Launcher.java:455)
        at com.cloudbees.jenkins.plugins.nodesplus.CustomNodeProbeBuildFilterProperty.getProbeResult(CustomNodeProbeBuildFilterProperty.java:180)
In the stack trace we can clearly see the correlation between the getProbeResult method and the task failure.
A.1 Solution/Workaround
Check your nodes for any custom node probe and disable it as an initial remediation step.
Verify that you are using cloudbees-nodes-plus 1.18 or higher, as this version added extra verification that prevents faulty custom probes from causing a queue lock.
If the issue persists after upgrading the plugin, disable any custom node probes and open a support ticket.
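Both checks can be scripted from Manage Jenkins > Script Console. The sketch below is a minimal example: it matches the probe property by class name rather than importing the plugin class directly (confirm in the node configuration screen whether the probe is exposed as a node property on your installation), and it uses cloudbees-nodes-plus as the plugin short name, as referenced above.

import jenkins.model.Jenkins

def jenkins = Jenkins.get()

// Flag nodes that carry a Nodes Plus custom probe property (matched by class name).
jenkins.getNodes().each { node ->
    node.getNodeProperties().each { prop ->
        def className = prop.getClass().getName()
        if (className.contains('nodesplus') && className.contains('CustomNodeProbe')) {
            println "Node '${node.getNodeName()}' has custom probe property: ${className}"
        }
    }
}

// Report the installed cloudbees-nodes-plus version (should be 1.18 or higher).
def plugin = jenkins.getPluginManager().getPlugin('cloudbees-nodes-plus')
println(plugin != null ? "cloudbees-nodes-plus ${plugin.getVersion()}" : 'cloudbees-nodes-plus is not installed')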
B Micro Focus Plugin
SEVERE hudson.triggers.SafeTimerTask#run: Timer task hudson.model.Queue$MaintainTask@XXX failed
java.lang.NoClassDefFoundError: Could not initialize class org.apache.logging.log4j.core.impl.Log4jLogEvent
        at org.apache.logging.log4j.core.impl.DefaultLogEventFactory.createEvent(DefaultLogEventFactory.java:54)
        at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:401)
        at org.apache.logging.log4j.core.config.DefaultReliabilityStrategy.log(DefaultReliabilityStrategy.java:49)
        at org.apache.logging.log4j.core.Logger.logMessage(Logger.java:146)
        at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2116)
        at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2100)
        at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:1994)
        at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1966)
        at org.apache.logging.log4j.spi.AbstractLogger.error(AbstractLogger.java:739)
        at com.microfocus.application.automation.tools.octane.events.WorkflowListenerOctaneImpl.onNewHead(WorkflowListenerOctaneImpl.java:79)
In this case the exception thrown is different, but the effect is the same: the periodic task starts failing.
B.1 Solution/Workaround
For the Micro Focus plugin, the recommended workaround is to upgrade the plugin to version 5.6.2 or later, as previous versions of the plugin are also impacted by JENKINS-6070.
C Build Blocker Plugin
SEVERE hudson.triggers.SafeTimerTask#run: Timer task hudson.model.Queue$MaintainTask@XXX failed
java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0
*.XXX.*XXX
^
        at java.util.regex.Pattern.error(Pattern.java:1957)
        at java.util.regex.Pattern.sequence(Pattern.java:2125)
        at java.util.regex.Pattern.expr(Pattern.java:1998)
        at java.util.regex.Pattern.compile(Pattern.java:1698)
        at java.util.regex.Pattern.<init>(Pattern.java:1351)
        at java.util.regex.Pattern.compile(Pattern.java:1028)
        at java.util.regex.Pattern.matches(Pattern.java:1133)
        at java.lang.String.matches(String.java:2121)
        at hudson.plugins.buildblocker.BlockingJobsMonitor.checkForPlannedBuilds(BlockingJobsMonitor.java:162)
        at hudson.plugins.buildblocker.BlockingJobsMonitor.checkForQueueEntries(BlockingJobsMonitor.java:86)
        at hudson.plugins.buildblocker.BuildBlockerQueueTaskDispatcher.checkAccordingToProperties(BuildBlockerQueueTaskDispatcher.java:151)
Again, the stack trace allows us to determine the source of the problem affecting the queue.
C.1 Solution/Workaround
If you find yourself impacted by this issue, the remediation is either to verify that the job referenced in the stack trace uses a valid regular expression in its Build Blocker configuration, or to disable the plugin entirely. As this is an old plugin whose last release was about five years ago, we recommend the latter.
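To confirm whether a configured expression is the culprit before touching the plugin, you can pre-validate it with the same java.util.regex engine the plugin uses. A minimal Script Console sketch, assuming you paste in the blocking job expressions from the job configuration (the values below are placeholders):

import java.util.regex.Pattern
import java.util.regex.PatternSyntaxException

// Replace these sample values with the blocking job expressions from the job configuration.
def expressions = ['.*deploy.*', '*.XXX.*XXX']

expressions.each { expr ->
    try {
        Pattern.compile(expr)
        println "OK      : ${expr}"
    } catch (PatternSyntaxException e) {
        println "INVALID : ${expr} -> ${e.getMessage()}"
    }
}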
D Pipeline Graph Analysis Plugin
SEVERE hudson.triggers.SafeTimerTask#run: Timer task hudson.model.Queue$MaintainTask@XXX failed
java.lang.IndexOutOfBoundsException: Index: 0
        at java.util.Collections$EmptyList.get(Collections.java:4454)
        at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.bruteForceScanForEnclosingBlock(StandardGraphLookupView.java:150)
        at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findEnclosingBlockStart(StandardGraphLookupView.java:197)
        at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findAllEnclosingBlockStarts(StandardGraphLookupView.java:217)
D.1 Solution/Workaround
The solution for this error is to upgrade the workflow-api plugin to version 2.35 or higher. This ensures that you have the fix that prevents the edge case triggering this issue.
E Block Queued Job Plugin
SEVERE hudson.triggers.SafeTimerTask#run: Timer task hudson.model.Queue$MaintainTask@XXX failed
java.lang.NullPointerException
        at org.jenkinsci.plugins.blockqueuedjob.condition.JobResultBlockQueueCondition.isBlocked(JobResultBlockQueueCondition.java:70)
        at org.jenkinsci.plugins.blockqueuedjob.BlockItemQueueTaskDispatcher.canRun(BlockItemQueueTaskDispatcher.java:35)
        at hudson.model.Queue.getCauseOfBlockageForItem(Queue.java:1197)
E.1 Solution/Workaround
The solution for this error is to disable the plugin. It was last released about five years ago and does not have many installations, so disabling it is the most direct way to solve the problem.
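If you prefer to disable it from the Script Console rather than through the plugin manager UI, a sketch is shown below. The short name block-queued-job is an assumption; verify it under Manage Jenkins > Plugins before running, and note that the change only takes effect after a restart.

import jenkins.model.Jenkins

// Disable the Block Queued Job plugin (assumed short name 'block-queued-job').
def plugin = Jenkins.get().getPluginManager().getPlugin('block-queued-job')
if (plugin != null) {
    plugin.disable()  // takes effect after the controller is restarted
    println "Disabled ${plugin.getShortName()} ${plugin.getVersion()}"
} else {
    println 'Plugin block-queued-job not found'
}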
F Kubernetes Plugin
The queue is blocked and no builds are being processed. Shortly after, the instance goes down. After capturing a thread dump for the instance, you get a stack trace similar to the one shown below:
java.lang.Object.wait(Native Method)
hudson.remoting.Request.call(Request.java:177)
hudson.remoting.Channel.call(Channel.java:954)
org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave._terminate(KubernetesSlave.java:236)
hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:67)
hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:59)
hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:43)
hudson.slaves.SlaveComputer$4.run(SlaveComputer.java:843)
hudson.model.Queue._withLock(Queue.java:1380)
The queue is locked due to the KubernetesSlave._terminate() call.
F.1 Solution/Workaround
This is a known issue in the Kubernetes plugin, reported in JENKINS-54988.
The fix for this issue was released in Kubernetes plugin 1.21.1.