Significant thread spikes caused by Multibranch Workspace Cleanup

Issue

After an Organization Scan or Branch Indexing, the controller becomes slow or even unresponsive.

When collecting a thread dump, we see hundreds of Computer.threadPoolForRemoting threads executing jenkins.branch.WorkspaceLocatorImpl$Deleter$CleanupTask:

"Computer.threadPoolForRemoting [#1234]: deleting workspace for myMultibranch/myBranch in /path/to/workspace/of/myMultibranch/myBranch on myAgent"
    at jenkins.branch.WorkspaceLocatorImpl.locate(WorkspaceLocatorImpl.java:179)
    at jenkins.branch.WorkspaceLocatorImpl.locate(WorkspaceLocatorImpl.java:150)
    at jenkins.branch.WorkspaceLocatorImpl.access$100(WorkspaceLocatorImpl.java:86)
    at jenkins.branch.WorkspaceLocatorImpl$Deleter$CleanupTask.run(WorkspaceLocatorImpl.java:469)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
    at java.base@11.0.16.1/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base@11.0.16.1/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base@11.0.16.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base@11.0.16.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base@11.0.16.1/java.lang.Thread.run(Thread.java:829)
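
To gauge how many of these threads are present, the matching stack frames in a saved thread dump can be counted. The following is a minimal sketch, assuming the dump was saved as plain text; the function name count_cleanup_threads and the file name threaddump.txt are hypothetical.

```shell
# Count threads that are running a branch workspace cleanup task.
# Each such thread contributes one CleanupTask.run frame to the dump.
# -F treats the pattern as a fixed string, so "$" and "." are literal;
# the single quotes stop the shell from expanding "$Deleter".
count_cleanup_threads() {
    grep -cF 'WorkspaceLocatorImpl$Deleter$CleanupTask.run' "$1"
}

# Example usage (hypothetical file name):
#   count_cleanup_threads threaddump.txt
```

A count in the hundreds shortly after a scan is consistent with the symptom described above.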

Environment

Related Issue(s)

Explanation

When a branch job of a Multibranch Pipeline is deleted, commonly as a result of the orphaned item strategy applied by a Branch Indexing scan, cleanup tasks are scheduled asynchronously to delete any workspace that this branch might have on each currently connected node. The tasks are scheduled on the unbounded thread pool Computer.threadPoolForRemoting, with one thread created per deleted branch per node. Depending on the environment, this can cause very large thread spikes and, consequently, significant slowdown and performance degradation.
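
To illustrate how quickly this multiplies, here is a rough estimate of the number of cleanup tasks a single scan might schedule. Both input figures are hypothetical examples, not measurements from any particular environment.

```shell
# One cleanup task is scheduled per deleted branch per connected node.
# Hypothetical figures for illustration only.
deleted_branches=20
connected_nodes=150

peak_tasks=$((deleted_branches * connected_nodes))
echo "Up to ${peak_tasks} cleanup tasks may be scheduled"   # prints 3000 for these inputs
```

With an unbounded pool, each of those tasks can occupy its own thread at the same time.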

Resolution

Since version 2.1071.v1a_188a_562481 of branch-api (included in CloudBees CI since version 2.375.3.3), the system property -Djenkins.branch.WorkspaceLocatorImpl\$Deleter.CLEANUP_THREAD_LIMIT can be used to limit the number of concurrent threads for branch job workspace cleanup. The default is 0, i.e. unlimited.

The solution is to limit the number of threads for those tasks by setting the system property -Djenkins.branch.WorkspaceLocatorImpl\$Deleter.CLEANUP_THREAD_LIMIT to a positive value.
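
As a sketch of how the property might be passed to the JVM: the example below builds the argument and appends it to JAVA_OPTS. It assumes the installation reads JVM arguments from the JAVA_OPTS environment variable, which depends on how the controller is started. The value 100 and the variable names CLEANUP_THREAD_LIMIT and LIMIT_ARG are illustrative only.

```shell
# Hypothetical limit; choose a value that suits the environment.
CLEANUP_THREAD_LIMIT=100

# Inside double quotes, "\$" keeps the shell from expanding "$Deleter" as a variable,
# so the JVM receives the literal property name containing "$".
LIMIT_ARG="-Djenkins.branch.WorkspaceLocatorImpl\$Deleter.CLEANUP_THREAD_LIMIT=${CLEANUP_THREAD_LIMIT}"

# Append to any existing JVM options (assumes JAVA_OPTS is what the startup script reads).
JAVA_OPTS="${JAVA_OPTS:+${JAVA_OPTS} }${LIMIT_ARG}"
export JAVA_OPTS

echo "$LIMIT_ARG"
```

The controller must be restarted for a new JVM argument to take effect.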

The appropriate value depends on the environment. In most environments where this happens, the number of nodes is the dominant factor. However, most threads will be short-lived, because only the few nodes that have built the relevant job contain a workspace for it. On all other nodes, the task quickly detects that no workspace exists and exits. In general, a good value is in the hundreds or lower, not in the thousands.

It was recently discovered that the limit set by -Djenkins.branch.WorkspaceLocatorImpl\$Deleter.CLEANUP_THREAD_LIMIT is not fully effective: the limit may increase over time after Jenkins has started. See JENKINS-74975 ("WorkspaceLocatorImpl$Deleter.CLEANUP_THREAD_LIMIT is not effective") for more details and an upcoming fix.