Performance issues caused by thread contention on PluginImpl.getSlaveManifest


Issue

  • CloudBees CI controller becomes unresponsive

  • Thread dumps and slow requests reveal BLOCKED threads on com.cloudbees.opscenter.client.cloud.PluginImpl.getSlaveManifest. Typically:

    Computer.threadPoolForRemoting [#12345]
       java.lang.Thread.State: BLOCKED (on object monitor)
        at com.cloudbees.opscenter.client.cloud.PluginImpl.getSlaveManifest(PluginImpl.java:222)
        - waiting to lock <0x0000000123456a12> (a com.cloudbees.opscenter.client.cloud.PluginImpl)

  • BEE-28991: Remove usage of mapDB in Operations Center Cloud

  • BEE-48985: Un-necessary synchronization in PluginImpl.getSlaveManifest
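
To check quickly whether a captured thread dump shows this symptom, a small script can count the BLOCKED threads that mention the method. The following is a minimal sketch, not a CloudBees tool: it assumes a jstack-style text dump in which thread entries are separated by blank lines, and the function name count_blocked_on_manifest is hypothetical.

```python
import re

# Frame named in the Issue section of this article.
TARGET_FRAME = "com.cloudbees.opscenter.client.cloud.PluginImpl.getSlaveManifest"


def count_blocked_on_manifest(thread_dump: str) -> int:
    """Count thread entries that are BLOCKED and mention the target frame.

    Assumption: entries in the dump are separated by blank lines,
    as in typical jstack output.
    """
    count = 0
    for entry in re.split(r"\n\s*\n", thread_dump):
        if "java.lang.Thread.State: BLOCKED" in entry and TARGET_FRAME in entry:
            count += 1
    return count
```

A count that stays high across several dumps taken a few seconds apart suggests sustained contention rather than a momentary wait.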

Explanation

Starting with version 2.387.2.3 of CloudBees CI, the manifests of leased shared agents are no longer persisted in a local MapDB database but in an XML file, $JENKINS_HOME/com.cloudbees.opscenter.client.cloud.XmlSlaveManifestImpl.xml, on the controller. This change introduced a performance issue because access to the manifest is now heavily guarded by object synchronization. The impact can be significant and depends mainly on how heavily Shared Agents are used, but also on the performance of the underlying file system.
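
The effect of a single lock around slow file access can be illustrated with a toy model. The sketch below is not the plugin's code: ManifestStore, get_manifest, and time_concurrent_reads are hypothetical names, and time.sleep stands in for file system latency. It only shows that, when every caller must take the same lock, the total wait grows with both the number of callers and the cost of each access.

```python
import threading
import time


class ManifestStore:
    """Hypothetical stand-in for a manifest guarded by a single lock."""

    def __init__(self, io_delay_s: float):
        self._lock = threading.Lock()
        self._io_delay_s = io_delay_s

    def get_manifest(self) -> str:
        # Only one caller at a time can be inside this block.
        with self._lock:
            time.sleep(self._io_delay_s)  # simulated file access
            return "manifest"


def time_concurrent_reads(store: ManifestStore, callers: int) -> float:
    """Start `callers` threads that each read once; return elapsed seconds."""
    threads = [threading.Thread(target=store.get_manifest) for _ in range(callers)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start
```

With a simulated delay of 20 ms and 5 callers, the elapsed time is roughly 5 × 20 ms, because the reads run one after another instead of in parallel.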

Resolution

The solution is to upgrade CloudBees CI to version 2.452.1.2 or later.
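
To check whether a given controller version already includes the fix, the version string can be compared numerically with 2.452.1.2. This is a minimal sketch that assumes purely numeric, dot-separated version strings; parse_version and has_fix are hypothetical helper names, and a string with a non-numeric suffix would raise ValueError.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted numeric version such as '2.452.1.2' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


# First version that contains the fix, per the Resolution section.
FIXED_IN = parse_version("2.452.1.2")


def has_fix(version: str) -> bool:
    """Return True if `version` is 2.452.1.2 or later."""
    return parse_version(version) >= FIXED_IN
```

For example, has_fix("2.387.2.3") is False, while has_fix("2.452.1.2") is True.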

The Job Config History plugin can exacerbate the problem because it records changes to $JENKINS_HOME/com.cloudbees.opscenter.client.cloud.XmlSlaveManifestImpl.xml. Add this file to the exclude pattern list in the Job Config History plugin configuration. See JobConfigHistory Plugin Best Practices.