After upgrading CloudBees CI on modern platforms, the operations center / managed controller pods are evicted due to ephemeral storage usage


Issue

After upgrading CloudBees CI on modern platforms, the operations center and/or managed controllers restart unexpectedly. The node or pod description shows:

The node was low on resource: ephemeral-storage. Threshold quantity: ..., available: .... Container jenkins was using ..., request is 0, has larger consumption of ephemeral-storage.

Explanation

Starting with version 2.401.3.3, the CloudBees CI on modern platforms container images automatically set the following system properties on startup:

jenkins.plugins.git.AbstractGitSCMSource.cacheRootDir=/tmp/jenkins/caches/git
org.jenkinsci.plugins.github_branch_source.GitHubSCMSource.cacheRootDir=/tmp/jenkins/caches/github-branch-source

This is a requirement for High Availability (active/active) controllers, to avoid corruption issues. Because the change was made in the container image, it applies to any kind of managed controller.

This moves the Git plugin cache and the GitHub Branch Source okhttp cache from JENKINS_HOME to the /tmp directory. Although this has little impact in most environments, it can cause stability issues due to ephemeral storage usage, especially in environments that use Git SCM and require lightweight checkout of large Git repositories.

Resolution

The solution is to add the following System Properties to the operations center / managed controller configuration to restore the default behavior of those features:

jenkins.plugins.git.AbstractGitSCMSource.cacheRootDir=/var/jenkins_home/caches
org.jenkinsci.plugins.github_branch_source.GitHubSCMSource.cacheRootDir=/var/jenkins_home/org.jenkinsci.plugins.github_branch_source.GitHubSCMProbe.cache

If these cache directories are put back into /var/jenkins_home using these system properties, ensure that your backups omit these directories.
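
To make the backup exclusion concrete, here is a minimal sketch of a path check that a custom backup script could use to skip the two cache directories. The function name is_excluded_from_backup and the idea of a custom script are assumptions for illustration only. If backups are handled by a dedicated backup tool, use that tool's own exclusion settings instead.

```python
from pathlib import PurePosixPath

# The two cache locations restored under /var/jenkins_home by the
# system properties discussed in this article.
EXCLUDED_CACHE_DIRS = (
    PurePosixPath("/var/jenkins_home/caches"),
    PurePosixPath(
        "/var/jenkins_home/"
        "org.jenkinsci.plugins.github_branch_source.GitHubSCMProbe.cache"
    ),
)


def is_excluded_from_backup(path):
    """Return True if path is one of the cache directories or lies inside one.

    Hypothetical helper: the comparison is done on whole path components,
    so a sibling such as /var/jenkins_home/caches-other is not excluded.
    """
    candidate = PurePosixPath(path)
    return any(
        candidate == cache_dir or cache_dir in candidate.parents
        for cache_dir in EXCLUDED_CACHE_DIRS
    )
```

A backup script would call is_excluded_from_backup on each absolute path it visits and skip the ones for which it returns True.
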
If, despite this change, operations center / managed controller pods are still being evicted due to low available ephemeral storage on the node, the problem is not related to the change discussed here. Usage of the /tmp directory needs to be evaluated (through monitoring and investigation of the files in /tmp), and ephemeral-storage resource requests / limits may be applied as a solution. See Adding ephemeral storage requests and limits to a managed controller.
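
As a starting point for investigating the files in /tmp, the following sketch totals the size of each top-level entry in a directory and lists the largest ones. It assumes a Python interpreter is available wherever the directory can be read, which may not be the case inside the controller container. The function names directory_usage and largest are illustrative and are not part of any CloudBees tooling.

```python
import os


def directory_usage(root):
    """Return {top-level entry name: size in bytes} for the entries in root.

    Directories are walked recursively; symbolic links are not followed.
    Files that disappear or cannot be read during the scan are skipped.
    """
    usage = {}
    for entry in os.scandir(root):
        try:
            if entry.is_dir(follow_symlinks=False):
                total = 0
                for dirpath, _dirnames, filenames in os.walk(entry.path):
                    for name in filenames:
                        try:
                            total += os.lstat(os.path.join(dirpath, name)).st_size
                        except OSError:
                            pass  # file vanished or is unreadable
                usage[entry.name] = total
            else:
                usage[entry.name] = entry.stat(follow_symlinks=False).st_size
        except OSError:
            pass  # entry vanished or is unreadable
    return usage


def largest(usage, count=10):
    """Return the count largest (name, bytes) pairs, biggest first."""
    return sorted(usage.items(), key=lambda item: item[1], reverse=True)[:count]


if __name__ == "__main__":
    root = "/tmp"
    if os.path.isdir(root):
        for name, size in largest(directory_usage(root)):
            print(f"{size / (1024 * 1024):10.1f} MiB  {name}")
```

Running the same report against /tmp/jenkins/caches shows how much of the usage comes from the Git and GitHub Branch Source caches described in the Explanation section.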

Tested product/plugin versions