Issue
- A large number of files in a volume causes a timeout when trying to attach that volume
- There is a timeout when waiting for volumes to attach when using Kubernetes
- There is a timeout when trying to attach volumes to a Jenkins instance or master
- A warning or an event similar to the one below appears when trying to attach a volume:

Unable to mount volumes for pod "<POD_IDENTIFIER>": timeout expired waiting for volumes to attach or mount for pod "jenkins"/"<POD_NAME>". list of unmounted volumes=[output]. list of unattached volumes=[output <VOLUMES>]
Environment
- CloudBees CI (CloudBees Core) on modern cloud platforms - Managed controller
- CloudBees CI (CloudBees Core) on modern cloud platforms - Operations Center
- CloudBees CI (CloudBees Core) on traditional platforms - Client controller
- CloudBees CI (CloudBees Core) on traditional platforms - Operations Center
- CloudBees Jenkins Enterprise
- CloudBees Jenkins Enterprise - Managed controller
- CloudBees Jenkins Enterprise - Operations center
Explanation
Since CloudBees CI pods run as non-root users, the operations center pod and controller pods have fsGroup set to the Jenkins group 1000 by default, so that the volume is writable by the pod user.
With this setting, Kubernetes checks and changes the ownership and permissions of all files and directories of each volume mounted to the pod. When a volume is or becomes very large, this can take a long time and slow down pod startup.
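To illustrate why this is slow, the sketch below mimics (it does not reproduce) the kubelet's recursive fsGroup pass: a recursive permission change over a directory whose cost grows linearly with the number of files. The directory and file names are illustrative.

```shell
# Local illustration (assumption: this mimics, not reproduces, the kubelet's
# recursive fsGroup pass). Create a directory with many files, then run a
# recursive permission change over it - the cost grows with the file count.
VOLDIR=$(mktemp -d)
for i in $(seq 1 1000); do : > "$VOLDIR/file$i"; done

# Roughly what the kubelet does for every file when fsGroup is set:
chmod -R g+rwX "$VOLDIR"
echo "processed $(ls "$VOLDIR" | wc -l) files"
```

On a volume holding hundreds of thousands of build artifacts, the same pass can exceed the pod's mount timeout.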
When impacted, this shows up as multiple occurrences of timeout expired waiting for volumes to attach or mount for pod in the pod events. However, the same message can also appear when the volume backend is not yet provisioned and attached to the host.
Kubernetes 1.20 introduced a beta feature, the file system change policy (fsGroupChangePolicy), that can reduce the time it takes to set the permissions. See Configure volume permission and ownership change policy for Pods and Kubernetes 1.20: Granular Control of Volume Permission Changes. This feature is stable since Kubernetes 1.23.
Check if Pod is impacted
When seeing this timeout, first check that the volume is correctly provisioned and attached to the host - for example on AWS / EKS, check that the EBS volume is attached to the host.
- If the volume is attached, then this is most likely the problem described here.
- If the volume is not attached, then this is a different problem related to the provisioning of the external storage, and fsGroup is most likely not involved.
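The check above can be sketched with kubectl (a sketch only - the pod, namespace, and PVC names are placeholders to adapt to your installation):

```shell
# Look for FailedMount / timeout events on the affected pod (placeholders):
kubectl describe pod <POD_NAME> -n <NAMESPACE>

# Find the PersistentVolume backing the controller's PVC, then check whether
# it is attached to a node (CSI volumes have a VolumeAttachment object):
kubectl get pvc <PVC_NAME> -n <NAMESPACE> -o jsonpath='{.spec.volumeName}'
kubectl get volumeattachment | grep <PV_NAME>
```

If a VolumeAttachment exists and reports ATTACHED as true while the pod still times out, the delay is most likely the fsGroup permission pass.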
Solution
When impacted, the recommended solution is to set fsGroupChangePolicy: OnRootMismatch. This can be done by adding a YAML snippet like the following to the configuration of the impacted managed controller(s) (from the operations center, click the gear icon for a controller, then Configure -> YAML):
---
apiVersion: apps/v1
kind: StatefulSet
spec:
  template:
    spec:
      securityContext:
        fsGroupChangePolicy: "OnRootMismatch"
The same configuration can be applied in the operations center under Manage Jenkins > Configure System > Kubernetes controller Provisioning > Advanced > YAML. This setting applies to newly created controllers only (including Team controllers).
Workaround
The workaround for the problem is to remove the fsGroup. Go to the controller item configuration, leave the FS Group field empty, and save. Then restart the controller from the operations center.
It is required to keep the fsGroup set to 1000 on volume creation - such as when creating a controller. It is safe to remove the fsGroup of existing volumes. The recommended strategy is to remove fsGroup only when impacted by this particular problem.