Issue
- A large number of files in a volume causes a timeout when trying to attach that volume
- There is a timeout when waiting for volumes to attach when using Kubernetes
- There is a timeout when trying to attach volumes to a Jenkins instance or master
- A warning or an event similar to the one below appears when trying to attach a volume:

Unable to mount volumes for pod "<POD_IDENTIFIER>": timeout expired waiting for volumes to attach or mount for pod "jenkins"/"<POD_NAME>". list of unmounted volumes=[output]. list of unattached volumes=[output <VOLUMES>]
Environment
- CloudBees CI (CloudBees Core) on modern cloud platforms - Managed controller
- CloudBees CI (CloudBees Core) on modern cloud platforms - Operations Center
- CloudBees CI (CloudBees Core) on traditional platforms - Client controller
- CloudBees CI (CloudBees Core) on traditional platforms - Operations Center
- CloudBees Jenkins Enterprise
- CloudBees Jenkins Enterprise - Managed controller
- CloudBees Jenkins Enterprise - Operations center
Explanation
Since CloudBees CI pods run as non-root users, the operations center pod and controller pods have fsGroup set to the Jenkins group 1000 by default, so that the volume is writable by the pod user.
With this setting, Kubernetes checks and changes the ownership and permissions of all files and directories of each volume mounted to the pod. When a volume is or becomes very large, this can take a long time and slow down pod startup.
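To illustrate why this is slow, the sketch below mimics (it does not reproduce) the kubelet's recursive fsGroup pass: a recursive permission change over a directory whose cost grows linearly with the number of files. The directory and file names are illustrative.

```shell
# Local illustration (assumption: this mimics, not reproduces, the kubelet's
# recursive fsGroup pass). Create a directory with many files, then run a
# recursive permission change over it - the cost grows with the file count.
VOLDIR=$(mktemp -d)
for i in $(seq 1 1000); do : > "$VOLDIR/file$i"; done

# Roughly what the kubelet does for every file when fsGroup is set:
chmod -R g+rwX "$VOLDIR"
echo "processed $(ls "$VOLDIR" | wc -l) files"
```

On a volume holding hundreds of thousands of build artifacts, the same pass can exceed the pod's mount timeout.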
When impacted, this shows up as multiple occurrences of timeout expired waiting for volumes to attach or mount for pod in the pod events. However, the same message can also appear when the volume backend is not yet provisioned and attached to the host.
Kubernetes 1.20 introduced a beta feature, the file system change policy (fsGroupChangePolicy), that can reduce the time it takes to set the permissions. See Configure volume permission and ownership change policy for Pods and Kubernetes 1.20: Granular Control of Volume Permission Changes. This feature is stable since Kubernetes 1.23.
Check if Pod is impacted
When seeing this timeout, first check that the volume is correctly provisioned and attached to the host - for example on AWS / EKS, check that the EBS volume is attached to the host.
- If the volume is attached, then this is most likely the problem described here.
- If the volume is not attached, then this is a different problem related to the provisioning of the external storage, and fsGroup is most likely not involved.
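The check above can be sketched with kubectl (a sketch only - the pod, namespace, and PVC names are placeholders to adapt to your installation):

```shell
# Look for FailedMount / timeout events on the affected pod (placeholders):
kubectl describe pod <POD_NAME> -n <NAMESPACE>

# Find the PersistentVolume backing the controller's PVC, then check whether
# it is attached to a node (CSI volumes have a VolumeAttachment object):
kubectl get pvc <PVC_NAME> -n <NAMESPACE> -o jsonpath='{.spec.volumeName}'
kubectl get volumeattachment | grep <PV_NAME>
```

If a VolumeAttachment exists and reports ATTACHED as true while the pod still times out, the delay is most likely the fsGroup permission pass.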
Solution
When impacted, the recommended solution is to set fsGroupChangePolicy: OnRootMismatch. This can be done by adding a YAML snippet like the following to the configuration of the impacted managed controller(s) (from the operations center, click the gear icon for a controller, then Configure -> YAML):
---
apiVersion: apps/v1
kind: StatefulSet
spec:
  template:
    spec:
      securityContext:
        fsGroupChangePolicy: "OnRootMismatch"
The same configuration can be applied in the operations center under Manage Jenkins > Configure System > Kubernetes controller Provisioning > Advanced > YAML. This setting applies to newly created controllers only (including Team controllers).
Workaround
The workaround for the problem is to remove the fsGroup. Go to the controller item configuration, leave the FS Group field empty, and save. Then restart the controller from the operations center.
It is required to keep the fsGroup set to 1000 on volume creation - such as when creating a controller. It is safe to remove the fsGroup of existing volumes. The recommended strategy is to remove fsGroup only when impacted by this particular problem.