Migrating an existing managed controller to High Availability (HA)


High Availability (HA) in CloudBees CI on modern cloud platforms requires a storage class with the ReadWriteMany access mode. If you plan to run a managed controller in HA mode and it does not already use a ReadWriteMany storage class, you must migrate its volume.

The $JENKINS_HOME directory can contain a very large number of files, many of them small, so migrating it to a new storage class can take a long time. The copy is I/O-intensive and can degrade controller performance while it runs, and at some point you must stop the managed controller and restart it with the new configuration.

To minimize the outage window during the migration, follow these steps:

  • Create a volume using the new storage class.

  • Import a snapshot of the existing volume into the new volume.

  • Stop the controller.

  • Sync the latest changes from the source volume to the new volume.

  • Rename the existing volume claims. The binding between a controller and its volume claim is name based.

  • Update controller configuration to enable HA.

  • Start the controller, backed by the new volume.

Required privileges

The following procedures require a number of privileges in your Kubernetes/OpenShift cluster.

  • Job/{create,delete,get,list}

  • PersistentVolumeClaim/{create,delete,get,list}

  • PersistentVolume/{create,patch,delete,get,list}

Please ensure you have the corresponding privileges before proceeding to the next steps. Also ensure that the RWX storage class has been created and tested with a sample application.
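
As a rough pre-flight check, the privileges listed above can be probed with kubectl auth can-i. This is only a sketch: the function name check_privileges is made up for illustration, and the result reflects the namespace and credentials of your current kubectl context.

```shell
#!/bin/bash
# Sketch: probe the privileges listed in this section with `kubectl auth can-i`.
# check_privileges is a hypothetical helper name, not part of any product.
check_privileges() {
  local missing=0 entry resource verbs verb
  for entry in \
    "jobs:create delete get list" \
    "persistentvolumeclaims:create delete get list" \
    "persistentvolumes:create patch delete get list"; do
    resource="${entry%%:*}"
    verbs="${entry#*:}"
    for verb in $verbs; do
      # `kubectl auth can-i` prints "yes" or "no"
      if [ "$(kubectl auth can-i "$verb" "$resource" 2>/dev/null)" = "yes" ]; then
        echo "ok      $verb $resource"
      else
        echo "MISSING $verb $resource"
        missing=1
      fi
    done
  done
  return "$missing"
}
```

Calling check_privileges prints one line per verb and resource and returns a non-zero status if anything is missing.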

Cleaning up the source controller

Prior to starting the migration, consider the following steps to reduce the migration time.

  • Discard fingerprints if you don’t actively require them.

  • Discard old builds or any build that is not required for migration.

  • …​
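
To decide what is worth cleaning up, it can help to see which top-level directories hold the most files. The sketch below assumes you can run a shell somewhere the Jenkins home directory is mounted; count_files_by_dir is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: rank the top-level directories of a Jenkins home by number of files.
# count_files_by_dir is a hypothetical helper; pass it the path to the mounted home.
count_files_by_dir() {
  local home="${1:?usage: count_files_by_dir <jenkins_home>}"
  local dir
  for dir in "$home"/*/; do
    [ -d "$dir" ] || continue
    # Print "<file count><TAB><directory name>"
    printf '%s\t%s\n' "$(find "$dir" -type f | wc -l | tr -d ' ')" "$(basename "$dir")"
  done | sort -rn
}
```

For example, count_files_by_dir "$JENKINS_HOME" | head lists the directories that contribute most to the copy time.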

Take a volume snapshot of the source controller

Most cloud providers offer a way to take a snapshot of a live volume backed by block storage.

Please refer to your cloud provider documentation for explicit details on how to do this.

Create a volume from the latest snapshot

Using your cloud provider tools, create a new volume from the latest snapshot. Create it in the same availability zone as the original volume, so that it can be mounted from within the cluster. Note the volume ID; you will need it when editing the persistent volume definition.
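
As an illustration for Google Compute Engine only, the snapshot and the new disk could be created with the gcloud CLI roughly as follows. Other providers use different tools. The function name and all disk, snapshot, and zone names are placeholders chosen for this sketch, so check them against your own environment.

```shell
#!/bin/bash
# Sketch for Google Compute Engine only. snapshot_and_restore is a hypothetical
# helper; every name passed to it is a placeholder.
snapshot_and_restore() {
  local source_disk="${1:?source disk name}"
  local snapshot="${2:?snapshot name}"
  local new_disk="${3:?new disk name}"
  local zone="${4:?zone of the source disk}"

  # Take a snapshot of the live disk.
  gcloud compute disks snapshot "$source_disk" \
    --snapshot-names="$snapshot" --zone="$zone"

  # Create the new disk from that snapshot, in the same zone as the source.
  gcloud compute disks create "$new_disk" \
    --source-snapshot="$snapshot" --zone="$zone"
}
```

The new disk name (for example backup-volume) is the volume ID to note for the next step.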

Export the controller domain so that it can be used in the following commands and scripts.

export DOMAIN="<domain>"

Retrieve existing source volume

kubectl get pv "$(kubectl get "pvc/jenkins-home-${DOMAIN}-0" -o go-template='{{.spec.volumeName}}')" -o yaml > "pv-backup-jenkins-home-${DOMAIN}-0.yaml"

Edit pv-backup-jenkins-home-${DOMAIN}-0.yaml as follows:

  • Remove .metadata

  • Add .metadata.name, set it for example to backup-jenkins-home-$DOMAIN-0.

  • Remove .spec.claimRef

  • Remove .status

  • Edit .spec so that the volume ID reference points to the volume ID you noted earlier. The exact field depends on the cloud provider. For example, for GCE persistent disks it is .spec.gcePersistentDisk.pdName.

Here is a sample volume that applies to a GCE disk as reference.

apiVersion: v1
kind: PersistentVolume
metadata:
  labels:
    failure-domain.beta.kubernetes.io/region: us-east1
    failure-domain.beta.kubernetes.io/zone: us-east1-b
  name: backup-jenkins-home-$DOMAIN-0
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 100Gi
  gcePersistentDisk:
    fsType: ext4
    pdName: backup-volume
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - us-east1-b
        - key: failure-domain.beta.kubernetes.io/region
          operator: In
          values:
          - us-east1
  persistentVolumeReclaimPolicy: Delete
  storageClassName: my-source-storage-class
  volumeMode: Filesystem
  • Create the new persistent volume with kubectl create -f pv-backup-jenkins-home-${DOMAIN}-0.yaml
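
Before creating the claim, you may want to confirm that the new persistent volume is unclaimed. A minimal sketch, assuming the volume was created with no claimRef as described above; expect_pv_available is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: confirm that a persistent volume is in the Available phase.
# expect_pv_available is a hypothetical helper name.
expect_pv_available() {
  local pv="${1:?pv name}" phase
  phase="$(kubectl get pv "$pv" -o jsonpath='{.status.phase}')"
  if [ "$phase" = "Available" ]; then
    echo "pv/$pv is Available"
  else
    echo "pv/$pv is in phase '${phase:-unknown}', expected Available" >&2
    return 1
  fi
}
```

For example: expect_pv_available "backup-jenkins-home-${DOMAIN}-0".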

Create a new persistent volume claim referencing the new persistent volume

kubectl get "pvc/jenkins-home-${DOMAIN}-0" -o yaml > "pvc-backup-jenkins-home-${DOMAIN}-0.yaml"

Edit pvc-backup-jenkins-home-${DOMAIN}-0.yaml as follows:

  • Remove .metadata

  • Add .metadata.name, set it for example to backup-jenkins-home-${DOMAIN}-0

  • Edit .spec.volumeName to point to the persistent volume name you created just above (backup-jenkins-home-${DOMAIN}-0 unless you changed it to something else)

  • Remove .status

Here is a sample persistent volume claim applying to the previously referenced persistent volume.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: backup-jenkins-home-${DOMAIN}-0
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: <storage-class>
  volumeMode: Filesystem
  volumeName: backup-jenkins-home-${DOMAIN}-0
Replace <storage-class> with the name of the source storage class. It must match the storageClassName of the persistent volume created above.

kubectl create -f pvc-backup-jenkins-home-${DOMAIN}-0.yaml

Most storage classes now use volumeBindingMode: WaitForFirstConsumer, which means that the persistent volume claim is only bound once a pod that uses it is created.
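
To find out whether this applies to your storage class, you can read its volumeBindingMode field. The helper names below are hypothetical; the fallback to Immediate reflects the Kubernetes default when the field is not set.

```shell
#!/bin/bash
# Sketch: report the binding mode of a storage class.
# binding_mode and needs_consumer_pod are hypothetical helper names.
binding_mode() {
  local sc="${1:?storage class name}" mode
  mode="$(kubectl get storageclass "$sc" -o jsonpath='{.volumeBindingMode}')"
  # Kubernetes treats an unset mode as Immediate
  echo "${mode:-Immediate}"
}

# Succeeds when a pod must use the claim before it gets bound.
needs_consumer_pod() {
  [ "$(binding_mode "$1")" = "WaitForFirstConsumer" ]
}
```

For example: needs_consumer_pod my-source-storage-class && echo "a pod is required to bind the claim".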

  • Create a pod referencing the PVC you just created

Create allocate-backup-jenkins-home-${DOMAIN}-0.yaml as follows:

cat > allocate-backup-jenkins-home-${DOMAIN}-0.yaml <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: allocate-backup-jenkins-home-${DOMAIN}-0
spec:
  template:
    spec:
      volumes:
      - name: volume
        persistentVolumeClaim:
          claimName: backup-jenkins-home-${DOMAIN}-0
      containers:
      - name: busybox
        image: busybox
        command: ["true"]
        volumeMounts:
          - mountPath: /var/volume
            name: volume
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
      restartPolicy: Never
  backoffLimit: 4
EOF

Create the job using kubectl create -f allocate-backup-jenkins-home-${DOMAIN}-0.yaml

Then wait for the job to complete

kubectl wait --for=condition=complete job/allocate-backup-jenkins-home-${DOMAIN}-0

You can then delete the job.

kubectl delete job/allocate-backup-jenkins-home-${DOMAIN}-0
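
Once the job has run, the claim should be bound to the backup volume. The following sketch polls the claim phase until it reports Bound; wait_for_bound and the POLL_INTERVAL variable are names invented for this example.

```shell
#!/bin/bash
# Sketch: poll a PVC until it reaches the Bound phase.
# wait_for_bound and POLL_INTERVAL are hypothetical names.
wait_for_bound() {
  local pvc="${1:?pvc name}" tries="${2:-30}" phase i
  for ((i = 0; i < tries; i++)); do
    phase="$(kubectl get pvc "$pvc" -o jsonpath='{.status.phase}')"
    if [ "$phase" = "Bound" ]; then
      echo "pvc/$pvc is Bound"
      return 0
    fi
    sleep "${POLL_INTERVAL:-2}"
  done
  echo "pvc/$pvc is still '${phase:-unknown}'" >&2
  return 1
}
```

For example: wait_for_bound "backup-jenkins-home-${DOMAIN}-0".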

Create a volume using the new storage class

Create rwx-jenkins-home-${DOMAIN}-0.yaml as follows:

cat > rwx-jenkins-home-${DOMAIN}-0.yaml <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rwx-jenkins-home-${DOMAIN}-0
spec:
  storageClassName: <your-new-storage-class>
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 2560Gi # Change this to whatever your storage class requires, or to your needs
EOF

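Create the claim using kubectl create -f rwx-jenkins-home-${DOMAIN}-0.yaml

The previous job, adjusted to reference the new claim, can then be used to allocate the volume.

Create allocate-rwx-jenkins-home-${DOMAIN}-0.yaml as follows: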

cat > allocate-rwx-jenkins-home-${DOMAIN}-0.yaml <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: allocate-rwx-jenkins-home-${DOMAIN}-0
spec:
  template:
    spec:
      volumes:
      - name: volume
        persistentVolumeClaim:
          claimName: rwx-jenkins-home-${DOMAIN}-0
      containers:
      - name: busybox
        image: busybox
        command: ["true"]
        volumeMounts:
          - mountPath: /var/volume
            name: volume
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
      restartPolicy: Never
  backoffLimit: 4
EOF

Create the job using kubectl create -f allocate-rwx-jenkins-home-${DOMAIN}-0.yaml, then wait for it to complete

kubectl wait --for=condition=complete job/allocate-rwx-jenkins-home-${DOMAIN}-0
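
At this point it is worth confirming that the new claim really offers ReadWriteMany, since HA depends on it. A minimal sketch, assuming the claim is already bound (the access modes in .status are only populated then); pvc_is_rwx is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: check that a bound PVC offers the ReadWriteMany access mode.
# pvc_is_rwx is a hypothetical helper name.
pvc_is_rwx() {
  local pvc="${1:?pvc name}" modes
  # jsonpath prints the access modes separated by spaces
  modes="$(kubectl get pvc "$pvc" -o jsonpath='{.status.accessModes[*]}')"
  case " $modes " in
    *" ReadWriteMany "*) echo "pvc/$pvc offers ReadWriteMany" ;;
    *) echo "pvc/$pvc offers only: ${modes:-nothing yet}" >&2; return 1 ;;
  esac
}
```

For example: pvc_is_rwx "rwx-jenkins-home-${DOMAIN}-0".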

Initial sync: synchronize snapshot to new volume

Use the migration script below to synchronize the backup volume to the new volume.

Create the script pvc-sync.sh as follows:

#!/bin/bash
set -euo pipefail

if [ $# -lt 2 ]; then
  echo "Usage: $0 <source_pvc> <dest_pvc>"
  echo "Example: $0 backup-source-pvc new-volume-rwx"
  exit 1
fi
source_pvc="${1:?}"
dest_pvc="${2:?}"

if ! kubectl get "pvc/$source_pvc" -o name > /dev/null 2>&1; then
  echo "PVC $source_pvc does not exist."
  exit 1
fi

if ! kubectl get "pvc/$dest_pvc" -o name > /dev/null 2>&1; then
  echo "PVC $dest_pvc does not exist."
  exit 1
fi

if [ "$source_pvc" == "$dest_pvc" ]; then
  echo "Source and destination PVC must be different."
  exit 1
fi

echo "1. Migration step"
kubectl apply -f - <<JOB
apiVersion: batch/v1
kind: Job
metadata:
  name: migration
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      volumes:
      - name: volume1
        persistentVolumeClaim:
          claimName: ${source_pvc}
      - name: volume2
        persistentVolumeClaim:
          claimName: ${dest_pvc}
      containers:
      - name: migration
        image: registry.access.redhat.com/ubi8/ubi
        command: [sh]
        args: [-c, "dnf install -y rsync && rsync -avvu --delete /var/volume1/ /var/volume2"]
        volumeMounts:
          - mountPath: /var/volume1
            name: volume1
          - mountPath: /var/volume2
            name: volume2
        resources:
          limits:
            cpu: "2"
            memory: 4G
          requests:
            cpu: "2"
            memory: 4G
      restartPolicy: Never
  backoffLimit: 4
JOB

echo "Waiting for migration to complete"
echo "You can inspect progress using kubectl logs -f job/migration"
kubectl wait --for=condition=complete --timeout=900m job/migration
echo "== Data from $source_pvc has been copied over to $dest_pvc"
kubectl delete job migration

Make it executable with chmod +x pvc-sync.sh.

Then run ./pvc-sync.sh backup-jenkins-home-${DOMAIN}-0 rwx-jenkins-home-${DOMAIN}-0 to start the initial synchronization.

Depending on the volume size, this can take a lot of time (multiple hours).
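
Because the script waits only for the complete condition, a job that fails outright would keep it waiting until the timeout. If you want to check on a long-running sync from another terminal, a sketch like the following reads the job conditions; job_state is a hypothetical helper name, and migration is the job name used by pvc-sync.sh.

```shell
#!/bin/bash
# Sketch: summarize a Job as running, complete, or failed from its conditions.
# job_state is a hypothetical helper name.
job_state() {
  local job="${1:?job name}" conditions
  # Collect the types of all conditions whose status is "True"
  conditions="$(kubectl get job "$job" \
    -o jsonpath='{.status.conditions[?(@.status=="True")].type}')"
  case " $conditions " in
    *" Failed "*)   echo failed ;;
    *" Complete "*) echo complete ;;
    *)              echo running ;;
  esac
}
```

For example: while [ "$(job_state migration)" = running ]; do sleep 30; done.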

Once completed, ensure that the ownership of the root directory of the new volume matches the uid/gid your controller runs as (1000/1000 on Kubernetes; it differs on OpenShift, so check your actual setup before applying this).

After obtaining a shell in a pod that mounts the new volume at /var/volume, run:

chown 1000:1000 /var/volume
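
One possible way to obtain such a shell is a short-lived pod that mounts the claim at /var/volume. This is only a sketch: the pod name volume-shell and the helper shell_pod_manifest are made up for illustration, and it assumes the busybox container is allowed to run as root, which may not hold on OpenShift.

```shell
#!/bin/bash
# Sketch: print a manifest for a temporary pod that mounts a PVC at /var/volume.
# shell_pod_manifest and the pod name volume-shell are hypothetical.
shell_pod_manifest() {
  local pvc="${1:?pvc name}"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: volume-shell
spec:
  restartPolicy: Never
  volumes:
  - name: volume
    persistentVolumeClaim:
      claimName: ${pvc}
  containers:
  - name: shell
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - mountPath: /var/volume
      name: volume
EOF
}
```

A possible sequence is shell_pod_manifest "rwx-jenkins-home-${DOMAIN}-0" | kubectl apply -f -, then kubectl wait --for=condition=Ready pod/volume-shell, then kubectl exec -it pod/volume-shell -- sh, and finally kubectl delete pod/volume-shell once the ownership has been fixed.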

OUTAGE STARTS HERE - Stop the controller

In Operations Center, stop the controller.
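
Before touching the volume claims, you may want to confirm that the controller pod is really gone. The sketch below assumes the pod is named <domain>-0, which is inferred from the jenkins-home-<domain>-0 claim name and may not match your setup; wait_for_controller_stop is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: wait until the controller pod has been deleted.
# Assumption: the pod is named <domain>-0. wait_for_controller_stop is hypothetical.
wait_for_controller_stop() {
  local pod="${1:?controller domain}-0" timeout="${2:-300s}"
  if ! kubectl get "pod/$pod" -o name > /dev/null 2>&1; then
    echo "pod/$pod is already gone"
    return 0
  fi
  kubectl wait --for=delete "pod/$pod" --timeout="$timeout"
}
```

For example: wait_for_controller_stop "$DOMAIN".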

Rename source volume

Create a script rename_pvc.sh with the following content

#!/bin/bash
set -euo pipefail

# The script is invoked with the current claim name and the new claim name
if [ $# -ne 2 ]; then
  echo "Usage: $0 <source_pvc> <dest_pvc>"
  echo "Example: $0 jenkins-home-mc-0 old-jenkins-home-mc-0"
  exit 1
fi
source_pvc="${1:?}"
dest_pvc="${2:?}"

if ! kubectl get "pvc/$source_pvc" -o name > /dev/null 2>&1; then
  echo "PVC $source_pvc does not exist."
  exit 1
fi

if kubectl get "pvc/$dest_pvc" -o name > /dev/null 2>&1; then
  echo "PVC $dest_pvc already exists. It will be replaced by persistent volume of $source_pvc."
  read -p "Are you sure? [y/N] " -n 1 -r
  echo
  # Abort unless the user explicitly confirms
  if [[ ! $REPLY =~ ^[Yy]$ ]]; then
    echo "Aborted."
    exit 1
  fi
  # Delete the existing destination PVC, keep its PV as backup
  pv1_name=$(kubectl get "pvc/${dest_pvc}" -o go-template='{{.spec.volumeName}}')
  echo "$pv1_name" > old_pv
  echo "== ${dest_pvc} points to pv/${pv1_name}"
  kubectl patch pv "${pv1_name}" -p '{"spec": {"persistentVolumeReclaimPolicy": "Retain"}}'
  echo "Deleting ${dest_pvc}, we keep PV ${pv1_name} around"
  kubectl delete "pvc/${dest_pvc}"
  #kubectl patch pv ${pv1_name} -p '{"spec":{"claimRef": null}}'
fi

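# -p so that a leftover directory from an interrupted run does not abort the script
mkdir -p generated
# Switch the PV reclaim policy to "Retain" so that deleting the claim keeps the data,
# then recreate the claim under its new name.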

pv_name=$(kubectl get "pvc/${source_pvc}" -o go-template={{.spec.volumeName}})
echo "== ${source_pvc} points to pv/${pv_name}"
kubectl get "pvc/${source_pvc}" -o yaml > generated/source_pvc.yaml
kubectl patch pv ${pv_name} -p '{"spec": {"persistentVolumeReclaimPolicy": "Retain"}}'
kubectl delete "pvc/${source_pvc}"
kubectl patch pv ${pv_name} -p '{"spec":{"claimRef": null}}'

# We ideally want to retain any user annotation
cat <<EOF >generated/patch.yaml
- op: replace
  path: /metadata/name
  value: ${dest_pvc}
- op: replace
  path: /spec/volumeName
  value: ${pv_name}
- op: remove
  path: /metadata/finalizers
- op: remove
  path: /metadata/creationTimestamp
- op: remove
  path: /metadata/namespace
- op: remove
  path: /metadata/resourceVersion
- op: remove
  path: /metadata/uid
- op: remove
  path: /status
EOF

cat <<EOF >generated/kustomization.yaml
patches:
 - target:
      version: v1
      kind: PersistentVolumeClaim
      name: ${source_pvc}
   path: patch.yaml
resources:
- source_pvc.yaml
EOF

trap "rm -rf generated" EXIT

kubectl apply -k generated/
pv_name=$(kubectl get pvc/${dest_pvc} -o go-template={{.spec.volumeName}})
echo "== ${dest_pvc} points to pv/${pv_name}"
echo "== Resetting ${pv_name} retain policy to Delete"
kubectl patch pv ${pv_name} -p '{"spec": {"persistentVolumeReclaimPolicy": "Delete"}}'

echo "== You can rename ${dest_pvc} back to ${source_pvc} using $0 ${dest_pvc} ${source_pvc}"

Make it executable with chmod +x rename_pvc.sh.

Then run ./rename_pvc.sh jenkins-home-$DOMAIN-0 old-jenkins-home-$DOMAIN-0

Delta sync: synchronize source volume to new volume

This is the same as the initial sync, except that the source PVC is no longer the backup but the PVC holding the live data, which contains all the latest changes.

This delta sync will take a fraction of the time of the initial sync, however it can still be lengthy depending on the number of files in your filesystem, as rsync will still scan them to determine whether they have changed.

Using the pvc-sync.sh script created previously, run ./pvc-sync.sh old-jenkins-home-${DOMAIN}-0 rwx-jenkins-home-${DOMAIN}-0

Edit the controller configuration

This can be done while the delta sync is ongoing.

  • Enable High Availability. Instead of a StatefulSet, a Deployment will now be used to manage the controller pods. If you have existing YAML customizations, you will need to adjust them to replace StatefulSet with Deployment.

  • Set to 1 replica (can be increased later)

  • Set Storage Class Name to the new storage class name

  • Edit or set YAML to

apiVersion: "apps/v1"
kind: Deployment
spec:
  template:
    spec:
      securityContext:
        fsGroupChangePolicy: OnRootMismatch

Rename new volume to the previous name

Rename the new volume to the previous name, so that the controller will be able to mount it.

Run ./rename_pvc.sh rwx-jenkins-home-$DOMAIN-0 jenkins-home-$DOMAIN-0
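
Before starting the controller, a quick overview of the three claims can confirm that each name points to the volume you expect. A sketch using kubectl custom columns; claims_overview is a hypothetical helper name, and kubectl reports an error for any claim that no longer exists while still listing the others.

```shell
#!/bin/bash
# Sketch: show which volume and access modes each claim of a controller has.
# claims_overview is a hypothetical helper name.
claims_overview() {
  local domain="${1:?controller domain}"
  kubectl get pvc \
    "jenkins-home-${domain}-0" \
    "old-jenkins-home-${domain}-0" \
    "backup-jenkins-home-${domain}-0" \
    -o custom-columns='NAME:.metadata.name,VOLUME:.spec.volumeName,MODES:.spec.accessModes[*],STATUS:.status.phase'
}
```

For example: claims_overview "$DOMAIN". The jenkins-home claim should now list ReadWriteMany and the volume that previously backed the rwx claim.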

OUTAGE ENDS HERE - Start the controller on the new volume

In Operations Center, start the managed controller.

Remove the previous volume

Once you’ve confirmed that everything is running fine on the new volume, the previous volume as well as its backup volume can be removed.

kubectl delete pvc backup-jenkins-home-$DOMAIN-0 old-jenkins-home-$DOMAIN-0

If the persistentVolumeReclaimPolicy of the backup persistent volume is not set to Delete, you may also need to remove that volume afterward.

kubectl delete pv backup-jenkins-home-$DOMAIN-0
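
To spot any volume left behind after the claims are deleted, you can list persistent volumes in the Released phase. A minimal sketch; released_volumes is a hypothetical helper name, and the list covers the whole cluster, so review it carefully before deleting anything.

```shell
#!/bin/bash
# Sketch: list the names of persistent volumes whose claim has been deleted.
# released_volumes is a hypothetical helper name.
released_volumes() {
  kubectl get pv \
    -o jsonpath='{range .items[?(@.status.phase=="Released")]}{.metadata.name}{"\n"}{end}'
}
```

Calling released_volumes prints one volume name per line.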