Troubleshooting CloudBees CI on Amazon EKS

There are a number of resources that you can use to troubleshoot a CloudBees CI failure.

This section covers each of them.

Consult the Knowledge Base

The Knowledge Base can be very helpful in troubleshooting problems with CloudBees CI and can be accessed on the CloudBees Support site.

Instance provisioning

Operations center provisioning

  1. Check the pod status.

  2. Verify that all associated objects have been created: pod, svc, statefulset (DESIRED 1, CURRENT 1), ingress, pvc, and pv.

  3. Check the events related to the pod and its associated objects (see the tables in the Pods events and PVCs events sections).

  4. Check the Jenkins logs.

Managed controller provisioning

  1. Check the connectivity logs. They will probably give you the answer to the problem.

  2. Check the pod status.

  3. Verify that all associated objects have been created: pod, svc, statefulset (DESIRED 1, CURRENT 1), ingress, pvc, and pv.

  4. Check the events related to the pod and its associated objects (see the tables in the Pods events and PVCs events sections).

  5. Check the Jenkins logs.
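
The pod, object, event, and log checks in the two lists above can be chained into one helper. The following is a minimal sketch rather than an official CloudBees tool: the function name check_instance and the default namespace cloudbees-core are assumptions for illustration, and it relies on the <name>-0 pod and jenkins-home-<name>-0 PVC naming that appears in the sample output on this page.

```shell
# Sketch only: check_instance and the default namespace are illustrative assumptions.
check_instance() {
  local name="$1"                     # instance name, e.g. cjoc or master1
  local ns="${2:-cloudbees-core}"     # assumed namespace; pass your own as $2
  local pod="${name}-0"               # pod created by the instance's StatefulSet

  echo "== Pod status =="
  kubectl -n "$ns" get pod "$pod" -o wide

  echo "== Associated objects =="
  kubectl -n "$ns" get "statefulset/${name}" "svc/${name}" "ingress/${name}" \
    "pvc/jenkins-home-${pod}"

  echo "== Events =="
  # Print only the Events section of the describe output
  kubectl -n "$ns" describe pod "$pod" | sed -n '/^Events:/,$p'

  echo "== Jenkins logs (last 100 lines) =="
  kubectl -n "$ns" logs --tail=100 "$pod"
}
```

For example, check_instance master1 runs the checks for a managed controller named master1, and check_instance cjoc runs them for the operations center.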

Build agent provisioning

  1. Check the Jenkins logs. They will probably give you the answer to the problem.

  2. Check the Kubernetes events.

  3. Review the Kubernetes shared cloud item configuration in operations center.

CloudBees CI basic operations

Viewing cluster resources

# Gives you quick readable detail
$ kubectl get pod,statefulset,svc,ingress,pvc,pv -o wide
# Gives you high level of detail
$ kubectl get pod,statefulset,svc,ingress,pvc,pv -o yaml
# Describe commands with verbose output
$ kubectl describe <TYPE> <NAME>
# Gives you backend storage information
$ kubectl get storageclass

Troubleshooting failed builds in pods

To troubleshoot failing builds that use Kubernetes pods for the build agent:

  1. Go to your Kubernetes pod template configuration.

  2. Change the Pod Retention setting to On Failure.

  3. Run your build again.

Note the line near the top of the new build's log that starts with Agent java-1r1t5 is provisioned from template …; in this example, java-1r1t5 is the name of the pod used for the build.

Wait for that build to fail. The Pod Retention setting leaves the Kubernetes pod running even after your build fails. You can then use the steps in the Pod access section to get direct shell access to the build pod, and you can try running the failed command again. Check the build log to see what command failed and try running that yourself. Involve the maintainer of the source code that failed to build in the diagnostic process.

While Pod Retention is set to On Failure, you must clean up failed build pods manually with kubectl delete pod POD_NAME, so keep this setting enabled only while you are troubleshooting a build issue.
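
The manual cleanup can be wrapped in a small helper that reports the pod phase before deleting it. This is a minimal sketch; the function name cleanup_build_pod is hypothetical, and the pod name is the one taken from the build log as described above.

```shell
# Sketch only: cleanup_build_pod is a hypothetical helper name.
cleanup_build_pod() {
  local pod="$1"      # pod name from the build log, e.g. java-1r1t5
  local phase

  # Fail early if the pod does not exist (kubectl returns non-zero)
  phase="$(kubectl get pod "$pod" -o jsonpath='{.status.phase}')" || return 1
  echo "Pod ${pod} is in phase ${phase}; deleting"
  kubectl delete pod "$pod"
}
```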

Pod access

# Access to the bash
$ kubectl exec <POD_NAME> -i -t -- bash -li
master2-0:/$ ps -ef
PID   USER     TIME   COMMAND
    1 jenkins    0:00 /sbin/tini -- /usr/local/bin/launch.sh
    5 jenkins    1:53 java -Dhudson.slaves.NodeProvisioner.initialDelay=0 -Duser.home=/var/jenkins_home -Xmx1433m -Xms1433m -Djenkins.model.Jenkins.slaveAgentPortEnforce=true -Djenkins.model.Jenkins.slav
  481 jenkins    0:00 bash -li
  485 jenkins    0:00 ps -ef

# Bash execution command
$ kubectl exec <POD_NAME> -- ps -ef
PID   USER     TIME   COMMAND
    1 jenkins    0:00 /sbin/tini -- /usr/local/bin/launch.sh
    5 jenkins    2:05 java -Dhudson.slaves.NodeProvisioner.initialDelay=0 -Duser.home=/var/jenkins_home -Xmx1433m -Xms1433m -Djenkins.model.Jenkins.slaveAgentPortEnforce=true -Djenkins.model.Jenkins.slaveAgentPort=50000 -DMASTER_GRANT_ID=270bd80c-3e5c-498c-88fe-35ac9e11f3d3 -Dcb.IMProp.warProfiles.cje=kubernetes.json -DMASTER_INDEX=1 -Dcb.IMProp.warProfiles=kubernetes.json -DMASTER_OPERATIONSCENTER_ENDPOINT=http://cjoc/cjoc -DMASTER_NAME=master2 -DMASTER_ENDPOINT=http://cje.support-core.beescloud.k8s.local/master2/ -jar -Dcb.distributable.name=Docker Common CJE -Dcb.distributable.commit_sha=888f01a54c12cfae5c66ec27fd4f2a7346097997 /usr/share/jenkins/jenkins.war --webroot=/tmp/jenkins/war --pluginroot=/tmp/jenkins/plugins --prefix=/master2/
  645 jenkins    0:00 ps -ef

Access to the pod logs

kubectl logs -f <POD_NAME>

Pod scale down/up

$ kubectl scale statefulset/master2 --replicas=0
statefulset "master2" scaled
$ kubectl get statefulset -o wide
NAME      DESIRED   CURRENT   AGE   CONTAINERS   IMAGES
cjoc      1         1         1d    jenkins      cloudbees/cloudbees-cloud-core-oc:2.107.1.2
master1   1         1         2h    jenkins      cloudbees/cloudbees-core-mm:2.107.1.2
master2   0         0         36m   jenkins      cloudbees/cloudbees-core-mm:2.107.1.2

CloudBees CI Cluster resources

During the installation of CloudBees CI, the following service accounts, roles, and role bindings are created.

$ kubectl get sa,role,rolebinding
NAME         SECRETS   AGE
sa/cjoc      1         21h
sa/default   1         21h
sa/jenkins   1         21h

NAME                      AGE
roles/master-management   21h
roles/pods-all            21h

NAME                   AGE
rolebindings/cjoc      21h
rolebindings/jenkins   21h

Once the installation is complete and the CloudBees CI cluster is up and running, you can check the status of the most important CloudBees CI resources: pod, statefulset, svc, ingress, pvc, and pv.

$ kubectl get pod,statefulset,svc,ingress,pvc,pv
NAME           READY   STATUS    RESTARTS   AGE
po/cjoc-0      1/1     Running   0          21h
po/master1-0   1/1     Running   0          14h

NAME                   DESIRED   CURRENT   AGE
statefulsets/cjoc      1         1         21h
statefulsets/master1   1         1         14h

NAME          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)            AGE
svc/cjoc      ClusterIP   100.66.207.191   <none>        80/TCP,50000/TCP   21h
svc/master1   ClusterIP   100.67.1.49      <none>        80/TCP,50000/TCP   14h

NAME          HOSTS                                  ADDRESS            PORTS   AGE
ing/cjoc      cje.support-core.beescloud.k8s.local   af9463f6a2b68...   80      21h
ing/default   cje.support-core.beescloud.k8s.local   af9463f6a2b68...   80      21h
ing/master1   cje.support-core.beescloud.k8s.local   af9463f6a2b68...   80      14h

NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pvc/jenkins-home-cjoc-0      Bound    pvc-c5cad012-2b69-11e8-80fc-12582571ed5c   20Gi       RWO            gp2            21h
pvc/jenkins-home-master1-0   Bound    pvc-e4b5e473-2ba2-11e8-80fc-12582571ed5c   50Gi       RWO            gp2            14h

NAME                                          CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                        STORAGECLASS   REASON   AGE
pv/pvc-c5cad012-2b69-11e8-80fc-12582571ed5c   20Gi       RWO            Delete           Bound    cje-on-support-core/jenkins-home-cjoc-0      gp2                     21h
pv/pvc-e4b5e473-2ba2-11e8-80fc-12582571ed5c   50Gi       RWO            Delete           Bound    cje-on-support-core/jenkins-home-master1-0   gp2                     14h

See Kubernetes Troubleshoot Clusters for more information.

In the following sections the expected results of different Kubernetes resources are defined. The definition of each Kubernetes resource was taken from Kubernetes official documentation.

Pods

A pod is the smallest and simplest Kubernetes object, which represents a set of running containers on your cluster. A Pod is typically set up to run a single primary container, although a pod can also run optional sidecar containers that add supplementary features like logging. Pods are commonly managed by a Deployment.

The kubectl get pod command lists the applications currently running in the cluster. Applications that are stopped or not deployed do not appear as pods in the cluster.

$ kubectl get pod
NAME           READY   STATUS    RESTARTS   AGE
po/cjoc-0      1/1     Running   0          21h
po/master1-0   1/1     Running   0          14h

See Kubernetes Debugging Pods for more information.

Pods events

Pod events provide you insights about why a specific pod is failing to start in the cluster. In other words, pod events will tell you the reason why a specific application cannot start or be deployed in the cluster.

The table below summarizes the most common pod events that can occur in CloudBees CI.

To get the list of events associated with a given pod you will need to run:

$ kubectl describe pod the_pod_name

For example:

$ kubectl describe pod cjoc-0

| Status | Events | Cause |
|---|---|---|
| ImagePullBackOff | | The image you are using cannot be found in the Docker registry, or, when using a private registry, no secret is configured. |
| Pending | Node issues | See below. Get node information with kubectl describe nodes. |
| Pending | Insufficient memory | Not enough memory. Either add nodes or increase the node size in the cluster, or reduce the memory requirement of the operations center (YAML file) or the controller (under configuration). |
| Pending | Insufficient cpu | Not enough CPUs. Either add nodes or increase the node size in the cluster, or reduce the CPU requirement of the operations center (YAML file) or the controller (under configuration). |
| Pending | NoVolumeZoneConflict | There are no nodes available in the zone where the persistent volume was created. Start more nodes in that zone. |
| CrashLoopBackOff | | Find out why the Docker container crashes. The first and easiest check is whether the output of the previous startup contains any errors. |
| Running but restarting every so often | describe pod shows Last State: Terminated, Reason: OOMKilled, Exit Code: 137 | The Xmx or MaxRAM JVM parameters are too high for the container memory. Try increasing the memory limit. |
| Unknown | | This usually indicates a bad node, if several pods on that node are in the same state. Check with kubectl get pods --all-namespaces -o wide. |
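
For the CrashLoopBackOff and OOMKilled cases, the relevant details can be read directly instead of scanning the full describe output. The following is a minimal sketch with hypothetical helper names; kubectl logs --previous and the lastState.terminated status fields are standard Kubernetes. Exit code 137 is 128 + 9, which means the container was stopped with SIGKILL, as happens when it exceeds its memory limit.

```shell
# Sketch only: the helper names are hypothetical.

# Print "<container> <reason> <exit code>" for the last termination of each container
pod_last_termination() {
  local pod="$1"
  kubectl get pod "$pod" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{" "}{.lastState.terminated.reason}{" "}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Show the tail of the logs from the previous (crashed) container instance
pod_previous_logs() {
  local pod="$1"
  local lines="${2:-50}"    # number of lines to show, default 50
  kubectl logs --previous --tail="$lines" "$pod"
}
```

A container that has never been restarted has no lastState.terminated entry, so its reason and exit code are printed as empty.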

StatefulSet

A StatefulSet manages the deployment and scaling of a set of Pods and provides guarantees about the ordering and uniqueness of these Pods.

Like a Deployment, a StatefulSet manages Pods that are based on an identical container spec. Unlike a Deployment, a StatefulSet maintains a sticky identity for each of its Pods. These pods are created from the same spec, but are not interchangeable: each has a persistent identifier that it maintains across any rescheduling.

A StatefulSet operates under the same pattern as any other Controller. You define your desired state in a StatefulSet object and the StatefulSet controller makes any necessary updates to get there from the current state.

$ kubectl get statefulset
NAME                   DESIRED   CURRENT   AGE
statefulsets/cjoc      1         1         21h
statefulsets/master1   1         1         14h

In CloudBees CI, the expected DESIRED and CURRENT value for any running application is 1. Neither Jenkins nor build agents support more than one instance running at the same time.
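
This expectation can be checked in one pass. The following is a minimal sketch; the function name check_statefulset_replicas is hypothetical, and it compares the requested replicas (.spec.replicas) with the ready replicas (.status.readyReplicas) of every StatefulSet in the current namespace. A StatefulSet that was intentionally scaled to 0 is also reported.

```shell
# Sketch only: check_statefulset_replicas is a hypothetical helper name.
# Prints every StatefulSet whose desired or ready replica count is not 1
# and returns a non-zero status if any was found.
check_statefulset_replicas() {
  kubectl get statefulset \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.replicas}{" "}{.status.readyReplicas}{"\n"}{end}' |
  awk '$2 != 1 || $3 != 1 {
         # readyReplicas is omitted by the API when it is zero
         print $1 ": desired=" $2 " ready=" ($3 == "" ? 0 : $3)
         bad = 1
       }
       END { exit bad }'
}
```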

Service

A service is the API object that describes how to access applications (such as a set of Pods) and can describe ports and load-balancers.

The access point can be internal or external to the cluster.

$ kubectl get svc
NAME          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)            AGE
svc/cjoc      ClusterIP   100.66.207.191   <none>        80/TCP,50000/TCP   21h
svc/master1   ClusterIP   100.67.1.49      <none>        80/TCP,50000/TCP   14h

A service must exist for each application running in the cluster. Otherwise, the application will not be accessible.

See Kubernetes Debugging Services for more information.

Ingress

Ingresses represent the routes to access the applications, where an ingress could be thought of as a Load Balancer.

$ kubectl get ingress
NAME          HOSTS                                  ADDRESS            PORTS   AGE
ing/cjoc      cje.support-core.beescloud.k8s.local   af9463f6a2b68...   80      21h
ing/default   cje.support-core.beescloud.k8s.local   af9463f6a2b68...   80      21h
ing/master1   cje.support-core.beescloud.k8s.local   af9463f6a2b68...   80      14h

The required ingresses for CloudBees CI to work are:

  • An ing/default ingress as the default entry point to the cluster

  • An ing/cjoc ingress for access to the operations center

  • An ing/<MASTER_ID> ingress for access to each controller

The product expects these ingresses to be present and so they must not be modified - even to reduce the complexity of scope. Modifying ingresses at the Kubernetes level might produce issues in the product, such as managed controllers becoming unable to communicate correctly with the operations center.
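
A read-only check that the required ingresses exist can look like the following. This is a minimal sketch; the function name check_required_ingresses is hypothetical, and the controller names are passed as arguments because they differ per installation. It only lists ingresses and never modifies them.

```shell
# Sketch only: check_required_ingresses is a hypothetical helper name.
# Usage: check_required_ingresses <controller name>...
check_required_ingresses() {
  local existing name missing=0

  # One ingress name per line, from the current namespace
  existing="$(kubectl get ingress \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')"

  # default and cjoc are always required; controllers come from the arguments
  for name in default cjoc "$@"; do
    if ! printf '%s\n' "$existing" | grep -qxF -- "$name"; then
      echo "missing ingress: ${name}"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, check_required_ingresses master1 master2 reports any of default, cjoc, master1, and master2 that is absent and returns a non-zero status in that case.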

Persistent Volume Claims (PVC)

Persistent volume claims (PVCs) represent the volumes associated with each application running in the cluster.

$ kubectl get pvc
NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pvc/jenkins-home-cjoc-0      Bound    pvc-c5cad012-2b69-11e8-80fc-12582571ed5c   20Gi       RWO            gp2            21h
pvc/jenkins-home-master1-0   Bound    pvc-e4b5e473-2ba2-11e8-80fc-12582571ed5c   50Gi       RWO            gp2            14h

PVCs events

The table below summarizes the most common events associated with PVCs that can occur in CloudBees CI.

To obtain the list of events associated with a given PVC, run:

$ kubectl describe pvc the_pvc_name

For example:

$ kubectl describe pvc jenkins-home-cjoc-0

| Status | Events | Cause |
|---|---|---|
| Pending | no persistent volumes available for this claim and no storage class is set | There is no default storageclass. Follow these instructions to set a default storageclass. |
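
Whether a default storage class exists can be checked from the command line. The following is a minimal sketch with hypothetical helper names. It relies on kubectl get storageclass appending (default) to the name of the default class, and on the storageclass.kubernetes.io/is-default-class annotation described in the Kubernetes documentation; the class name gp2 in the usage note is taken from the sample output on this page.

```shell
# Sketch only: the helper names are hypothetical.

# Print the name of the default storage class; prints nothing if none is set
default_storageclass() {
  kubectl get storageclass --no-headers | awk '$2 == "(default)" { print $1 }'
}

# Mark the given storage class as the cluster default
set_default_storageclass() {
  local class="$1"
  kubectl patch storageclass "$class" \
    -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
}
```

If default_storageclass prints nothing, running set_default_storageclass gp2 marks gp2 as the default.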

Persistent Volume (PV)

Persistent volumes (PVs) represent the volumes created in the cluster.

$ kubectl get pv
NAME                                          CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                        STORAGECLASS   REASON   AGE
pv/pvc-c5cad012-2b69-11e8-80fc-12582571ed5c   20Gi       RWO            Delete           Bound    cje-on-support-core/jenkins-home-cjoc-0      gp2                     21h
pv/pvc-e4b5e473-2ba2-11e8-80fc-12582571ed5c   50Gi       RWO            Delete           Bound    cje-on-support-core/jenkins-home-master1-0   gp2                     14h

Accessing $JENKINS_HOME

Accessing Jenkins Home directory (pod running)

By running the following sequence of commands, you can ascertain the path of the $JENKINS_HOME inside a given pod and a specific CloudBees CI instance.

# Get the location of the $JENKINS_HOME
$ kubectl describe pod master2-0 | grep " jenkins-home " | awk '{print $1}'
/var/jenkins_home

# Access the bash of a given pod
$ kubectl exec master2-0 -i -t -- bash -i -l
master2-0:/$ cd /var/jenkins_home/
master2-0:~$ ps -ef
PID   USER     TIME   COMMAND
    1 jenkins    0:00 /sbin/tini -- /usr/local/bin/launch.sh
    5 jenkins    1:46 java -Dhudson.slaves.NodeProvisioner.initialDelay=0 -Duser.home=/var/jenkins_home -Xmx1433m -Xms1433m -Djenkins.model.Jenkins.slaveAgentPortEnforce=true -Djenkins.model.Jenkins.slav
  516 jenkins    0:00 bash -i -l
  524 jenkins    0:00 ps -ef
master2-0:~$ ps -ef | grep java
    5 jenkins    1:46 java -Dhudson.slaves.NodeProvisioner.initialDelay=0 -Duser.home=/var/jenkins_home -Xmx1433m -Xms1433m -Djenkins.model.Jenkins.slaveAgentPortEnforce=true -Djenkins.model.Jenkins.slaveAgentPort=50000 -DMASTER_GRANT_ID=270bd80c-3e5c-498c-88fe-35ac9e11f3d3 -Dcb.IMProp.warProfiles.cje=kubernetes.json -DMASTER_INDEX=1 -Dcb.IMProp.warProfiles=kubernetes.json -DMASTER_OPERATIONSCENTER_ENDPOINT=http://cjoc/cjoc -DMASTER_NAME=master2 -DMASTER_ENDPOINT=http://cje.support-core.beescloud.k8s.local/master2/ -jar -Dcb.distributable.name=Docker Common CJE -Dcb.distributable.commit_sha=888f01a54c12cfae5c66ec27fd4f2a7346097997 /usr/share/jenkins/jenkins.war --webroot=/tmp/jenkins/war --pluginroot=/tmp/jenkins/plugins --prefix=/master2/
  528 jenkins    0:00 grep java

# Operations to be done. This is an example
$ kubectl cp master2-0:/var/jenkins_home/jobs/ ./jobs/
tar: removing leading '/' from member names

Accessing Jenkins Home directory (pod not running)

# Stop a pod
$ kubectl scale statefulset/master2 --replicas=0
statefulset "master2" scaled

# Create a new rescue-pod running something that has no effect
# on the $JENKINS_HOME
$ cat <<EOF | kubectl create -f -
kind: Pod
apiVersion: v1
metadata:
  name: rescue-pod
spec:
  volumes:
    - name: rescue-storage
      persistentVolumeClaim:
        claimName: jenkins-home-master2-0
  containers:
    - name: rescue-container
      image: nginx
      volumeMounts:
        - mountPath: "/tmp/jenkins-home"
          name: rescue-storage
EOF

# Access to the bash of the rescue-pod
$ kubectl exec rescue-pod -i -t -- bash -i -l
mesg: ttyname failed: Success
root@rescue-pod:/# cd /tmp/jenkins-home/
root@rescue-pod:/tmp/jenkins-home#

# Operations to be done. This is an example
$ kubectl cp rescue-pod:/tmp/jenkins-home/jobs/ ./jobs/
tar: removing leading '/' from member names

# Delete the rescue pod
$ kubectl delete pod rescue-pod
pod "rescue-pod" deleted

# Start the pod
$ kubectl scale statefulset/master2 --replicas=1
statefulset "master2" scaled

Operations center setup customization

The operations center instance can be configured either by editing cloudbees-core.yml or by using the Kubernetes command line.

# Set the memory to 2G
$ kubectl patch statefulset cjoc -p '{"spec":{"template":{"spec":{"containers":[{"name":"jenkins","resources":{"limits":{"memory": "2G"}}}]}}}}'
statefulset "cjoc" patched

# Set initialDelay to 320 seconds
$ kubectl patch statefulset cjoc -p '{"spec":{"template":{"spec":{"containers":[{"name":"jenkins","livenessProbe":{"initialDelaySeconds":320}}]}}}}'
statefulset "cjoc" patched

# Set timeout to 10 seconds
$ kubectl patch statefulset cjoc -p '{"spec":{"template":{"spec":{"containers":[{"name":"jenkins","livenessProbe":{"timeoutSeconds":10}}]}}}}'
statefulset "cjoc" patched
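
After patching, the resulting values can be read back from the StatefulSet to confirm that they were applied. The following is a minimal sketch with hypothetical helper names; it assumes, as in the patch commands above, that the StatefulSet is named cjoc and its container is named jenkins.

```shell
# Sketch only: the helper names are hypothetical.

# Read one field of the "jenkins" container in the cjoc StatefulSet template
cjoc_field() {
  kubectl get statefulset cjoc \
    -o jsonpath="{.spec.template.spec.containers[?(@.name==\"jenkins\")].$1}"
}

# Print the three settings changed by the patch commands
show_cjoc_settings() {
  echo "memory limit: $(cjoc_field resources.limits.memory)"
  echo "liveness initialDelaySeconds: $(cjoc_field livenessProbe.initialDelaySeconds)"
  echo "liveness timeoutSeconds: $(cjoc_field livenessProbe.timeoutSeconds)"
}
```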

Performance issues - high CPU/blocked threads

# Export cluster information
$ kubectl get pod,svc,endpoints,statefulset,ingress,pvc,pv,sa,role,rolebinding -o yaml > to-el-cluster.yml

# Collect performance data
# https://docs.cloudbees.com/docs/cbsupport/latest/commands/cbsupport_required-data_performance
$ cbsupport required-data performance

Expanding managed controller disk space

A managed controller may run out of disk space when the provisioned storage is insufficient for the users of that controller. Kubernetes persistent volumes are fixed sizes defined when the volume is created. Expanding the space for a managed controller requires a restore of the existing managed controller into a persistent volume for a new (larger) managed controller.
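
Before resizing, it can help to confirm how much space is provisioned and how much is in use. The following is a minimal sketch; the function name controller_disk_usage and the default namespace cloudbees-core are assumptions, and it presumes that df is available in the controller image and that $JENKINS_HOME is /var/jenkins_home as shown earlier on this page.

```shell
# Sketch only: controller_disk_usage and the default namespace are assumptions.
controller_disk_usage() {
  local name="$1"                   # controller name, e.g. master1
  local ns="${2:-cloudbees-core}"   # assumed namespace; pass your own as $2

  # Capacity recorded on the persistent volume claim
  echo "Provisioned: $(kubectl -n "$ns" get pvc "jenkins-home-${name}-0" \
    -o jsonpath='{.status.capacity.storage}')"

  # Actual usage as seen from inside the running controller pod
  kubectl -n "$ns" exec "${name}-0" -- df -h /var/jenkins_home
}
```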

Use these steps to restore a backup of an existing managed controller to a new (larger) managed controller:

  1. On the controller that needs more space, run the backup job.

  2. Stop the controller in operations center.

  3. Delete the Persistent Volume Claim in Kubernetes:

    $ kubectl delete pvc jenkins-home-<master name>-0

  4. Change the size of the controller’s volume in the configuration.

  5. Start the controller again in operations center.

  6. Create a restore job on the controller to restore from the backup made in step 1.

  7. Run the restore job and watch its console output. When the restore completes, the output contains a link to restart the controller.

  8. After the expanded controller restarts, verify the jobs on it.

Troubleshooting steps for diagnosing an unhealthy Kubernetes node

While CloudBees is not a Kubernetes vendor, our clients often encounter problems in their Kubernetes Cluster that present themselves as issues with CloudBees software.

Follow these steps to help diagnose a Kubernetes node that is not healthy:

  1. Run the following command to check which Kubernetes node each pod is running on, and look for a correlation between the nodes and the pods that are encountering an issue:

    $ kubectl -n cloudbees-core get node,pod -o wide

  2. Run the following commands to get more information about a node or pod that is experiencing problems:

    $ kubectl describe node <name>
    $ kubectl describe pod <name>

  3. Follow the steps in the Kubernetes Monitor Node Health documentation to run the node problem detector.

If you identify a Kubernetes node with an issue, follow the Kubernetes documentation to Safely Drain a Node while Respecting the PodDisruptionBudget.
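
The node checks above can be narrowed down with a few helpers. The following is a minimal sketch with hypothetical function names; spec.nodeName is a standard field selector for pods, and kubectl drain with --ignore-daemonsets is the usual starting point for draining, although your cluster may require additional flags.

```shell
# Sketch only: the helper names are hypothetical.

# List nodes whose STATUS is anything other than plain "Ready"
# (this includes NotReady and cordoned "Ready,SchedulingDisabled" nodes)
list_unhealthy_nodes() {
  kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1, $2 }'
}

# List the pods in all namespaces that are scheduled on a given node
pods_on_node() {
  local node="$1"
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=${node}"
}

# Evict the pods from a node, honoring any PodDisruptionBudget
drain_node() {
  local node="$1"
  kubectl drain "$node" --ignore-daemonsets
}
```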

Creating a support request

You can call on CloudBees to help resolve your problems. To do this, submit a support request through CloudBees Support Portal. In your request, state the problem and the steps to reproduce the problem. Include support bundles and a cluster description as noted below.

To learn how to create a support bundle, refer to Generating a support bundle.

Required data

Cluster and operations center data

# Create required data folder
mkdir cloudbees-core-required-data
cd cloudbees-core-required-data

# Dump cluster info
kubectl cluster-info dump --output-directory=./cluster-state/

# Copy the bundles
kubectl cp cjoc-0:/var/jenkins_home/support/ ./cjoc-support/

# Cluster information
kubectl cluster-info > 000-cluster-info.txt

# Cluster description
kubectl get node,pod,statefulset,svc,endpoints,ingress,pvc,pv,sa,role,rolebinding -n cloudbees-core -o wide > cloudbees-core-cluster-wide.txt
kubectl get node,pod,statefulset,svc,endpoints,ingress,pvc,pv,sa,role,rolebinding -n cloudbees-core -o yaml > cloudbees-core-cluster-wide.yml

Controller data

# Also collect the cluster and operations center data described above

# Copy master1 bundles
kubectl cp master1-0:/var/jenkins_home/support/ ./master1-support/

Connectivity checks

#!/bin/bash
set -x

MM_NAME=test-master
CBCI_NAMESPACE=cloudbees-core

# Look up the pod IPs in the same namespace used by the checks below
CJOC_PODIP=$(kubectl get -n ${CBCI_NAMESPACE} po/cjoc-0 -o jsonpath='{.status.podIP}')
MM_PODIP=$(kubectl get -n ${CBCI_NAMESPACE} po/${MM_NAME}-0 -o jsonpath='{.status.podIP}')

TARGET_DIR="connection_status/${CBCI_NAMESPACE}/${MM_NAME}"
mkdir -p ${TARGET_DIR}
cd ${TARGET_DIR}

# Agent port (50000) checks
kubectl exec -ti -n ${CBCI_NAMESPACE} cjoc-0 -- curl localhost:50000 > 001-cjoc-curl-local-5000.txt
kubectl exec -ti -n ${CBCI_NAMESPACE} ${MM_NAME}-0 -- curl cjoc:50000 > 002-${MM_NAME}-curl-cjoc-5000.txt
kubectl exec -ti -n ${CBCI_NAMESPACE} ${MM_NAME}-0 -- curl ${CJOC_PODIP}:50000 > 003-${MM_NAME}-curl-cjoc-ip-5000.txt

# HTTP checks through the services
kubectl exec -ti -n ${CBCI_NAMESPACE} cjoc-0 -- curl -Iv http://${MM_NAME}.${CBCI_NAMESPACE}.svc.cluster.local/${MM_NAME}/ > 004-cjoc-curl-${MM_NAME}.txt
kubectl exec -ti -n ${CBCI_NAMESPACE} ${MM_NAME}-0 -- curl -Iv http://cjoc/cjoc/ > 005-${MM_NAME}-curl-cjoc.txt

# HTTP checks directly against the pod IPs
kubectl exec -ti -n ${CBCI_NAMESPACE} cjoc-0 -- curl -Iv http://${MM_PODIP}:8080/${MM_NAME}/ > 006-cjoc-curl-${MM_NAME}-ip.txt
kubectl exec -ti -n ${CBCI_NAMESPACE} ${MM_NAME}-0 -- curl -Iv http://${CJOC_PODIP}:8080/cjoc/ > 007-${MM_NAME}-curl-cjoc-ip.txt

Pipeline build data

kubectl cp master1-0:/var/jenkins_home/jobs/<path-to-pipeline-build-folder> ./pipeline-build-data/

In August 2020, the Jenkins project voted to replace the term master with controller. We have taken a pragmatic approach to cleaning these up, ensuring as little downstream impact as possible. CloudBees is committed to ensuring a culture and environment of inclusiveness and acceptance. This includes ensuring the changes are not just cosmetic ones, but pervasive. As this change happens, please note that the term master has been replaced throughout the latest versions of the CloudBees documentation with controller (as in managed controller, client controller, team controller) except when still used in the UI or in code.