How to Troubleshoot and Address Liveness / Readiness probe failure

Symptoms

  • Managed controller is failing, its container is being restarted, and the Managed controller item log shows "Liveness probe failed: HTTP probe failed with statuscode: 503" or "Liveness probe failed: Get https://$POD_IP:8080/$CONTROLLER_NAME/login: dial tcp $POD_IP:8080: connect: connection refused"

  • Managed controller is failing, its container is being restarted, and the Managed controller item log shows "Readiness probe failed: HTTP probe failed with statuscode: 503" or "Readiness probe failed: Get https://$POD_IP:8080/$CONTROLLER_NAME/login: dial tcp $POD_IP:8080: connect: connection refused"

  • Managed controller takes a long time to start and eventually fails due to the Liveness or Readiness probe
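To confirm which of these symptoms applies, it can help to pull only the probe failure messages out of the pod description or the event list. The following is a minimal sketch: the helper name filter_probe_failures and the pod name my-controller-0 are assumptions for illustration, not names taken from the product.

```shell
# Hypothetical helper: keep only probe-failure lines from text read on stdin.
# "|| true" keeps the exit status at 0 when no failure line is found.
filter_probe_failures() {
  grep -E '(Liveness|Readiness) probe failed' || true
}

# Possible usage against a live cluster (the pod name is an assumption):
#   kubectl describe pod my-controller-0 | filter_probe_failures
#   kubectl get events --field-selector involvedObject.name=my-controller-0 | filter_probe_failures
```

If the filter prints nothing, the restart was probably not triggered by a probe and another cause should be investigated.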

Diagnostic/Treatment

Preconditions

Liveness / Readiness probe failures occur when Jenkins does not respond to a health check, which is currently an HTTP request to https://$POD_IP:8080/$CONTROLLER_NAME/login. These failures happen when Jenkins suffers from performance issues and is unresponsive for too long. In most cases, this happens on startup.
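The health check can be approximated by hand to see what the probe sees. The sketch below assumes curl is available from somewhere that can reach the pod; the helper names probe_verdict and check_login are hypothetical. It relies on the Kubernetes rule that an HTTP probe succeeds for any status code from 200 to 399.

```shell
# Hypothetical helper mirroring how Kubernetes judges an HTTP probe response:
# a status code from 200 to 399 counts as success, anything else as failure.
probe_verdict() {
  code="$1"
  if [ "$code" -ge 200 ] 2>/dev/null && [ "$code" -lt 400 ]; then
    echo success
  else
    echo failure
  fi
}

# Hypothetical helper: request the login page and print the verdict.
# curl reports 000 as the status code when the connection is refused or
# the request times out. -k skips certificate verification, as the
# kubelet does for HTTPS probes.
check_login() {
  url="$1"       # the probe URL, for example the one in the failure message
  timeout="$2"   # maximum time to wait, in seconds
  code=$(curl -k -s -o /dev/null -w '%{http_code}' --max-time "$timeout" "$url")
  probe_verdict "$code"
}

# Possible usage:
#   check_login "https://$POD_IP:8080/$CONTROLLER_NAME/login" 10
```

A 503 response and a refused connection both yield "failure", matching the two message variants listed under Symptoms.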

Before troubleshooting any further, we recommend going through the following recommendations, which address common causes.

Review Resource Requirements

In a containerized environment, it is important that Jenkins gets the resources it needs:

  • Ensure that appropriate container Memory and CPUs are given to the controller (see the "Jenkins controller Memory in MB" and "Jenkins controller CPUs" fields of the Managed controller configuration)

Review Startup Performance Preconditions

Workarounds

Liveness / Readiness probe failures suggest performance issues or a slow startup. A quick workaround for this kind of issue is to update the probes to give Jenkins more slack to start or to respond. Which probe setting to adjust depends on the nature of the problem: is the probe failing on startup or while Jenkins is running?

A Probe fails on Startup

If a probe fails while a Managed controller is starting, a quick workaround is to give Jenkins more time to start. Note that the Liveness probe failure is the one that causes the restart: when it fails, the container is restarted.

Increase the Initial Delay of the Liveness Probe

To increase the Liveness probe initial delay, configure the Managed controller item and update the value of "Health Check Initial Delay". By default, it is set to 600 (10 minutes). You may increase it to, for example, 1800 (30 minutes).

Increase the Failure Threshold of the Readiness Probe

To increase the Readiness probe failure threshold, configure the Managed controller item and update the value of "Readiness Failure Threshold". By default, it is set to 100 (100 times). You may increase it to, for example, 300.
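To judge how much extra time a higher failure threshold buys, multiply the threshold by the interval between probes. This article does not state the probe interval, so the sketch below uses 10 seconds, the Kubernetes default for periodSeconds, purely as an assumption; the value actually configured on the statefulset should be used instead. The helper name failure_window_seconds is hypothetical.

```shell
# Rough time window, in seconds, covered by a run of consecutive probe failures.
failure_window_seconds() {
  threshold="$1"   # failure threshold (number of failed probes tolerated)
  period="$2"      # seconds between two probes
  echo $(( threshold * period ))
}

# A period of 10 seconds is an assumption (the Kubernetes default).
failure_window_seconds 100 10   # default threshold, prints 1000
failure_window_seconds 300 10   # raised threshold, prints 3000
```

Under that assumption, raising the threshold from 100 to 300 extends the tolerated window from roughly 17 minutes to 50 minutes.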

A Probe fails while Jenkins is running

If a probe fails while a Managed controller is running, it is a cause for concern, as it suggests that the controller was unresponsive for minutes. In such cases, increasing the probe timeouts can help keep the unresponsive controller up longer so that data can be collected.

Increase the Timeout of the Liveness Probe

To increase the Liveness probe timeout, configure the Managed controller item and update the value of "Health Check Timeout". By default, it is set to 10 (10 seconds). You may increase it to, for example, 30 (30 seconds).

Increase the Timeout of the Readiness Probe

To increase the Readiness probe timeout, configure the Managed controller item and update the value of "Readiness Timeout". By default, it is set to 5 (5 seconds). You may increase it to, for example, 30 (30 seconds).
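Before picking a new timeout, it may be useful to measure how long the login page actually takes to answer and compare that with the configured value. This is a minimal sketch assuming curl and awk are available; the helper name probe_would_time_out is hypothetical.

```shell
# Prints "yes" when a measured response time (seconds, may be fractional)
# is strictly greater than the probe timeout, otherwise "no".
probe_would_time_out() {
  awk -v t="$1" -v limit="$2" \
    'BEGIN { if (t + 0 > limit + 0) print "yes"; else print "no" }'
}

# Possible usage (URL taken from the probe failure message):
#   t=$(curl -k -s -o /dev/null -w '%{time_total}' --max-time 60 \
#        "https://$POD_IP:8080/$CONTROLLER_NAME/login")
#   probe_would_time_out "$t" 10   # compare with the Liveness timeout
#   probe_would_time_out "$t" 5    # compare with the Readiness timeout
```

Repeating the measurement several times while the controller is under load gives a better picture than a single request.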

Data Collection

Although updating the probe configuration can help to get the controller started, it is important to troubleshoot the root cause of the problem, which is usually related to performance.

Failure on startup

If a probe fails while the Managed controller is starting:

Failure while Jenkins is running

If a probe fails while the Managed controller is running:

Modifying a liveness/readiness probe on a running instance

If you’d like to modify the values for the liveness or readiness probes, you can either:

1) In the Operations center, click the gear icon for the specific Managed controller and change the values on the Configure page:

[Screenshot: liveness UI config]

2) Edit the statefulset definition for the pod you would like to change directly, by running one of the following commands.

For the Operations center:

kubectl edit statefulset cjoc

Or, for a specific controller:

kubectl edit statefulset my-controller

You can then edit the relevant values in the editor that opens:

[Screenshot: liveness probe]

After you save those changes, Kubernetes restarts the pod automatically and the new values are applied. Note: this method only changes the values until the next Helm modification of the statefulset or the next modification made by the CloudBees CI product, so the kubectl edit method should be relied upon only for temporary diagnostic purposes.
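To check which probe values are currently in effect, before or after an edit, the statefulset can be queried with a jsonpath expression. The sketch below assumes that the Jenkins container is the first container in the pod template; the helper names probe_jsonpath and show_probe are hypothetical.

```shell
# Builds the jsonpath expression selecting one probe of the first container
# in a statefulset's pod template (first container is an assumption).
probe_jsonpath() {
  kind="$1"   # "liveness" or "readiness"
  printf '{.spec.template.spec.containers[0].%sProbe}\n' "$kind"
}

# Hypothetical wrapper: print the probe configured on a statefulset.
show_probe() {
  kubectl get statefulset "$1" -o jsonpath="$(probe_jsonpath "$2")"
}

# Possible usage:
#   show_probe my-controller liveness
#   show_probe cjoc readiness
```

Comparing the output before and after a Helm upgrade or a reconfiguration from the Operations center shows whether a manual edit has been overwritten.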