CloudBees Performance Decision Tree for troubleshooting


This performance troubleshooting decision tree is intended to help you clarify and isolate the root causes for common performance problems with Jenkins-based products, including CloudBees CI, CloudBees Jenkins Platform, and Jenkins LTS.

Is your production Jenkins instance down?

Is your production Jenkins instance unavailable or unusable because of performance issues?

Is application navigation slow or unresponsive?

Is your production Jenkins instance available and usable, but the navigation is very slow or unresponsive?

Are you experiencing high CPU or memory consumption?

Is your production Jenkins instance available and usable, but you’re seeing high CPU or memory consumption?

Are jobs executing very slowly?

Is your production Jenkins instance available and usable, but jobs are executing very slowly or not at all?

Are you experiencing an issue with a particular job?

Is your production Jenkins instance available and usable, but you’re experiencing issues with a specific job?

My problem isn’t listed here

If you’re experiencing a problem not already listed here, submit a Support Request explaining the situation in as much detail as possible. Note that you may be asked to provide additional data for analysis.

Subsystem performance troubleshooting

This section provides additional troubleshooting guidance for specific instance subsystems.

Reviewing the Java version and JVM arguments

Are you using best practices, including using the G1 garbage collector?

If you’re seeing issues with Java and/or the JVM, review JVM troubleshooting, and check to make sure that your system is using the G1 garbage collector.
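
For reference, G1 is enabled with a JVM argument wherever your installation sets Java options. The line below is an illustration only, not a tuned recommendation for your instance; the variable name and pause-time goal depend on how Jenkins is installed and must be adjusted to your environment.

# Illustration only: enable the G1 garbage collector and set a pause-time goal
JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC -XX:MaxGCPauseMillis=200"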

Is the JDK a minimum of 1.8.0_212?

Check your system to make sure that the JDK in use is version 1.8.0_212 or later. JDKs earlier than 1.8.0_212 are not supported.
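
To confirm the version, you can check both the java binary on the PATH and the JVM that is actually running Jenkins; in the sketch below, ${PID} is a placeholder for the Jenkins process ID.

# Version of the java binary on the PATH
java -version

# Version of the JVM running Jenkins
jcmd ${PID} VM.version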

The supported JDKs for each product are listed in the Supported Platforms pages.

Analyzing thread dumps

Thread dumps are found in the Support Bundle archive as thread-dump.txt files under the /nodes/ directory in the folder for each node. For example, if you have a node called controller, you can find thread dumps for that node under /nodes/controller.
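
If you have already extracted a Support Bundle locally, you can list every thread dump it contains from the bundle’s root directory, for example:

# List all thread dump files in an extracted Support Bundle
find . -name 'thread-dump.txt'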

Once you’ve collected thread dumps for your instance, use the free fastthread.io resource to analyze them:

  1. In a browser window, navigate to https://fastthread.io.

  2. In the Upload thread dump section, click Choose file to select the thread dump file.

  3. Click Analyze to upload the thread dumps for analysis.

After uploading and analysis, you will be presented with a report that details findings from the thread dump analysis.

Is the thread count near 700?

Contact CloudBees Support for additional troubleshooting guidance.

Are there a lot of blocked threads?

Contact CloudBees Support for additional troubleshooting guidance.

Are there a lot of threads with the same stack trace?

Contact CloudBees Support for additional troubleshooting guidance.
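
Before contacting Support, you can get a rough reading on all three of these signals with standard text tools. The commands below are a sketch that assumes the usual HotSpot thread dump format, in which each thread entry contains a java.lang.Thread.State line:

# Approximate total thread count
grep -c 'java.lang.Thread.State:' thread-dump.txt

# Number of blocked threads
grep -c 'java.lang.Thread.State: BLOCKED' thread-dump.txt

# Most common top stack frames, to spot many threads stuck in the same code path
grep -A 1 'java.lang.Thread.State:' thread-dump.txt | grep '^[[:space:]]*at ' | sort | uniq -c | sort -rn | head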

Analyzing garbage collection logs

Garbage collection logs can be captured as part of a Support Bundle. To capture garbage collection logs:

  1. Click on Support.

  2. Select Garbage Collection Logs.

  3. Click Generate Bundle.
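
Note that the JVM only produces garbage collection logs if GC logging has been enabled through JVM arguments. The flags below illustrate the two common syntaxes; the log file path is a placeholder to adjust for your installation:

# JDK 8 syntax
-Xloggc:/var/log/jenkins/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps

# JDK 11 and later (unified logging) syntax
-Xlog:gc*:file=/var/log/jenkins/gc.log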

Once you have captured garbage collection logs, use the free gceasy.io resource to analyze those logs:

  1. In a browser window, navigate to https://gceasy.io.

  2. In the Upload GC Log File section, click Choose file to select the garbage collection logs.

  3. Click Analyze to upload the logs for analysis.

After uploading and analysis, you will be presented with a report that details any garbage collection issues discovered.

Is application throughput less than 90%?

Application throughput shows how much time the application is NOT spending on garbage collection. If your throughput is 95%, then 5% of the time is spent on garbage collection, which brings your system to a halt while it runs. A healthy application spends less than 1% of its time on garbage collection. If your application spends more than 1% of its time on garbage collection, it may be incorrectly sized or tuned.
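
As a concrete example of the arithmetic, throughput is the fraction of wall-clock time not spent in garbage collection pauses; with hypothetical figures of 90 seconds of total pause time over a one-hour window:

# 90 s of GC pauses in a 3,600 s window => (1 - 90/3600) * 100 = 97.5% throughput
echo 'scale=4; (1 - 90/3600) * 100' | bc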

Are there garbage collection pauses of greater than 1 second?

Garbage collection halts your system, meaning nothing else can be done with Jenkins while garbage collection is happening. Pauses greater than 1 second are noticeable to end users, so 1 second is a general threshold to stay under. The most common reason for exceeding it is a JVM heap size over 16 GB. When you find that your instance needs more than 16 GB of heap, it is time to scale horizontally.
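
If your garbage collection logs were captured with the JDK 8 -XX:+PrintGCDetails format, one rough way to spot pauses over 1 second is to filter on the wall-clock ("real") time reported for each collection; the exact log format depends on your JDK version and logging flags:

# Print every reported pause whose wall-clock time exceeds 1 second
grep -o 'real=[0-9.]* secs' gc.log | awk -F'[= ]' '$2 > 1.0'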

Do the heap graphs indicate a memory leak?

You can use heap usage graphs to help identify memory leaks. If your heap usage graph does not show a healthy sawtooth pattern, contact CloudBees Support for additional troubleshooting guidance.

Is metaspace usage increasing?

Metaspace shouldn’t be above 1 GB. If metaspace exceeds 1 GB, it is possible you have a class leak. You can use the commands below to further assess your metaspace usage.

# Summarize heap and metaspace capacities and usage for the Jenkins JVM, where ${PID} is the Jenkins process ID
jstat -gc ${PID}

# Write per-class metadata statistics to a file for further analysis
jcmd ${PID} GC.class_stats > GC_class_output.log

Contact CloudBees Support for additional troubleshooting guidance.

Reviewing testio.sh output

The testio.sh script tests input/output throughput for your instance. For usage instructions, see the Knowledge Base article Required Data: IO issues on Linux.
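
If you cannot run testio.sh right away, a very rough manual sanity check of sequential write throughput on the Jenkins home filesystem can be done with dd. This is only a coarse approximation, not a substitute for the script, and the path and sizes below are placeholders:

# Write 1 GiB directly to the Jenkins home filesystem, bypassing the page cache, and report throughput
dd if=/dev/zero of="$JENKINS_HOME/testio-scratch.tmp" bs=1M count=1024 oflag=direct
rm -f "$JENKINS_HOME/testio-scratch.tmp"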

If the testio.sh output indicates issues, you may be using an outdated or unsupported version of NFS. For guidance on using NFS, see the NFS Guide.