Symptoms
I’ve run the script collectPerformanceData.sh and shared the data with CloudBees Support. What is this and how does it help us understand what’s causing my controller’s CPU to spike?
Diagnosis/Treatment
The script collectPerformanceData.sh uses seven common tools (top, vmstat, netstat, nfsstat, nfsiostat, iostat, and jstack) to capture diagnostic information.
Run with the default values of a 60 second duration and a 5 second interval, each tool is invoked 12 times, evenly spread over one minute. A single data point would not provide a complete picture of what is causing the CPU to spike, so increasing the duration of the test to 5 minutes or more is often desirable.
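For example, to capture 5 minutes of data instead of the default one minute, you might invoke the script along these lines. This is a sketch only: the argument order shown here (process id, duration in seconds, interval in seconds) is an assumption, so check the script's own usage output before running it.

./collectPerformanceData.sh 4323 300 5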
If you extract the bundle that you shared with CloudBees Support, you will find a directory for the output of each tool.
But how exactly can you use this?
One example of how this data can be correlated to gain a better understanding of what is causing high CPU values is with top and jstack.
First, let’s investigate the topOutput folder. Each file in this folder captures the output of running the top command. The %CPU value is the ninth column in each file.
PID  USER     PR NI VIRT  RES SHR S %CPU  %MEM TIME+    COMMAND
4323 jenkins+ 20  0 17.9g 15g 17m S 335.3 64.0 36:35.87 /etc/alternatives/java -D...
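Since %CPU is the ninth column, awk can extract it from every sample at once. A minimal sketch, assuming the example process id 4323 and that the command is run from inside the topOutput folder:

awk '$1 == 4323 {print FILENAME, $9}' *.txt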
Let’s find when the controller’s CPU peaked during our test. To do this, we must filter out all other processes so we can compare the Jenkins java process over the course of our test. We know the process id because we used it when we invoked the script. Combining the process id and the process owner (typically jenkins), we can use grep to see only the processes we are interested in:
grep -H "4323 jenkins" *.txt
Or, to make comparison easier, we can strip the arguments from the java command:
grep "4323 jenkins" *.txt | sed -e 's/java .*//' topOutput.20180925112257.txt: 4323 jenkins+ 20 0 17.9g 15g 17m S 98.4 64.0 36:33.79 /etc/alternatives/ topOutput.20180925112318.txt: 4323 jenkins+ 20 0 17.9g 15g 17m S 240.5 64.0 36:34.31 /etc/alternatives/ topOutput.20180925112339.txt: 4323 jenkins+ 20 0 17.9g 15g 17m S 96.5 64.0 36:34.84 /etc/alternatives/ topOutput.20180925112359.txt: 4323 jenkins+ 20 0 17.9g 15g 17m S 335.3 64.0 36:35.35 /etc/alternatives/ topOutput.20180925112420.txt: 4323 jenkins+ 20 0 17.9g 15g 17m S 110.4 64.0 36:35.87 /etc/alternatives/
Taking the above output as an example, we see the %CPU spiked to its highest point in the collection timestamped 20180925112359. So what was happening at that time?
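Rather than scanning the output by eye, sort can find the peak sample for you. A small sketch, assuming the layout shown above, where %CPU becomes the tenth field once grep prepends the file name:

grep "4323 jenkins" *.txt | sort -k10 -rn | head -1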
The next step is to look at the threads that were running at that time. Inside the topdashHOutput folder, open the topdashHOutput file with the same timestamp in its name. The naming convention is "topdashHOutput.PID.timestamp.txt"; in our example, this is topdashHOutput.4323.20180925112359.txt. The topdashHOutput files list the java process’s threads ordered by CPU usage, highest first.
PID   USER     PR NI VIRT  RES SHR S %CPU %MEM TIME+     COMMAND
20379 jenkins+ 20  0 17.9g 15g 17m R 89.2 64.0 248:38.86 java
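Because the threads are listed hottest first, the first jenkins line in the file is the thread we care about. A sketch to pull out its PID and %CPU automatically, assuming that ordering holds:

awk '$2 ~ /jenkins/ {print $1, $9; exit}' topdashHOutput.4323.20180925112359.txt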
At this point, we can correlate the thread with PID 20379 to the data collected from jstack. Open the file jstack.4323.20180925112359.txt located in the jstack folder.
To make this correlation, we need to convert the PID value 20379 to hex, since jstack records each thread’s native thread id as a hexadecimal 'nid'. We can use printf "%x\n" 20379 to do this, which gives us 4f98. We then prepend '0x' so that we have 0x4f98. Now we use grep to find the thread whose 'nid' value is 0x4f98 in the jstack file.
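Both steps can be combined into a single command, using the file name from our example:

grep "nid=$(printf '0x%x' 20379)" jstack.4323.20180925112359.txt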
"Gang worker#0 (Parallel GC Threads)" os_prio=0 tid=0x00007f2558caa800 nid=0x4f98 runnable
So, in this example, we see that the CPU was spiking due to garbage collection! And that’s how easy it is to get a quick insight into what was contributing to high CPU levels.
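If you need to repeat this correlation across many samples, the whole procedure can be scripted. The following is a minimal sketch, not part of collectPerformanceData.sh itself; it assumes the directory layout and file-naming conventions described above, and that the hottest thread is the first jenkins line in each top -H sample:

#!/bin/bash
# For each top -H sample, map the hottest Jenkins thread to its jstack entry.
PID=4323
for f in topdashHOutput/topdashHOutput.${PID}.*.txt; do
  ts=${f##*.${PID}.}                                  # strip path and prefix,
  ts=${ts%.txt}                                       # leaving the timestamp
  tid=$(awk '$2 ~ /jenkins/ {print $1; exit}' "$f")   # hottest thread's PID
  [ -n "$tid" ] || continue
  nid=$(printf '0x%x' "$tid")                         # jstack reports hex nids
  echo "== ${ts}: thread ${tid} (${nid}) =="
  grep "nid=${nid} " "jstack/jstack.${PID}.${ts}.txt"
done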