Precondition: The agent was working properly before the issue
The build has failed because the connection between the controller and the agent was broken
The build is stalled in the queue waiting for the agent
The agent is disconnected and cannot connect again
A "Channel is broken" warning appears in the logs
Any of the exceptions listed below
Before trying anything else, open a manual SSH connection from the Jenkins server shell to the agent to rule out basic connectivity problems.
You can use the following command, which keeps the session open and prints a message every hour, so you can see whether the connection drops.
ssh user@agent_ip "bash -c 'while true ; do export sleeptime=$(( 60*60 )) ; echo "Hey, time is '$(date)'. Now sleeping for \$sleeptime seconds" ; sleep \$sleeptime ; done'"
Before starting to diagnose the issue, the following items need to be provided:
A support bundle, ideally generated when the agent is connected. Make sure to tick the Agent Configuration File, Agent Log Recorders and Controller Log Recorders checkboxes.
The Agent configuration file (config.xml). The file can be recovered either from $JENKINS_HOME/nodes/[agent_name]/config.xml or from a browser at
The Agent logs. If you managed to capture a support bundle while the agent was online, you can skip this step. Otherwise, follow these additional steps:
If you are using bash, add the following as Suffix Start Agent Command (note the leading space before the 2):
 2> >(tee -a $( date '+test.%H-%M-%S.txt' ) 1>&2)
If you are using ksh
Prefix Start Agent Command:
bash -c "
Suffix Start Agent Command (note the leading space before the 2):
 2> >(tee -a agent-stderr.log 1>&2)"
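To see what this prefix and suffix do before touching the agent configuration, the following sketch runs the same redirection around a stand-in command. The stand-in (two echo calls) and the temporary log file are assumptions made for illustration; the real start agent command and log file name will differ. It assumes bash is installed on the machine.

```shell
# Temporary file standing in for agent-stderr.log (illustrative only).
log_file=$(mktemp)

# Same shape as the ksh prefix/suffix: bash -c "<command> 2> >(tee -a <file> 1>&2)".
# The group { ...; } is a hypothetical stand-in for the start agent command.
bash -c "{ echo 'to stdout'; echo 'to stderr' >&2; } 2> >(tee -a '$log_file' 1>&2)"

# tee runs asynchronously, so give it a moment before reading the file.
sleep 1

# Only the stderr line was copied to the file; stdout was left untouched.
cat "$log_file"
```

The stderr stream still reaches the caller because tee writes its copy back to stderr, so the launcher sees the same output as before while the file keeps a record.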
If you are using the CloudBees SSH Build Agents plugin, you also need to enable the Log environment on initial connect option in the agent configuration page. Make sure to disable it once you have finished resolving the issue, as the option can impact agent startup performance.
When the issue happens, collect the 3 items as well as the build console log output of an impacted job (if applicable).
About the Ping Thread
See the Ping Thread documentation.
The PingThread checks that the agent is able to execute a command from the controller (a NOOP request).
The ping command may fail to execute because of:
An overloaded queue, where all agent workers are busy → On big boxes you can increase the number of remoting TaskPool workers
In some cases, disabling the PingThread can help
So, if this is the stack trace you are seeing all the time, you should disable the PingThread. The side effect is that the agent will hang if communication between the controller and the agent is really failing, but this is useful: you can then use jstack to take a thread dump on both sides, the controller and the agent itself.
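Taking the thread dumps mentioned above could look roughly like the sketch below. The function name take_thread_dump, the JSTACK override variable, and the output file naming are all assumptions made for this example; only jstack and its -l option come from the JDK.

```shell
# Sketch: write a thread dump of a running JVM to a timestamped file.
# JSTACK may point at a specific jstack binary; it defaults to the one on PATH.
take_thread_dump() {
    pid=$1      # process ID of the controller or agent JVM
    label=$2    # free-form tag used in the file name, e.g. "controller" or "agent"
    dump_file="threaddump-${label}-$(date '+%Y%m%d-%H%M%S').txt"
    # -l asks jstack for the long listing, which includes lock information.
    "${JSTACK:-jstack}" -l "$pid" > "$dump_file" && echo "$dump_file"
}

# Example (run as the user that owns the JVM, once on each side):
#   take_thread_dump 12345 controller
```

Taking several dumps a few seconds apart on each side usually makes it easier to tell a stuck thread from a busy one.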
Here is an example of a stack trace caused by a PingThread failure:
Caused by: java.io.IOException
    at hudson.remoting.Channel.close(Channel.java:1163)
    at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:118)
    at hudson.remoting.PingThread.ping(PingThread.java:126)
    at hudson.remoting.PingThread.run(PingThread.java:85)
Caused by: java.util.concurrent.TimeoutException: Ping started at 1474633728617 hasn't completed by 1474633968617
    ... 2 more
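The two numbers in the TimeoutException message are timestamps in milliseconds, so subtracting them shows how long the ping waited before the channel was closed. A small worked example using the values from the trace above (the variable names are illustrative):

```shell
# Timestamps copied from the TimeoutException message, in milliseconds.
started=1474633728617
deadline=1474633968617

# Time the ping was allowed to run before it was declared dead.
elapsed_ms=$(( $deadline - $started ))
echo "Ping waited $(( $elapsed_ms / 1000 )) seconds"
# prints "Ping waited 240 seconds"
```

A wait of 240 seconds means the agent did not answer the ping for four minutes, which helps when correlating the failure with load or network events in other logs.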
Try to use the other agent implementation (move from SSH Build Agents plugin to CloudBees SSH Build Agents plugin or vice versa).
TCP retransmission timeout
Linux:
sysctl -w net.ipv4.tcp_keepalive_time=120
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=8
sysctl -w net.ipv4.tcp_fin_timeout=30
Windows:
KeepAliveInterval = 30
KeepAliveTime = 120
TcpMaxDataRetransmissions = 8
TcpTimedWaitDelay = 30
macOS:
net.inet.tcp.keepidle=120000
net.inet.tcp.keepintvl=30000
net.inet.tcp.keepcnt=8
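As a rough check on what the Linux values above imply, the sketch below adds them up: an idle connection whose peer has silently disappeared is reported dead after tcp_keepalive_time plus tcp_keepalive_intvl multiplied by tcp_keepalive_probes seconds. The variable names are illustrative, and the calculation assumes keepalive is enabled on the socket.

```shell
# Values from the sysctl settings above, in seconds (probes is a count).
keepalive_time=120
keepalive_intvl=30
keepalive_probes=8

# Idle time before the first probe, plus the time spent on unanswered probes.
detect_after=$(( $keepalive_time + $keepalive_intvl * $keepalive_probes ))
echo "Dead peer detected after about $detect_after seconds"
# prints "Dead peer detected after about 360 seconds"
```

With these settings a broken connection is noticed after about six minutes of silence.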