Dedicated SSH agent gets disconnected

Issue

Precondition: The agent was working properly before the issue

The build failed because the connection between the controller and the agent was broken
The build is stalled in the queue, waiting for the agent
The agent is disconnected and cannot reconnect
A "Channel is broken" warning appears in the logs
Any of the exceptions listed below appears in the logs

Environment

Resolution

Before trying anything else, open a manual SSH connection from the Jenkins server shell to the agent, in order to rule out basic connectivity problems.

You can use the following command, which keeps the connection open and prints a message periodically (once per hour) so that you can see whether the connection survives.

ssh user@agent_ip 'bash -c "while true; do echo \"Hey, time is \$(date). Now sleeping for 3600 seconds\"; sleep 3600; done"'

Before starting to diagnose the issue, the following items need to be provided:

A support bundle, ideally generated when the agent is connected. Make sure to tick the Agent Configuration File, Agent Log Recorders and Controller Log Recorders checkboxes.
The Agent configuration file (config.xml). The file can be recovered either from $JENKINS_HOME/nodes/[agent_name]/config.xml or from a browser at http(s)://JENKINS_SERVER/computer/[agent_name]/config.xml
The Agent logs. If you managed to capture a support bundle while the agent was online, you can skip this item. Otherwise, follow these additional steps:
- If you are using bash, set Suffix Start Agent Command to:  2> >(tee -a $( date '+test.%H-%M-%S.txt' ) 1>&2) Note that there must be a space before the 2.
- If you are using ksh:
  - Prefix Start Agent Command: bash -c "
  - Suffix Start Agent Command:  2> >(tee -a agent-stderr.log 1>&2)" (you need to add a space at the beginning)
- If you are using the CloudBees SSH Build Agents plugin, you also need to enable the "Log environment on initial connect" option on the agent configuration page. Make sure to disable it once you have finished resolving the issue, as the option can impact agent startup performance.

When the issue happens, collect the three items listed above, as well as the build console log output of an impacted job (if applicable).
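
Retrieving the agent configuration file from the controller's file system can be sketched as a small shell helper. This is only an illustration, assuming the $JENKINS_HOME/nodes/[agent_name]/config.xml layout described above; the function name collect_agent_config, its arguments, and the output directory are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: copy an agent's config.xml out of JENKINS_HOME so it can be
# attached to a support case.
# collect_agent_config and its three arguments are hypothetical names.
collect_agent_config() {
  local jenkins_home="$1"
  local agent_name="$2"
  local out_dir="$3"
  local src="${jenkins_home}/nodes/${agent_name}/config.xml"

  # Fail early if the agent has no configuration file at the expected path.
  if [ ! -f "$src" ]; then
    echo "config.xml not found for agent '${agent_name}' at ${src}" >&2
    return 1
  fi

  mkdir -p "$out_dir" || return 1
  cp "$src" "${out_dir}/${agent_name}-config.xml" || return 1

  # Print the path of the copy so the caller can reference it.
  echo "${out_dir}/${agent_name}-config.xml"
}
```

For example, collect_agent_config "$JENKINS_HOME" my-agent /tmp/agent-diagnostics would copy the file for an agent named my-agent (a placeholder name) and print the path of the copy.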

About the Ping Thread

See the Ping Thread documentation.

The PingThread checks that the agent is able to execute a command sent from the controller (a NOOP request).

The ping command may fail to execute for the following reasons:

Overloaded queue: all agent workers are busy. On large machines, you can increase the number of remoting TaskPool workers.
Overloaded network

In some cases, disabling the PingThread can help.

If the stack trace shown below is the one you see consistently, you should disable the PingThread. The side effect is that the agent is expected to hang if communication between the controller and the agent is really failing. This is actually useful, because you can then use jstack to take a thread dump on both sides: the controller and the agent itself.

Here is an example of a stack trace caused by a PingThread failure:

Caused by: java.io.IOException
    at hudson.remoting.Channel.close(Channel.java:1163)
    at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:118)
    at hudson.remoting.PingThread.ping(PingThread.java:126)
    at hudson.remoting.PingThread.run(PingThread.java:85)
Caused by: java.util.concurrent.TimeoutException: Ping started at 1474633728617 hasn't completed by 1474633968617
    ... 2 more
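
The two numbers in the TimeoutException message are timestamps in epoch milliseconds, so their difference shows how long the ping waited before the channel was closed. In the example above that is 1474633968617 - 1474633728617 = 240000 ms, i.e. 240 seconds. The following is a minimal sketch for pulling that figure out of a log line; the helper name ping_wait_seconds is hypothetical, and it assumes the message format shown in the stack trace above.

```shell
#!/usr/bin/env bash
# Sketch: extract the two epoch-millisecond timestamps from a PingThread
# TimeoutException line and print how long the ping waited, in seconds.
# ping_wait_seconds is a hypothetical helper name.
ping_wait_seconds() {
  local line="$1"
  local started deadline

  # First timestamp: when the ping was started.
  started=$(printf '%s\n' "$line" | sed -n 's/.*Ping started at \([0-9][0-9]*\) .*/\1/p')
  # Second timestamp: the deadline by which the ping had not completed.
  deadline=$(printf '%s\n' "$line" | sed -n 's/.*completed by \([0-9][0-9]*\).*/\1/p')

  if [ -z "$started" ] || [ -z "$deadline" ]; then
    echo "no PingThread timeout found in input" >&2
    return 1
  fi

  # Convert the millisecond difference to whole seconds.
  echo $(( (deadline - started) / 1000 ))
}
```

Passing the "Caused by: java.util.concurrent.TimeoutException" line from the example above as the single argument prints 240.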

Workaround

Try to use the other agent implementation (move from SSH Build Agents plugin to CloudBees SSH Build Agents plugin or vice versa).

References

Remoting issue

TCP retransmission timeout

Linux

Using Keep Alive

sysctl -w net.ipv4.tcp_keepalive_time=120
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=8
sysctl -w net.ipv4.tcp_fin_timeout=30
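
To see what these keepalive values mean in practice: an idle connection is first probed after tcp_keepalive_time seconds, then every tcp_keepalive_intvl seconds, and it is dropped once tcp_keepalive_probes probes go unanswered. The sketch below computes that worst-case detection time; the helper name keepalive_detection_seconds is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: worst-case number of seconds before an idle, unresponsive TCP peer
# is declared dead, given the three Linux keepalive settings.
# keepalive_detection_seconds is a hypothetical helper name.
keepalive_detection_seconds() {
  local idle="$1"       # net.ipv4.tcp_keepalive_time
  local interval="$2"   # net.ipv4.tcp_keepalive_intvl
  local probes="$3"     # net.ipv4.tcp_keepalive_probes

  # Idle time before the first probe, plus one interval per unanswered probe.
  echo $(( idle + interval * probes ))
}

# Values from the settings above: 120 + 30 * 8
keepalive_detection_seconds 120 30 8   # prints 360
```

With the values above, a dead peer is detected after at most 360 seconds (6 minutes). On a Linux host, the current values can be read with sysctl -n (for example, sysctl -n net.ipv4.tcp_keepalive_time) and passed to the same function.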

Windows

Things that you may want to know about TCP Keepalives
Avoiding TCP/IP Port Exhaustion

KeepAliveInterval = 30
KeepAliveTime = 120
TcpMaxDataRetransmissions = 8
TcpTimedWaitDelay = 30

Mac

Using TCP keepalive to Detect Network Errors

net.inet.tcp.keepidle=120000
net.inet.tcp.keepintvl=30000
net.inet.tcp.keepcnt=8