Dedicated SSH agent gets disconnected

Issue

Precondition: The agent was working properly before the issue

  • The build has failed because the connection between the controller and the agent got broken

  • The build is stalled in the queue waiting for the agent

  • The agent is disconnected and cannot connect again

  • A "Channel is broken" warning appears in the logs

  • Any of the exceptions listed below

Resolution

Before trying anything else, open a manual SSH connection from the Jenkins server shell to the agent to rule out basic connectivity problems.

You can use the following command, which keeps the connection open and prints a message every hour so that you can see whether the connection survives.

ssh user@agent_ip "bash -c 'while true ; do export sleeptime=$(( 60*60 )) ; echo "Hey, time is '$(date)'. Now sleeping for \$sleeptime seconds" ; sleep \$sleeptime ; done'"
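For a quick one-off check before starting the long-running loop, a non-interactive probe can be sketched as follows. The function name check_agent_ssh and the 10-second timeout are illustrative assumptions, not part of the original procedure; BatchMode and ConnectTimeout are standard OpenSSH client options.

```shell
# Hypothetical helper: returns 0 when a non-interactive SSH login to the agent succeeds.
# BatchMode=yes makes ssh fail instead of prompting for a password or passphrase.
# ConnectTimeout=10 gives up after 10 seconds (an assumed value, adjust as needed).
check_agent_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=10 "$1" true
}

# Example (run from the Jenkins server shell, as the user that runs Jenkins):
#   check_agent_ssh user@agent_ip && echo "SSH OK" || echo "SSH failed"
```

If this probe fails, the problem is likely at the SSH or network level rather than in Jenkins itself.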

Before starting to diagnose the issue, the following items need to be provided:

  1. A support bundle, ideally generated when the agent is connected. Make sure to tick the Agent Configuration File, Agent Log Recorders and Controller Log Recorders checkboxes.

  2. The Agent configuration file (config.xml). The file can be recovered either from $JENKINS_HOME/nodes/[agent_name]/config.xml or from a browser at http(s)://JENKINS_SERVER/computer/[agent_name]/config.xml

  3. The Agent logs. If you managed to capture a support bundle while the agent was online, you can skip this step. Otherwise, follow these additional steps:

    • If you are using bash, set Suffix Start Agent Command to:  2> >(tee -a $( date '+test.%H-%M-%S.txt' ) 1>&2) Note that the value starts with a space before the 2.

    • If you are using ksh:

      • Prefix Start Agent Command: bash -c "

      • Suffix Start Agent Command:  2> >(tee -a agent-stderr.log 1>&2)" (you need to add a space at the beginning)

    • If you are using the CloudBees SSH Build Agents plugin, you also need to enable the "Log environment on initial connect" option in the agent configuration page. Make sure to disable it once you have finished resolving the issue, as the option can impact agent startup performance.

When the issue happens, collect the three items above as well as the build console log output of an impacted job (if applicable).
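As an illustration of retrieving the agent configuration file, the two locations given in item 2 can be built with small helpers like the ones below. The function names, the example agent name, the host name, and the credentials are all hypothetical placeholders; only the path and URL layout come from the text above.

```shell
# Hypothetical helpers that build the two config.xml locations described in item 2.
agent_config_path() {
    # $1 = JENKINS_HOME directory, $2 = agent name
    printf '%s/nodes/%s/config.xml\n' "$1" "$2"
}

agent_config_url() {
    # $1 = Jenkins base URL without a trailing slash, $2 = agent name
    printf '%s/computer/%s/config.xml\n' "$1" "$2"
}

# Example usage (paths, URL, and credentials are placeholders):
#   cp "$(agent_config_path "$JENKINS_HOME" my-agent)" ./my-agent-config.xml
#   curl -fsS -u "user:api_token" -o my-agent-config.xml \
#        "$(agent_config_url https://jenkins.example.com my-agent)"
```

Agent names containing spaces or special characters would need URL encoding before being used in the browser form of the address.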

About the Ping Thread

See the Ping Thread documentation for details.

The PingThread checks that the agent is able to execute a command sent by the controller (a NOOP request).

The ping command may fail to execute for the following reasons:

  • Overloaded queue: all agent workers are busy. On large machines, you can increase the number of remoting TaskPool workers

  • Network overloaded

In some cases, disabling the PingThread can help

So, if this is the stack trace you are seeing all the time, you should then disable the PingThread. The side effect is that the agent will hang if communication between the controller and the agent really is failing, but this is useful: you can then use jstack to take a thread dump on both sides, the controller and the agent itself.
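Taking the thread dumps mentioned above can be sketched as follows. The helper name take_thread_dump, the output file naming scheme, and the example PID are assumptions for illustration; jstack -l is the standard JDK invocation that also prints lock information.

```shell
# Hypothetical helper: write a thread dump of a running JVM to a timestamped file.
# $1 = PID of the Jenkins controller or agent JVM
# $2 = label used in the file name, for example "controller" or "agent"
take_thread_dump() {
    out="threaddump-$2-$(date '+%Y%m%d-%H%M%S').txt"
    # Print the file name only when jstack succeeded.
    jstack -l "$1" > "$out" && printf '%s\n' "$out"
}

# Example (run on each machine as the user that owns the JVM process):
#   take_thread_dump 12345 controller
```

Run it once on the controller and once on the agent while the agent is hung, so that both sides of the channel can be compared.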

Here is an example of a stack trace caused by a PingThread failure:

Caused by: java.io.IOException
    at hudson.remoting.Channel.close(Channel.java:1163)
    at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:118)
    at hudson.remoting.PingThread.ping(PingThread.java:126)
    at hudson.remoting.PingThread.run(PingThread.java:85)
Caused by: java.util.concurrent.TimeoutException: Ping started at 1474633728617 hasn't completed by 1474633968617
    ... 2 more
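The two numbers in the TimeoutException message are timestamps in milliseconds since the Unix epoch, so their difference shows how long the ping waited before the channel was closed. A small worked calculation, using variable names chosen here for illustration:

```shell
# Timestamps copied from the TimeoutException message (milliseconds since the Unix epoch).
ping_started_ms=1474633728617
ping_deadline_ms=1474633968617

# Time between the ping start and the reported deadline, converted to seconds.
elapsed_s=$(( (ping_deadline_ms - ping_started_ms) / 1000 ))
echo "Ping waited ${elapsed_s} seconds before timing out"
```

In this example the difference is 240000 ms, that is 240 seconds (four minutes) without a reply from the agent.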

Workaround

Try the other agent implementation (switch from the SSH Build Agents plugin to the CloudBees SSH Build Agents plugin, or vice versa).

References

TCP retransmission timeout

Linux
sysctl -w net.ipv4.tcp_keepalive_time=120
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=8
sysctl -w net.ipv4.tcp_fin_timeout=30
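Windows (the values below are expressed in seconds; the KeepAliveTime and KeepAliveInterval registry values are stored in milliseconds, so convert them before applying)
KeepAliveInterval = 30
KeepAliveTime = 120
TcpMaxDataRetransmissions = 8
TcpTimedWaitDelay = 30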
macOS (keepidle and keepintvl are in milliseconds)
net.inet.tcp.keepidle=120000
net.inet.tcp.keepintvl=30000
net.inet.tcp.keepcnt=8
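Settings applied with sysctl -w on Linux are lost at reboot. One way to persist them, sketched here under the assumption that the system reads /etc/sysctl.d, is to write the same keys to a configuration fragment. The helper name and the file name are hypothetical.

```shell
# Hypothetical helper: print the Linux keepalive settings listed above
# in sysctl configuration-file syntax (key = value).
print_keepalive_sysctl_conf() {
    cat <<'EOF'
net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 8
net.ipv4.tcp_fin_timeout = 30
EOF
}

# Example (as root; the file name is an assumption):
#   print_keepalive_sysctl_conf > /etc/sysctl.d/99-jenkins-tcp-keepalive.conf
#   sysctl --system    # reload all sysctl configuration files
```

Apply the same values on both the controller and the agent hosts so that keepalive behavior matches on each end of the connection.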