Dedicated JNLP agents Troubleshooting guide


Symptoms

  • I am not able to connect a JNLP agent to a Jenkins instance

  • The build has failed because the connection got broken

  • The build is stalled in the queue waiting for the agent

  • The agent is disconnected and cannot connect again

  • A "Channel is broken" warning appears in the logs

  • Any of the exceptions listed below

Diagnosis/Treatment

Before starting to diagnose the issue, gather and verify the following:

1. Requirements

1.A Ensure that the controller and agent run at least the same Java major version

A good practice is to run exactly the same Java version on both the Jenkins controller and the agent. When this is not possible, both must at least run the same baseline (major version). Check Supported JDK for CloudBees Core.

Run java -version on both the Jenkins controller box and the agent to check which Java version each one is running.
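The major-version comparison can be scripted. The following is a minimal sketch, not part of any Jenkins tooling: java_major and local_java_major are hypothetical helper names, and the parsing assumes the usual version "…" line printed by java -version.

```shell
# Extract the major version from a Java version string.
# Handles the legacy "1.x" scheme (1.8.0_292 means Java 8) and the modern one (11.0.12).
java_major() {
  v=$1
  case "$v" in
    1.*) v=${v#1.} ;;            # drop the legacy "1." prefix
  esac
  printf '%s\n' "${v%%[!0-9]*}"  # keep only the leading digits
}

# Major version of whichever java is first on PATH.
# java -version writes to stderr, hence the 2>&1.
local_java_major() {
  raw=$(java -version 2>&1 | sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1)
  java_major "$raw"
}
```

Run local_java_major on the controller box and on the agent, then compare the two numbers; they should be equal.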

1.B Ensure that the version of agent.jar matches the one served by the controller

The main problem with using JNLP as the agent launcher is that when you upgrade Jenkins, agent.jar is not automatically upgraded on the agent, as happens out of the box with the SSH Launcher. On Windows, this can be solved by using JNLP + winsw and adding the Remoting download to the service configuration: <download from="${JENKINS_URL}/jnlpJars/agent.jar" to="%BASE%\agent.jar"/>.

Check that agent.jar is the same on both sides, for example by comparing the output of md5sum agent.jar. agent.jar can be downloaded from the Jenkins controller at the URL below:

https://<JENKINS_URL>/jnlpJars/agent.jar
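The checksum comparison can be sketched as a small helper. This is an illustration only: same_checksum is a hypothetical function name, the /tmp path is arbitrary, and the sketch assumes md5sum is installed on the agent box.

```shell
# Succeed when the two files given as arguments have identical MD5 checksums.
# An unreadable file yields an empty checksum and therefore a failure.
same_checksum() {
  sum_a=$(md5sum < "$1" | cut -d' ' -f1)
  sum_b=$(md5sum < "$2" | cut -d' ' -f1)
  [ -n "$sum_a" ] && [ "$sum_a" = "$sum_b" ]
}

# Example usage (JENKINS_URL stands for your controller URL):
#   curl -fsSL -o /tmp/controller-agent.jar "$JENKINS_URL/jnlpJars/agent.jar"
#   same_checksum agent.jar /tmp/controller-agent.jar || echo "agent.jar differs from the controller copy"
```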

Partial solutions:

1.C Connectivity checks

Use jenkins-cli to check the connection

In the agent box, download the CLI and run a help command in your favorite mode. For example, using http mode:

java -jar jenkins-cli.jar [-s $JENKINS_URL] -auth <user>:<token> help

Check that the agent is able to see the Jenkins headers

# curl -IvL <JENKINS_URL>
curl -IvL https://jenkins:8443

On a Windows box, curl can be obtained through, for example, the curl Download Wizard or Cygwin.

Check that the JNLP port is accessible to the agent

# telnet <JENKINS_HOST> <JNLP_PORT>
telnet jenkins.host.example.com 50234
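When the JNLP port number is not known, it can usually be read from the response headers instead of the configuration page. The sketch below is a hedged example: header_value is a hypothetical helper, and it assumes that the controller advertises the port in an X-Jenkins-JNLP-Port header on the /tcpSlaveAgentListener/ endpoint, which should be confirmed against your Jenkins version.

```shell
# Print the value of one HTTP header from a raw response read on stdin.
# The header name is matched case-insensitively; only the first match is printed.
header_value() {
  tr -d '\r' | awk -v name="$1" '
    BEGIN { prefix = tolower(name) ":" }
    !found && tolower(substr($0, 1, length(prefix))) == prefix {
      value = substr($0, length(prefix) + 1)
      sub(/^[ \t]+/, "", value)   # strip the space after the colon
      print value
      found = 1
    }'
}

# Example usage (JENKINS_URL and JENKINS_HOST stand for your own values):
#   port=$(curl -sI "$JENKINS_URL/tcpSlaveAgentListener/" | header_value X-Jenkins-JNLP-Port)
#   telnet "$JENKINS_HOST" "$port"
```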

2. Use a different Launch mechanism

For Jenkins >= 2.204.1 LTS, switch to a different Launch mechanism: Connect directly to TCP port.

3. Known issues

3.A. Unable to load class once the loading was interrupted

JENKINS-36991 (Unable to load class once the loading was interrupted) is resolved, and the fix was released in remoting 2.61.

To confirm what remoting version your agent.jar (formerly slave.jar) file is currently tied to, run the following command in the same directory as your .jar file and check the parameter REMOTING_VERSION in output:

jar xf agent.jar META-INF/MANIFEST.MF
more META-INF/MANIFEST.MF

Jenkins log / Build console output log
java.lang.NoClassDefFoundError: Could not initialize class jenkins.model.Jenkins
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:191)
at Script1.class$(Script1.groovy)
at Script1.$get$$class$jenkins$model$Jenkins(Script1.groovy)
at Script1.run(Script1.groovy:1)
at groovy.lang.GroovyShell.evaluate(GroovyShell.java:580)
at groovy.lang.GroovyShell.evaluate(GroovyShell.java:618)
at groovy.lang.GroovyShell.evaluate(GroovyShell.java:589)
at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:142)
at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:114)
at hudson.remoting.UserRequest.perform(UserRequest.java:121)
at hudson.remoting.UserRequest.perform(UserRequest.java:49)
at hudson.remoting.Request$2.run(Request.java:326)
at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Agent log
Slave.jar version: 2.52
This is a Unix slave
Evacuated stdout
Slave successfully connected and online
Jul 27, 2016 8:36:57 AM jenkins.model.Jenkins <clinit>
SEVERE: Failed to load Jenkins.class
hudson.remoting.RemotingSystemException: java.lang.InterruptedException
at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:266)
at com.sun.proxy.$Proxy5.fetch3(Unknown Source)
at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:171)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at com.thoughtworks.xstream.XStream.buildMapper(XStream.java:590)
at com.thoughtworks.xstream.XStream.<init>(XStream.java:568)
at com.thoughtworks.xstream.XStream.<init>(XStream.java:496)
at com.thoughtworks.xstream.XStream.<init>(XStream.java:465)
at com.thoughtworks.xstream.XStream.<init>(XStream.java:411)
at com.thoughtworks.xstream.XStream.<init>(XStream.java:350)
at hudson.util.XStream2.<init>(XStream2.java:88)
at jenkins.model.Jenkins.<clinit>(Jenkins.java:4217)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:191)
at Script1.class$(Script1.groovy)
at Script1.$get$$class$jenkins$model$Jenkins(Script1.groovy)
at Script1.run(Script1.groovy:1)
at groovy.lang.GroovyShell.evaluate(GroovyShell.java:580)
at groovy.lang.GroovyShell.evaluate(GroovyShell.java:618)
at groovy.lang.GroovyShell.evaluate(GroovyShell.java:589)
at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:142)
at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:114)
at hudson.remoting.UserRequest.perform(UserRequest.java:121)
at hudson.remoting.UserRequest.perform(UserRequest.java:49)
at hudson.remoting.Request$2.run(Request.java:326)
at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at hudson.remoting.Request.call(Request.java:147)
at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:253)
... 30 more
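The agent log above reports Slave.jar version: 2.52, which is older than the 2.61 release carrying the fix. A check of this kind can be sketched as follows; version_ge is a hypothetical helper that compares dotted versions numerically, component by component, and it assumes purely numeric components.

```shell
# Succeed when dotted version $1 is greater than or equal to $2.
# Components are compared as numbers, so 2.9 is lower than 2.61.
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    na = split(a, x, "."); nb = split(b, y, ".")
    n = (na > nb) ? na : nb
    for (i = 1; i <= n; i++) {
      xi = (i <= na) ? x[i] + 0 : 0   # missing components count as 0
      yi = (i <= nb) ? y[i] + 0 : 0
      if (xi > yi) exit 0
      if (xi < yi) exit 1
    }
    exit 0
  }'
}

# Example usage with the version reported in the agent log:
#   version_ge 2.52 2.61 || echo "agent.jar predates the JENKINS-36991 fix"
```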

3.B. Intermittent Invalid Object ID in remoting module

JENKINS-23271 Intermittent Invalid Object ID in remoting module

This issue is fixed, and the fix was released in Jenkins core versions higher than 2.32.

It happens frequently on Java 8 due to its object management logic, and it causes issues in task execution (build failures, agent disconnects).

Jenkins log / Build console output log
FATAL: Invalid object ID 18649 iota=18470
java.lang.IllegalStateException: Invalid object ID 18469 iota=18470
at hudson.remoting.ExportTable.diagnoseInvalidId(ExportTable.java:277)

3.C. Ping Thread

Check the Ping Thread Documentation here.

The PingThread checks that the agent is able to execute a command from the controller (a NOOP request).

The ping command may fail to execute for the following reasons:

  • Overloaded queue, all agent workers are busy → On big boxes you can increase the number of remoting TaskPool workers

  • Network overloaded

In some cases, disabling the PingThread can help.

If the stack trace below is the one you are seeing all the time, disable the PingThread. The side effect is that the agent is expected to hang when communication between the controller and the agent fails. The benefit is that you will then be able to take a thread dump on both sides, controller and agent.

Jenkins log / Build console output log
Caused by: java.io.IOException
at hudson.remoting.Channel.close(Channel.java:1163)
at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:118)
at hudson.remoting.PingThread.ping(PingThread.java:126)
at hudson.remoting.PingThread.run(PingThread.java:85)
Caused by: java.util.concurrent.TimeoutException: Ping started at 1474633728617 hasn't completed by 1474633968617
... 2 more
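The two numbers in the TimeoutException message are timestamps in milliseconds, so their difference shows how long the ping waited before the channel was closed. A minimal sketch of that arithmetic follows; ping_elapsed_seconds is a hypothetical helper that assumes the message format shown in the log above.

```shell
# Print how many seconds elapsed between the two millisecond timestamps
# of a "Ping started at X hasn't completed by Y" message. Fails if the
# message does not match that format.
ping_elapsed_seconds() {
  stamps=$(printf '%s\n' "$1" |
    sed -n 's/.*Ping started at \([0-9][0-9]*\) hasn.t completed by \([0-9][0-9]*\).*/\1 \2/p')
  [ -n "$stamps" ] || return 1
  start=${stamps% *}   # first timestamp
  end=${stamps#* }     # second timestamp
  echo $(( ($end - $start) / 1000 ))
}
```

For the message above the result is 240, that is, the ping waited four minutes before giving up.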

3.D. JNLP Cloud Agents are disconnected on start process

It affects Jenkins core higher than 2.28

The requirements of the JNLP connection receiver were relaxed, because it was rejecting connections from agents not using JNLPComputerLauncher (e.g. from Agent Setup, vSphere Cloud and other plugins). Now the connection is accepted from launchers implementing other proxying and filtering Launcher implementations. Particular plugins may require setting the -Djenkins.slaves.DefaultJnlpSlaveReceiver.disableStrictVerification=true system property in the controller JVM to allow agents to connect. See JENKINS-39232, a regression in 2.28.

4. HA / LB / Reverse proxy bypass

5. Clear the Java Web Start Cache

If you see an error like the one below when starting the JNLP file, run the command javaws -clearcache to clear the Java Web Start cache.

java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(Unknown Source)
at java.net.SocketInputStream.read(Unknown Source)
at sun.security.ssl.InputRecord.readFully(Unknown Source)
at sun.security.ssl.InputRecord.read(Unknown Source)
at sun.security.ssl.SSLSocketImpl.readRecord(Unknown Source)
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(Unknown Source)
at sun.security.ssl.SSLSocketImpl.startHandshake(Unknown Source)
at sun.security.ssl.SSLSocketImpl.startHandshake(Unknown Source)
at sun.net.www.protocol.https.HttpsClient.afterConnect(Unknown Source)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.access$200(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection$9.run(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection$9.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.security.AccessController.doPrivilegedWithCombiner(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
at com.sun.deploy.net.HttpUtils.followRedirects(Unknown Source)
at com.sun.deploy.net.BasicHttpRequest.doRequest(Unknown Source)
at com.sun.deploy.net.BasicHttpRequest.doGetRequestEX(Unknown Source)
at com.sun.deploy.cache.ResourceProviderImpl.checkUpdateAvailable(Unknown Source)
at com.sun.deploy.cache.ResourceProviderImpl.isUpdateAvailable(Unknown Source)
at com.sun.deploy.cache.ResourceProviderImpl.getResource(Unknown Source)
at com.sun.deploy.cache.ResourceProviderImpl.getResource(Unknown Source)
at com.sun.javaws.LaunchDownload$DownloadTask.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

6. JNLP Windows agent

Runaway agent process

  • In particular cases jenkins-agent.exe gets forcibly terminated (user action, fatal remoting failure, Windows service hard stop)

  • The java.exe process running the agent may be leaked

  • It causes multiple "slave is already connected" messages in the Jenkins log

7. TCP retransmission timeout (OS settings) - consider increasing

7.A Linux

sysctl -w net.ipv4.tcp_keepalive_time=120
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=8
sysctl -w net.ipv4.tcp_fin_timeout=30
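To see what the Linux values above mean in practice, the worst-case time needed to detect a dead idle connection can be estimated as the idle time before the first probe plus one interval per unanswered probe. The sketch below does that arithmetic; keepalive_detection_seconds is a hypothetical helper, and the formula is the usual approximation of TCP keepalive behavior rather than an exact kernel guarantee.

```shell
# Estimate the seconds needed to declare an idle connection dead:
#   $1 = tcp_keepalive_time, $2 = tcp_keepalive_intvl, $3 = tcp_keepalive_probes
keepalive_detection_seconds() {
  echo $(( $1 + $2 * $3 ))
}

# Example usage with the values currently active on the box:
#   keepalive_detection_seconds \
#     "$(sysctl -n net.ipv4.tcp_keepalive_time)" \
#     "$(sysctl -n net.ipv4.tcp_keepalive_intvl)" \
#     "$(sysctl -n net.ipv4.tcp_keepalive_probes)"
```

With the suggested settings (120, 30, 8) the estimate is 360 seconds.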

7.B Windows

KeepAliveInterval = 30000
KeepAliveTime = 120000
TcpMaxDataRetransmissions = 8
TcpTimedWaitDelay = 30

7.C Mac

net.inet.tcp.keepidle=120000
net.inet.tcp.keepintvl=30000
net.inet.tcp.keepcnt=8

Remoting 2.62.1 includes an improvement to keepalive handling on the client (agent) side.

8. When all else fails

  • Try to add this Java property on controller -Djenkins.slaves.NioChannelSelector.disabled=true

  • Classic I/O is still available when NIO is disabled; NIO adds complexity in exchange for improved performance

  • Try to add this Java property on controller -Djenkins.slaves.JnlpSlaveAgentProtocol3.enabled=false

9. When no secret is included in connection string

When the agent launch command is missing -secret, or you see the stack trace below during agent connection, the cause is normally the permissions set for the system user anonymous.

Failing to obtain $JENKINS_URL/computer/$AGENT_NAME/jenkins-agent.jnlp
java.io.IOException: Failed to load $JENKINS_URL/computer/$AGENT_NAME/jenkins-agent.jnlp: 403 Forbidden
at hudson.remoting.Launcher.parseJnlpArguments(Launcher.java:521)
at hudson.remoting.Launcher.run(Launcher.java:347)
at hudson.remoting.Launcher.main(Launcher.java:298)
Waiting 10 seconds before retry

Removing all agent permissions under Manage Jenkins > Manage Roles for the system user anonymous should resolve the issue.
