Issue
-
Since version 2.361.1, WebSocket agents are intermittently disconnected. The controller logs exceptions like the following:
org.eclipse.jetty.websocket.api.exceptions.WebSocketTimeoutException: Connection Idle Timeout
    [...]
    at org.eclipse.jetty.io.AbstractEndPoint.onIdleExpired(AbstractEndPoint.java:407)
    at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:170)
    at org.eclipse.jetty.io.IdleTimeout.idleCheck(IdleTimeout.java:112)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.eclipse.jetty.websocket.core.exception.WebSocketTimeoutException: Connection Idle Timeout
Environment
-
CloudBees CI on modern cloud platforms - operations center >= 2.361.1 and < 2.387.2
-
CloudBees CI on modern cloud platforms - managed controller >= 2.361.1 and < 2.387.2
-
CloudBees CI on traditional platforms - operations center >= 2.361.1 and < 2.387.2
-
CloudBees CI on traditional platforms - client controller >= 2.361.1 and < 2.387.2
-
Jenkins >= 2.361 and < 2.387.2
-
Inbound Agents Websockets
Explanation
Starting with 2.361.1, Jenkins uses Jetty 10, which went through a significant rewrite of its WebSocket connection handling. In particular, WebSocket connections managed by Jetty are subject to an idle timeout of 30 seconds (as opposed to 5 minutes in Jetty 9). The WebSocket connections of inbound agents are kept active by a server ping that is sent every 30 seconds by default, which leaves little room for delays caused by JVM performance or network latency, so agent WebSocket connections are very likely to be closed over time. This is therefore a configuration issue introduced in 2.361.1: to guarantee that WebSocket connections are kept alive, the ping interval must be lower than the idle timeout.
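The timing argument above can be illustrated with a small sketch. This is a simplified model and not Jetty's or Jenkins' actual logic: it assumes the idle timer restarts whenever a ping arrives and that the connection is closed if the next ping (nominal interval plus some delay) does not arrive before the idle timeout. The class name `PingMargin`, its methods, and the 200 ms delay are hypothetical and chosen for illustration only.

```java
import java.time.Duration;

public class PingMargin {

    /** Slack between the ping interval and the idle timeout (zero or negative means no room for delay). */
    static Duration margin(Duration idleTimeout, Duration pingInterval) {
        return idleTimeout.minus(pingInterval);
    }

    /**
     * Simplified model: the connection survives only if the next ping,
     * delayed by `delay`, arrives strictly before the idle timeout expires.
     */
    static boolean survives(Duration idleTimeout, Duration pingInterval, Duration delay) {
        return pingInterval.plus(delay).compareTo(idleTimeout) < 0;
    }

    public static void main(String[] args) {
        Duration idleTimeout = Duration.ofSeconds(30); // Jetty 10 idle timeout, per the explanation above
        Duration delay = Duration.ofMillis(200);       // hypothetical GC pause or network latency

        // Default ping interval of 30 seconds: any delay at all exceeds the timeout.
        System.out.println(survives(idleTimeout, Duration.ofSeconds(30), delay)); // prints "false"

        // Ping interval of 15 seconds: about 15 seconds of slack remain.
        System.out.println(survives(idleTimeout, Duration.ofSeconds(15), delay)); // prints "true"
    }
}
```

Under these assumptions, a 30-second ping interval leaves a margin of zero against a 30-second idle timeout, whereas a 15-second interval tolerates delays of up to roughly 15 seconds.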
In addition, several issues related to the handling of WebSocket channel closure and reconnection have been discovered since, most likely because WebSocket agents are disconnected more often due to JENKINS-69955: WebSocketTimeoutException: Connection Idle Timeout:
-
JENKINS-70103: Agent websocket sessions not cleared up on connection close, fixed in 2.375.2
-
JENKINS-70105: Slow down re-connects when using websockets, fixed in 2.375.2
-
JENKINS-69890: Websocket Agent disconnection: possible deadlock at agent side, fixed in 2.375.1
-
JENKINS-70414: Ping thread failures on agent side were ignored, fixed in remoting 3107.v665000b_51092, which will be included in the 2.387 LTS. The workaround is to use -Dhudson.remoting.Launcher.pingIntervalSec=600 to reactivate the agent ping thread.
Resolution
This problem is fixed in Jenkins weekly 2.395 and backported to Jenkins LTS 2.387.2.
The solution is to upgrade CloudBees CI to version 2.387.2.3 or later.
Workaround
The workaround is to set the ping interval to 15 seconds or lower by adding the system property -Djenkins.websocket.pingInterval=15 to the controller JVM.
For more information on how to add startup arguments, refer to How to add Java arguments to Jenkins.
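After restarting the controller, it can be useful to confirm that the property actually reached the JVM. The following is a minimal sketch, assuming the check runs inside the controller JVM (for example, adapted for the script console). It only reads the raw JVM system property with the standard `Integer.getInteger` call and falls back to the 30-second default mentioned above; it does not call any Jenkins API, and the class name `PingIntervalCheck` and its methods are hypothetical.

```java
public class PingIntervalCheck {

    // System property named in the workaround above.
    static final String PROPERTY = "jenkins.websocket.pingInterval";

    // Default ping interval in seconds, as described in the explanation.
    static final int DEFAULT_SECONDS = 30;

    // Upper bound recommended by the workaround.
    static final int RECOMMENDED_MAX_SECONDS = 15;

    /** Reads the raw JVM system property, falling back to the default if unset or not a number. */
    static int configuredPingInterval() {
        return Integer.getInteger(PROPERTY, DEFAULT_SECONDS);
    }

    /** True if the given interval satisfies the "15 seconds or lower" recommendation. */
    static boolean followsWorkaround(int pingIntervalSeconds) {
        return pingIntervalSeconds <= RECOMMENDED_MAX_SECONDS;
    }

    public static void main(String[] args) {
        int interval = configuredPingInterval();
        System.out.println(PROPERTY + " = " + interval + "s, workaround applied: "
                + followsWorkaround(interval));
    }
}
```

Run with `-Djenkins.websocket.pingInterval=15`, the check reports the workaround as applied. Without the flag, it reports the 30-second default.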