Issue
- My pods are getting created, but some pods are failing or getting disconnected with an error similar to the following in the console output or controller logs:

java.net.SocketTimeoutException: sent ping but didn't receive pong within XXXXms (after XX successful ping/pongs)
    at okhttp3.internal.ws.RealWebSocket.writePingFrame(RealWebSocket.java:546)
    at okhttp3.internal.ws.RealWebSocket$PingRunnable.run(RealWebSocket.java:530)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Environment
- CloudBees CI (CloudBees Core) from 2.164.3.2 to 2.190.2.2
- CloudBees CI (CloudBees Core) on modern cloud platforms - Managed controller from 2.164.3.2 to 2.190.2.2
- CloudBees CI (CloudBees Core) on modern cloud platforms - Operations Center from 2.164.3.2 to 2.190.2.2
- CloudBees CI (CloudBees Core) on traditional platforms - Client controller from 2.164.3.2 to 2.190.2.2
- CloudBees CI (CloudBees Core) on traditional platforms - Operations Center from 2.164.3.2 to 2.190.2.2
- Kubernetes Plugin from 1.14.8 to 1.19.3
Related Issue(s)
- okhttp 3.10.0 is more aggressive on ping interval.
- JENKINS-50429 (issue introduced): Kubernetes plugin uses fabric8/kubernetes-client > 4.1.2 that uses okhttp > 3.10.0
- JENKINS-58301 (issue reported)
- fabric8io/kubernetes-client #1767 fixed in fabric8/kubernetes-client 4.6.0
Explanation
The exception java.net.SocketTimeoutException: sent ping but didn't receive pong within XXXXms (after XX successful ping/pongs) is caused by a ping failure in the HTTP client that maintains the connection to Kubernetes through the okhttp library. The Kubernetes plugin relies on the fabric8/kubernetes-client, which in turn relies on the okhttp library.
Since its early days, the fabric8/kubernetes-client has set the ping interval to 1ms by default. This was "stable" up until version 4.1.2, when it started to use version 3.10.0 of the okhttp library. Starting from that version of okhttp, the HTTP client connection is closed on any ping failure, and the recommendation is to use a ping interval of 30s, or greater if necessary. Since fabric8/kubernetes-client uses a value of 1ms, the connection is rather unstable, depending on the load on the network, the client, and the server. The Jenkins Kubernetes plugin has been impacted since version 1.14.8.
In Jenkins, the symptoms are agent connection failures or agent disconnections due to a socket timeout: java.net.SocketTimeoutException: sent ping but didn't receive pong within XXXXms. The exception may appear in the agent logs or even in a build's console output.
If the value reported in the exception is 1000ms, or anything lower than the recommended 30000ms, then this is most likely the issue.
If the value is greater than or equal to 30000ms, then it could well be an underlying network problem or a performance problem on one end of the connection (Jenkins controller or agent unresponsive).
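The check described above can be sketched in a few lines of Java. This is only an illustration: the class and method names (PingTimeoutCheck, timeoutMillis, classify) are hypothetical and are not part of Jenkins, the Kubernetes plugin, or okhttp. It assumes the log line carries the "pong within XXXXms" fragment shown in the exception above.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for triaging the exception; not part of any Jenkins API.
class PingTimeoutCheck {
    // Ping interval recommended in this article, in milliseconds.
    static final long RECOMMENDED_MS = 30_000L;

    // Captures the number in the "pong within XXXXms" part of the message.
    private static final Pattern TIMEOUT = Pattern.compile("pong within (\\d+)ms");

    // Extracts the reported timeout, or returns empty if the line does not match.
    static OptionalLong timeoutMillis(String logLine) {
        Matcher m = TIMEOUT.matcher(logLine);
        return m.find()
                ? OptionalLong.of(Long.parseLong(m.group(1)))
                : OptionalLong.empty();
    }

    // Applies the rule of thumb from the article to a single log line.
    static String classify(String logLine) {
        OptionalLong timeout = timeoutMillis(logLine);
        if (!timeout.isPresent()) {
            return "no ping timeout found";
        }
        return timeout.getAsLong() < RECOMMENDED_MS
                ? "below 30000ms: most likely this issue"
                : "30000ms or more: check network and controller/agent responsiveness";
    }

    public static void main(String[] args) {
        String line = "java.net.SocketTimeoutException: sent ping but didn't receive pong "
                + "within 1000ms (after 12 successful ping/pongs)";
        System.out.println(classify(line));
    }
}
```

The sample line in main uses made-up values (1000ms, 12 ping/pongs) purely to exercise the parser.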
Resolution
The solution is to upgrade the Kubernetes plugin to version 1.19.3, which has been available under the CloudBees Assurance Program since version 2.190.2.2 of CloudBees Core.
Workaround
It is possible to increase the ping interval by adding -Dkubernetes.websocket.ping.interval=<milliseconds> to the startup arguments of the impacted instance(s). The interval recommended by the okhttp documentation is 30 seconds; since the argument is expressed in milliseconds, set it to 30000.
See how to add Java arguments to Jenkins to learn how to add this argument to your environment.
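To confirm that the argument actually reached a JVM, the system property can be read back. The following is a minimal sketch, assuming only that -Dkubernetes.websocket.ping.interval is exposed as a standard Java system property. The class PingIntervalProperty is hypothetical, and the sketch shows what the JVM sees, not how the Kubernetes plugin or fabric8/kubernetes-client consumes the value.

```java
// Hypothetical check; not part of Jenkins or the Kubernetes plugin.
class PingIntervalProperty {
    // Property name from the workaround above.
    static final String PROPERTY = "kubernetes.websocket.ping.interval";

    // Interval recommended in this article, in milliseconds.
    static final long RECOMMENDED_MS = 30_000L;

    // Returns the configured interval in ms, or -1 if the property
    // is unset or is not a valid number.
    static long configuredMillis() {
        return Long.getLong(PROPERTY, -1L);
    }

    // True when the configured interval is at least the recommended 30000ms.
    static boolean meetsRecommendation() {
        return configuredMillis() >= RECOMMENDED_MS;
    }

    public static void main(String[] args) {
        long value = configuredMillis();
        if (value < 0) {
            System.out.println(PROPERTY + " is not set (or not a number)");
        } else {
            System.out.println(PROPERTY + "=" + value + "ms, recommended minimum met: "
                    + meetsRecommendation());
        }
    }
}
```

On Java 11 or later, this could be run as a single source file, for example with java -Dkubernetes.websocket.ping.interval=30000 PingIntervalProperty.java.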