Kubernetes agents are failing with 'SocketTimeoutException: sent ping but didn’t receive pong'

Issue

My pods are created, but some agents fail to connect or get disconnected with an error similar to the following in the build console output or the controller logs:

java.net.SocketTimeoutException: sent ping but didn't receive pong within XXXXms (after XX successful ping/pongs)
    at okhttp3.internal.ws.RealWebSocket.writePingFrame(RealWebSocket.java:546)
    at okhttp3.internal.ws.RealWebSocket$PingRunnable.run(RealWebSocket.java:530)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

Environment

CloudBees CI (CloudBees Core) from 2.164.3.2 to 2.190.2.2
CloudBees CI (CloudBees Core) on modern cloud platforms - Managed controller from 2.164.3.2 to 2.190.2.2
CloudBees CI (CloudBees Core) on modern cloud platforms - Operations Center from 2.164.3.2 to 2.190.2.2
CloudBees CI (CloudBees Core) on traditional platforms - Client controller from 2.164.3.2 to 2.190.2.2
CloudBees CI (CloudBees Core) on traditional platforms - Operations Center from 2.164.3.2 to 2.190.2.2
Kubernetes Plugin from 1.14.8 to 1.19.3

Related Issue(s)

okhttp 3.10.0 and later enforce the ping interval more aggressively.
JENKINS-50429 (issue introduced): Kubernetes plugin uses fabric8/kubernetes-client >= 4.1.2, which uses okhttp >= 3.10.0
JENKINS-58301 (issue reported)
fabric8io/kubernetes-client #1767 fixed in fabric8/kubernetes-client 4.6.0

Explanation

The exception java.net.SocketTimeoutException: sent ping but didn't receive pong within XXXXms (after XX successful ping/pongs) is caused by a ping failure in the HTTP client that maintains the connection to Kubernetes through the okhttp library. The Kubernetes plugin relies on the fabric8/kubernetes-client, which in turn relies on the okhttp library.

Since its early days, the fabric8/kubernetes-client has set the ping interval to 1s (1000ms) by default. This was "stable" up until version 4.1.2, when the client started to use version 3.10.0 of the okhttp library. Starting from that okhttp version, the HTTP client connection is closed on any ping failure, and the recommendation is to use a ping interval of 30s, or greater if necessary. Since fabric8/kubernetes-client uses a value of 1s, the connection is rather unstable, depending on the load on the network, the client and the server. The Jenkins Kubernetes plugin has been impacted since version 1.14.8.

In Jenkins, the symptoms are agent connection failures or agent disconnections due to a socket timeout: java.net.SocketTimeoutException: sent ping but didn't receive pong within XXXXms. The exception may appear in the agent logs or even in the build console output.

If the value in the message is 1000ms, or anything lower than the recommended 30000ms, then this is most likely the issue. If the value is greater than or equal to 30000ms, the cause may instead be an underlying network problem or poor performance on one end of the connection (Jenkins controller or agent unresponsive).
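
The check described above can be sketched in a few lines of Java. This is only an illustration for triaging log lines: the class PingTimeoutTriage and its methods are hypothetical names, not part of Jenkins, the Kubernetes plugin or okhttp, and the 30000ms threshold is the recommended value quoted in this article.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Jenkins or the Kubernetes plugin.
class PingTimeoutTriage {
    // Ping interval recommended by the okhttp documentation, as quoted in this article.
    static final long RECOMMENDED_INTERVAL_MS = 30_000L;

    // Captures the value in "... didn't receive pong within 1000ms ...".
    // Accepts a straight or a curly apostrophe; at most 18 digits so the value fits in a long.
    private static final Pattern TIMEOUT =
            Pattern.compile("sent ping but didn['\u2019]t receive pong within (\\d{1,18})ms");

    // Returns the timeout in milliseconds, or empty if the line is not a ping failure.
    static OptionalLong extractTimeoutMillis(String logLine) {
        Matcher m = TIMEOUT.matcher(logLine);
        return m.find() ? OptionalLong.of(Long.parseLong(m.group(1))) : OptionalLong.empty();
    }

    // Applies the rule of thumb from the paragraph above.
    static String classify(long timeoutMillis) {
        return timeoutMillis < RECOMMENDED_INTERVAL_MS
                ? "ping interval too low: most likely this issue"
                : "ping interval adequate: check the network and controller/agent responsiveness";
    }

    public static void main(String[] args) {
        String line = "java.net.SocketTimeoutException: sent ping but didn't receive pong "
                + "within 1000ms (after 12 successful ping/pongs)";
        extractTimeoutMillis(line)
                .ifPresent(ms -> System.out.println(ms + "ms -> " + classify(ms)));
    }
}
```

The sample line in main uses made-up values (1000ms, 12 pings) in the format of the exception shown in the Issue section; replace it with a line from your own logs.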

Resolution

The solution is to upgrade the Kubernetes plugin to version 1.19.3, which is available under the CloudBees Assurance Program since version 2.190.2.2 of CloudBees Core.

Workaround

It is possible to increase the ping interval by adding -Dkubernetes.websocket.ping.interval=<milliseconds> to the startup arguments of the impacted instance(s). The interval recommended by the okhttp documentation is 30 seconds; as the argument is expressed in milliseconds, set it to 30000.

See "How to add Java arguments to Jenkins" to learn how to add this argument in your environment.
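
With the recommended value, the argument reads -Dkubernetes.websocket.ping.interval=30000. As a rough way to confirm that a JVM actually received the argument, the system property can be read back. The sketch below is an assumption-laden illustration: PingIntervalCheck is a hypothetical class, it only inspects the property of the JVM it runs in, and it does not show how the fabric8/kubernetes-client itself consumes the value.

```java
// Hypothetical check for illustration; not part of Jenkins or the Kubernetes plugin.
class PingIntervalCheck {
    // System property named in the workaround above.
    static final String PROPERTY = "kubernetes.websocket.ping.interval";
    // Interval recommended by the okhttp documentation, in milliseconds.
    static final long RECOMMENDED_MS = 30_000L;

    // Describes a raw property value; null means the argument was not passed.
    static String describe(String rawValue) {
        if (rawValue == null) {
            return PROPERTY + " is not set; the client library default is used";
        }
        try {
            long ms = Long.parseLong(rawValue.trim());
            return ms >= RECOMMENDED_MS
                    ? ms + "ms meets the recommended " + RECOMMENDED_MS + "ms"
                    : ms + "ms is below the recommended " + RECOMMENDED_MS + "ms";
        } catch (NumberFormatException e) {
            return "'" + rawValue + "' is not a valid number of milliseconds";
        }
    }

    public static void main(String[] args) {
        // Example launch (Java 11+ single-file mode):
        //   java -Dkubernetes.websocket.ping.interval=30000 PingIntervalCheck.java
        System.out.println(describe(System.getProperty(PROPERTY)));
    }
}
```

On a running Jenkins instance, the same System.getProperty call can be evaluated from within the instance's own JVM, since a standalone program like this one only sees the arguments passed to its own launch.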