Considerations for Kubernetes Clients Connections when using Kubernetes Plugin

Issue

The controller thread dump shows many OkHttp ConnectionPool and / or OkHttp WebSocket threads

Build execution fails with web socket exception such as

  Interrupted while waiting for websocket connection, you should increase the Max connections to Kubernetes API

  Timed out waiting for websocket connection. You should increase the value of system property org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout currently set at <currentTimeout> seconds

  io.fabric8.kubernetes.client.KubernetesClientException: not ready after <currentTimeout> MILLISECONDS

Kubernetes client requests are enqueued

Environment

Related Issue(s)

Explanation

The Kubernetes plugin manages references of Kubernetes Clients to talk to the Kubernetes API Server. In general, there is a reference of one active Kubernetes Client per Kubernetes Cloud.

Each Kubernetes Client can handle requests concurrently. To prevent Jenkins from overloading the Kubernetes API Server, there is a configurable limit on the number of concurrent connections that a Kubernetes Client / Cloud can make to the Kubernetes API Server. It is labelled Max connections to Kubernetes API in the advanced configuration of a Kubernetes Cloud, and the default value is 32. This sets both maxRequestsPerHost and maxRequests on the client dispatcher.

The Kubernetes plugin needs to send requests to the Kubernetes API Server for different kinds of operations, mainly to manage agent pods but also to execute steps inside non-jnlp containers. In fact, every time a durable task step or a step that uses a Launcher is executed in a container that is not the jnlp container, calls are made to the Kubernetes API. The Kubernetes plugin relies on the /exec API over a WebSocket connection to execute those steps. For durable task steps - such as a sh / bat / powershell step - the connection is intended to be open only while the step is being launched and is then quickly closed (even if the step runs for hours). However, other steps that use a Launcher, such as some publishers and checkout with some SCMs (but not Git or Subversion), hold this WebSocket connection open for the duration of the step.

When the limit is reached, requests are still submitted to the dispatcher but are enqueued until a dispatcher thread is available to handle them. Depending on the operation, a timeout applies while the request waits to be handled, after which it fails. For example, org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout is the timeout that applies when waiting for the WebSocket connection to be established in order to launch a durable task inside a container block. When a pipeline fails due to this timeout, it may mean that there are currently too many concurrent requests and that this particular request sat in the client dispatcher queue for the duration of that timeout.
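The queueing behaviour described above can be illustrated with a minimal, self-contained sketch. It does not use the Kubernetes client or its dispatcher: a plain java.util.concurrent.Semaphore stands in for the limit, and the names maxRequests and tryStart are hypothetical, chosen only for this illustration. It shows a request giving up after its timeout because all slots are held, then succeeding once a slot is freed.

```groovy
import java.util.concurrent.Semaphore
import java.util.concurrent.TimeUnit

// Hypothetical stand-in for the client dispatcher limit (the real default is 32).
int maxRequests = 2
def slots = new Semaphore(maxRequests)

// Wait up to timeoutMs for a free slot; false means the request timed out while queued.
def tryStart = { long timeoutMs -> slots.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS) }

def first  = tryStart(100)   // a slot is free
def second = tryStart(100)   // the last slot is taken
def third  = tryStart(100)   // limit reached: waits 100 ms, then gives up

slots.release()              // one long-running request completes
def fourth = tryStart(100)   // a slot is free again

println "first=${first} second=${second} third=${third} fourth=${fourth}"
```

In this sketch, raising maxRequests or the timeout both let the third request through, which mirrors the two kinds of workaround discussed in this article: a larger limit, or a longer wait in the queue.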

Since version 1.31.0 of the Kubernetes plugin, websocket connections are handled asynchronously at an even deeper level and the same error is reported differently. The error message io.fabric8.kubernetes.client.KubernetesClientException: not ready after <currentTimeout> MILLISECONDS generally shows up instead of [...] you should increase the Max connections to Kubernetes API, though it still reports the same problem. The kubernetes.websocket.timeout is the timeout that applies at this deeper level when waiting for a queued asynchronous websocket connection to be completed by an available thread.

Lastly, the current implementation of the container step is known to be fragile. It relies on the Kubernetes /exec API to launch commands and stream stdin / stderr / stdout when doing so. Many factors can impact this functionality depending on the environment. The communication between the controller and the kube-apiserver through the /exec API relies on the responsiveness and health of the Kubernetes nodes (kubelet) and also on other Kubernetes-specific behaviours that maintain the connection. Sporadic connection problems may occur in Kubernetes and are reported as java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal Server Error'. Pod eviction, kubelet restarts and an unresponsive container runtime are examples of problems that could cause this.

To sum up on this:

Each Kubernetes Cloud has a reference to an active Kubernetes Client
Each Kubernetes Cloud has its own limit, Max connections to Kubernetes API, on the number of concurrent requests that it can make to the Kubernetes API Server
By default, the maximum number of concurrent requests per Kubernetes Cloud is 32
Agent pod maintenance and Pipeline steps execution in container blocks are the most common operations that require Kubernetes API Server connections
Durable Task steps (sh / bat / powershell steps) executed in container blocks open a WebSocket connection to launch the step and close it quickly.
Other steps which use a Launcher hold the connection for the duration of the step.
Kubernetes plugin timeouts such as org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout are likely to expire when the limit stays reached for too long
Steps in container blocks are launched using the Kubernetes /exec API. This is fragile by design and may be subject to sporadic server connection issues.

Therefore, depending on a few things - such as the activity on the controller, the responsiveness of the Kubernetes API Server and nodes, the way pipelines are designed and step execution times - this limit may be reached rather easily. A workaround is to increase the limit, at the risk of overloading the Kubernetes API Server and node components. Other practices can be followed to avoid reaching this limit.

Notable Fixes / Improvements

Several critical improvements have been made to the way the plugin consumes Kubernetes API calls:

Kubernetes Plugin 1.16.6 / Durable Task 1.30: Improve durable tasks behavior so that a durable task step execution does not hold a connection for the entire execution of the step. See JENKINS-58290 for more details.
Kubernetes Plugin 1.27.1: Remove a redundant call to the Kubernetes API Server that checked that all containers are READY every time a Launcher is executed in a container. See #826 for more details.
Kubernetes Plugin 1.27.3: Fix the Max connections to Kubernetes API that was limited to 64. See JENKINS-58463 for more details.
Kubernetes Plugin 1.28.1: Fix an issue where expired (closed) clients were sometimes used. See #889 for more details.
Kubernetes Plugin 1.31.0: Upgrade to Kubernetes Client 5.10.1. Connection issues are now reported by the Kubernetes plugin as a generic KubernetesClientException: not ready after <currentTimeout> MILLISECONDS.
Kubernetes Plugin 3690.va_9ddf6635481: Introduction of a retry mechanism when launching steps in container blocks to mitigate sporadic issues when opening connections through the Exec API. See JENKINS-67664 and #1212 for more details.

We recommend staying up to date.

Solution

We recommend running version 3690.va_9ddf6635481 or later of the Kubernetes Plugin. It is available in CloudBees CI on modern cloud platforms 2.361.1.2.

There are different solutions that can help mitigate the problem:

Increase the Max connections to Kubernetes API
Increase the ContainerExecDecorator#websocketConnectionTimeout
Increase the kubernetes.websocket.timeout (since Kubernetes plugin 1.31.0)
Use a single container approach
Concatenate small and successive durable task steps
Properly manage agent pods resources

Increase the Max connections to Kubernetes API

This is the most straightforward solution / workaround:

Consider increasing the Max connections to Kubernetes API to allow more requests to be made concurrently. In CloudBees CI on modern cloud platforms, this must be done in the "kubernetes shared cloud" in Operations Center.

Although this sounds like the solution, do not raise this setting to a very large number at once, as this could overload the Kubernetes API Server as well as the controller: more concurrent requests mean that more resources are needed. Increase the value gradually to find a good spot. Above all, it is important to understand what is causing the limit to be reached by monitoring the number of running / queued requests; see the Check usage section below.

Increase the ContainerExecDecorator#websocketConnectionTimeout

This helps mitigate the problem when builds fail with Timed out waiting for websocket connection. You should increase the value of system property org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout currently set at <currentTimeout> seconds. It does not, however, reduce the number of concurrent requests:

Increase the websocket connection timeout for container blocks by adding the system property org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout=<timeInSeconds> on startup. The default value is 30, for 30 seconds. This requires a restart of the controller. See How to add Java arguments to Jenkins for more details.

This timeout allows a task to remain enqueued for a certain amount of time before it fails.

Increase the kubernetes.websocket.timeout

This helps mitigate the problem when builds fail with io.fabric8.kubernetes.client.KubernetesClientException: not ready after <currentTimeout> MILLISECONDS. It does not, however, reduce the number of concurrent requests:

Increase the Kubernetes client websocket connection timeout by adding the system property kubernetes.websocket.timeout=<timeInMilliseconds> on startup. The default value is 5000, for 5 seconds. This requires a restart of the controller. See How to add Java arguments to Jenkins for more details.

This timeout allows a task to remain enqueued for a certain amount of time before it fails.
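Since both timeouts are read from JVM system properties, one way to confirm what the controller was actually started with is to print them from Manage Jenkins > Script Console. The following is a minimal sketch that relies only on the standard System.getProperty call; the describe closure is a name introduced here for illustration. A property that was not passed on startup is reported as not set, in which case the plugin or client default applies.

```groovy
// Format a property name with its current value, or a note when it is not set.
def describe = { String name ->
  "${name} = ${System.getProperty(name) ?: '(not set, default applies)'}".toString()
}

[
  'org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout',
  'kubernetes.websocket.timeout'
].each { name ->
  println describe(name)
}
```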

Use a Single Container approach

This is the most reliable solution / workaround.

Running steps in the jnlp container tremendously reduces the number of calls made to the API Server. It also avoids sporadic connection issues, because those executions use the remoting channel and do not rely on the Kubernetes API to launch tasks. There is no general recommendation for either a single-container or a multi-container approach, although in this particular case a single-container approach helps tremendously.

Consider using a single container approach: build a jnlp image that contains the required build tools so that steps can run in the jnlp container.

When creating custom jnlp images, we recommend using cloudbees/cloudbees-core-agent as a base.
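As an illustration of the single container approach, the sketch below overrides the jnlp container of a pod template with a custom image and runs the build step directly on the agent, without a container block. The image name registry.example.com/ci/jnlp-with-tools:latest is a hypothetical placeholder for an image built on top of the recommended base, and the mvn command assumes that image ships Maven.

```groovy
podTemplate(containers: [
    // Hypothetical custom agent image that already contains the build tools.
    containerTemplate(name: 'jnlp', image: 'registry.example.com/ci/jnlp-with-tools:latest')
]) {
    node(POD_LABEL) {
        // No container(...) block: the step runs in the jnlp container
        // over the remoting channel rather than through the /exec API.
        sh 'mvn -version'
    }
}
```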

Concatenate small and successive Durable Tasks steps

Every time a durable task step such as a sh / bat / powershell step is executed in a container that is not the jnlp container, calls are made to the Kubernetes API.

It is recommended to concatenate small and successive durable task steps into larger ones to avoid unnecessary calls, increase stability and improve scalability.

For example, each of the following sh steps requires several successive API calls to be made:

  container('notjnlp') {
      sh "echo 'this'"
      sh "echo 'that'"
      sh "grep 'this' that | jq ."
  }

As opposed to the concatenated version that requires a single call:

  [...]
  container('notjnlp') {
      sh """
        echo 'this'
        echo 'that'
        grep 'this' that | jq .
      """
  }

Properly manage agent pods resources

Ensure that Kubernetes nodes are not overloaded and stay healthy. If a node is stressed or overloaded, this can impair the communication between the Jenkins controller and the agent pod containers and increase the likelihood of sporadic server errors. In many cases, resource management is the problem. The workload executed by Kubernetes agent pods must be evaluated, and resource requests (and possibly limits) should be set.

It is recommended to correctly size at least the resource requests (such as memory and cpu) of the agent pod containers to avoid overloading Kubernetes nodes.

Refer to CloudBees CI on modern cloud platforms Best Practices for Reliability - Resources Management.
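As a sketch of what sizing resource requests can look like in a pipeline, the pod template below declares requests and a memory limit on a build container through inline pod YAML. The container name, image tag and the cpu / memory figures are example values chosen for illustration only, not recommendations; appropriate values depend on the workload actually observed.

```groovy
podTemplate(yaml: '''
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: maven
    image: maven:3.9-eclipse-temurin-17
    command: ['sleep']
    args: ['infinity']
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        memory: "1Gi"
''') {
    node(POD_LABEL) {
        container('maven') {
            // One larger sh step instead of several small ones.
            sh 'mvn -version'
        }
    }
}
```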

Check usage

Kubernetes Plugin 1.31.1 and later

To get a better idea of which requests are currently queued and running, the following Groovy script may be executed under Manage Jenkins > Script Console:

def allRunningCount = 0
def allQueuedCount = 0

/**
 * Method that dumps information of a specific k8s client.
 */
def dumpClientConsumer = { k8sPluginClient ->
  def k8sClient = k8sPluginClient.client
  def okHttpClient = k8sClient.httpClient
  def httpClient = okHttpClient.httpClient
  def dispatcher = httpClient.dispatcher()

  allRunningCount += dispatcher.runningCallsCount()
  allQueuedCount += dispatcher.queuedCallsCount()

  println "(${k8sClient})"
  println "* STATE "
  println "  * validity " + k8sPluginClient.validity
  def runningCalls = dispatcher.runningCalls()
  println "* RUNNING " + runningCalls.size()
  runningCalls.each { call ->
    println "  * " + call.request()
  }
  def queuedCalls = dispatcher.queuedCalls()
  println "* QUEUED " + queuedCalls.size()
  queuedCalls.each { call ->
    println "  * " + call.request()
  }
  println "* SETTINGS "
  println "  * Connect Timeout (ms): " + httpClient.connectTimeoutMillis()
  println "  * Read Timeout (ms): " + httpClient.readTimeoutMillis()
  println "  * Write Timeout (ms): " + httpClient.writeTimeoutMillis()
  println "  * Ping Interval (ms): " + httpClient.pingIntervalMillis()
  println "  * Retry on failure " + httpClient.retryOnConnectionFailure()
  println "  * Max Concurrent Requests: " + dispatcher.getMaxRequests()
  println "  * Max Concurrent Requests per Host: " + dispatcher.getMaxRequestsPerHost()
  def connectionPool = httpClient.connectionPool()
  println "* CONNECTION POOL "
  println "  * Active Connection " + connectionPool.connectionCount()
  println "  * Idle Connection " + connectionPool.idleConnectionCount()
  println ""
}

println "Active K8s Clients\n----------"
org.csanchez.jenkins.plugins.kubernetes.KubernetesClientProvider.clients.asMap().values().forEach(dumpClientConsumer)

println ""
println "K8s Clients Summary\n----------"
println "* ${org.csanchez.jenkins.plugins.kubernetes.KubernetesClientProvider.clients.asMap().size()} active clients"
println "* ${allRunningCount} running calls"
println "* ${allQueuedCount} queued calls"

return

Before Kubernetes Plugin 1.31.1

If using CloudBees CI, the Kube Agents Management plugin exposes the JMX metrics kubernetes-client.connections.running and kubernetes-client.connections.queued, which track the running and queued tasks of the Kubernetes clients. They can help monitor instances, evaluate requirements for concurrent requests and narrow down the root cause of related issues.

To get a better idea of which requests are currently queued and running, the following Groovy script may be executed under Manage Jenkins > Script Console:

def allRunningCount = 0
def allQueuedCount = 0

/**
 * Method that dumps information of a specific k8s client.
 */
def dumpClientConsumer = { client ->
  def k8sClient = client.client
  def httpClient = k8sClient.httpClient
  def dispatcher = httpClient.dispatcher()

  allRunningCount += dispatcher.runningCallsCount()
  allQueuedCount += dispatcher.queuedCallsCount()

  println "(${k8sClient})"
  println "* STATE "
  println "  * validity " + client.validity
  def runningCalls = dispatcher.runningCalls()
  println "* RUNNING " + runningCalls.size()
  runningCalls.each { call ->
    println "  * " + call.request()
  }
  def queuedCalls = dispatcher.queuedCalls()
  println "* QUEUED " + queuedCalls.size()
  queuedCalls.each { call ->
    println "  * " + call.request()
  }
  println "* SETTINGS "
  println "  * Connect Timeout (ms): " + httpClient.connectTimeoutMillis()
  println "  * Read Timeout (ms): " + httpClient.readTimeoutMillis()
  println "  * Write Timeout (ms): " + httpClient.writeTimeoutMillis()
  println "  * Ping Interval (ms): " + httpClient.pingIntervalMillis()
  println "  * Retry on failure " + httpClient.retryOnConnectionFailure()
  println "  * Max Concurrent Requests: " + dispatcher.getMaxRequests()
  println "  * Max Concurrent Requests per Host: " + dispatcher.getMaxRequestsPerHost()
  def connectionPool = httpClient.connectionPool()
  println "* CONNECTION POOL "
  println "  * Active Connection " + connectionPool.connectionCount()
  println "  * Idle Connection " + connectionPool.idleConnectionCount()
  println ""
}

println "Active K8s Clients\n----------"
org.csanchez.jenkins.plugins.kubernetes.KubernetesClientProvider.clients.asMap().values().forEach(dumpClientConsumer)

println ""
println "K8s Clients Summary\n----------"
println "* ${org.csanchez.jenkins.plugins.kubernetes.KubernetesClientProvider.clients.asMap().size()} active clients"
println "* ${org.csanchez.jenkins.plugins.kubernetes.KubernetesClientProvider.runningCallsCount} running calls (from plugin)"
println "* ${org.csanchez.jenkins.plugins.kubernetes.KubernetesClientProvider.queuedCallsCount} queued calls (from plugin)"
println "* ${allRunningCount} running calls"
println "* ${allQueuedCount} queued calls"

return