Operations Center Client leaks HTTP Clients since version 2.401.1.3

Issue

  • After upgrading a CloudBees CI Controller to version 2.401.1.3 or later, the controller shows a large number of HttpClient-N-SelectorManager / HttpClient-N-Worker-M threads, such as:

    "HttpClient-1234-SelectorManager" id=1234 (0xXXXX) state=RUNNABLE cpu=0% (running in native)
        at java.base@11.0.20/sun.nio.ch.EPoll.wait(Native Method)
        at java.base@11.0.20/sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:120)
        at java.base@11.0.20/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:124)
          - locked sun.nio.ch.EPollSelectorImpl@68c7fc70
        at java.base@11.0.20/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:136)
        at platform/java.net.http@11.0.20/jdk.internal.net.http.HttpClientImpl$SelectorManager.run(HttpClientImpl.java:867)

Explanation

Starting with CloudBees CI version 2.401.1.3, the Operations Center Client plugin communicates with the Operations Center through a new HTTP API. This API uses the Jenkins core high-level HTTP client API, which in turn relies on the native Java HTTP Client. The rate of requests from a Controller to the Operations Center varies with the environment and its activity. The main features that generate requests from a controller to the Operations Center include, but are not limited to, Operations Center Single Sign-on, Shared Agent / Shared Cloud Agent provisioning, and Credential lookups.

The design and usage of the Java HTTP Client API in JDK versions before 21 present some challenges. The HTTP Client is not "closeable"; instead, the Java HTTP Client internals take care of reclaiming the client resources. The client also requires specific usage to avoid leaks (such as fully consuming the response body).
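The usage constraints described above can be sketched as follows. This is only an illustration under stated assumptions, not the Operations Center Client plugin's code: the class name SharedHttpClientExample, its methods, and the timeout values are hypothetical. It reuses a single client for every call and reads each response body to the end even when only the status code is needed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper for illustration; not the Operations Center Client plugin's code.
final class SharedHttpClientExample {

    // One client for the whole JVM, created lazily on first use, instead of one per request.
    private static final class Holder {
        static final HttpClient CLIENT = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10)) // assumed value
                .build();
    }

    // Reads the stream to the end, closes it, and returns the number of bytes discarded.
    static long drain(InputStream body) {
        try (InputStream in = body) {
            return in.transferTo(OutputStream.nullOutputStream());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sends a GET and returns the status code, draining the body even though it is unused.
    static int getStatus(URI uri) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri)
                .timeout(Duration.ofSeconds(30)) // assumed value
                .GET()
                .build();
        HttpResponse<InputStream> response =
                Holder.CLIENT.send(request, HttpResponse.BodyHandlers.ofInputStream());
        drain(response.body());
        return response.statusCode();
    }
}
```

Because every Java HTTP Client instance owns its own SelectorManager thread, sharing one instance keeps the number of such threads constant regardless of the request rate.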

Chronology of Feature and Fixes

  • Introduction of the Operations Center HTTP transport API

  • Excessive HTTP client threads created: Prevent a Java HTTP Client leak by using a shared CachedThreadPool

  • Shared Agents and Operations Center credentials now use the HTTP transport by default

  • Fixed memory leak in Operations Center: Fixed incorrect usage of the Java HTTP Client that caused HTTP Clients to leak (the required usage was undocumented until JDK-8327991)

Remaining Issues

By version 2.440.2.1, most of the known issues have been addressed, but scalability problems remain that require Java 21 to be resolved.

In particular, the following behaviors are critical:

  • The Java HTTP Client has a default idle timeout of 20 minutes. In the context of the Operations Center Client feature, every time a call is made to Operations Center, a client and the connection it established are leaked for 20 minutes even if the requests have been completed (successfully or not).

  • Each Java HTTP Client uses a SelectorManager thread, which leaks until all references to the HTTP Client itself have been garbage collected.

This causes a pseudo thread leak: thread resources are not reclaimed immediately but after a non-deterministic amount of time, around the idle timeout. This in turn causes an open file leak and, eventually, general performance problems. The impact depends on the activity on the Controller. If the controller calls the Operations Center at a rate too high for the resources to be reclaimed in time, the thread count grows and the controller eventually suffers from critical performance issues.
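To judge whether a controller is affected, the SelectorManager threads can be counted over time. The sketch below is a hypothetical helper (the class and method names are not part of any CloudBees or JDK API). It assumes the thread naming seen in the thread dump above, HttpClient-N-SelectorManager, and it only sees threads of the JVM it runs in. The same counting logic can be applied to thread names extracted from a thread dump.

```java
import java.util.Collection;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical diagnostic helper for illustration only.
final class HttpClientThreadCounter {

    // Thread name pattern taken from the thread dump: HttpClient-<id>-SelectorManager
    private static final Pattern SELECTOR_MANAGER =
            Pattern.compile("HttpClient-\\d+-SelectorManager");

    // Counts how many of the given thread names belong to a SelectorManager thread.
    static long countSelectorManagers(Collection<String> threadNames) {
        return threadNames.stream()
                .filter(name -> SELECTOR_MANAGER.matcher(name).matches())
                .count();
    }

    // Counts SelectorManager threads that are currently alive in this JVM.
    static long countLiveSelectorManagers() {
        return countSelectorManagers(
                Thread.getAllStackTraces().keySet().stream()
                        .map(Thread::getName)
                        .collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        System.out.println("SelectorManager threads: " + countLiveSelectorManagers());
    }
}
```

A count that keeps growing between samples, rather than leveling off around the idle timeout, suggests that requests are issued faster than the clients can be reclaimed.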

System properties can be used to help the Java HTTP Client API to reclaim resources faster:

  • -Djdk.httpclient.keepalive.timeout=30: Reduces the default Keep-Alive timeout of the HTTP Client from 1200 to 30 seconds (as per the fix in JDK 21).

  • -Djdk.internal.httpclient.selectorTimeout=1000: Reduces the interval at which the SelectorManager checks the HTTP Client reference from 3 seconds to 1 second (the value is in milliseconds).

However, addressing the HttpClient timeouts is not enough in large and active environments.

System properties can be used to reduce the number of HTTP calls being made from Controller to Operations Center:

  • -Dcom.cloudbees.opscenter.client.plugin.OperationsCenterCredentialsProvider.remotingImplementation=true: switches the Operations Center Credentials provider back to remoting instead of the HTTP transport

  • -Dcom.cloudbees.opscenter.client.plugin.OperationsCenterRootAction.remotingSlaveManager=true: switches the Shared Agent / Shared Cloud Agent provisioner back to remoting instead of the HTTP transport

However, these properties are not supported for High Availability (active/active) controllers.

Resolution

Upgrade to CloudBees CI version 2.452.2.3 (or newer).

Workaround

These recommendations help mitigate the impact and slow down the thread leak, but they do not solve the problem.

To help mitigate these issues, it is recommended to upgrade to version 2.440.2.1 or later and add the following system properties to the Controller(s):

  • -Djdk.httpclient.keepalive.timeout=30

  • -Djdk.internal.httpclient.selectorTimeout=1000

  • -Dcom.cloudbees.opscenter.client.plugin.OperationsCenterCredentialsProvider.remotingImplementation=true - requires CloudBees CI 2.440.1.3 (or later) and not supported for High Availability (active/active) controllers

  • -Dcom.cloudbees.opscenter.client.plugin.OperationsCenterRootAction.remotingSlaveManager=true - requires CloudBees CI 2.440.1.3 (or later) and not supported for High Availability (active/active) controllers

See How to add Java arguments to Jenkins for how to do this.
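Once the arguments have been added and the controller restarted, it can be useful to confirm that the JVM actually received them. The following sketch is a hypothetical check (the class name WorkaroundFlagCheck is an assumption, not a CloudBees tool) that compares the JVM input arguments against the four properties listed above. It only reports on the JVM it runs in, and it matches flags exactly, so a property set to a different value is reported as missing. On High Availability (active/active) controllers, the two remoting properties are expected to be absent.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical verification helper for illustration only.
final class WorkaroundFlagCheck {

    // The system properties recommended in the Workaround section, as JVM arguments.
    static final List<String> RECOMMENDED = List.of(
            "-Djdk.httpclient.keepalive.timeout=30",
            "-Djdk.internal.httpclient.selectorTimeout=1000",
            "-Dcom.cloudbees.opscenter.client.plugin.OperationsCenterCredentialsProvider.remotingImplementation=true",
            "-Dcom.cloudbees.opscenter.client.plugin.OperationsCenterRootAction.remotingSlaveManager=true");

    // Returns the recommended arguments that do not appear verbatim in jvmArgs.
    static List<String> missing(List<String> jvmArgs) {
        List<String> result = new ArrayList<>(RECOMMENDED);
        result.removeAll(jvmArgs);
        return result;
    }

    public static void main(String[] args) {
        // Arguments passed to this JVM on startup (excluding arguments to main).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        List<String> missing = missing(jvmArgs);
        if (missing.isEmpty()) {
            System.out.println("All recommended arguments are set.");
        } else {
            missing.forEach(arg -> System.out.println("Missing: " + arg));
        }
    }
}
```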