Issue
- After upgrading a CloudBees CI controller past version 2.401.1.3, the controller shows a large number of HttpClient-*-SelectorManager and HttpClient-*-Worker-* threads, such as:

  "HttpClient-1234-SelectorManager" id=1234 (0xXXXX) state=RUNNABLE cpu=0%
      (running in native)
      at java.base@11.0.20/sun.nio.ch.EPoll.wait(Native Method)
      at java.base@11.0.20/sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:120)
      at java.base@11.0.20/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:124)
      - locked sun.nio.ch.EPollSelectorImpl@68c7fc70
      at java.base@11.0.20/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:136)
      at platform/java.net.http@11.0.20/jdk.internal.net.http.HttpClientImpl$SelectorManager.run(HttpClientImpl.java:867)
Environment
- CloudBees CI on modern cloud platforms - operations center >= 2.401.1.3
- CloudBees CI on modern cloud platforms - managed controller >= 2.401.1.3
- CloudBees CI on traditional platforms - operations center >= 2.401.1.3
- CloudBees CI on traditional platforms - client controller >= 2.401.1.3
- Java < 21
Related Issue(s)
- BEE-39352: Excessive HTTP client threads created
- JENKINS-72016: ProxyConfiguration.newHttpClientBuilder should maintain a thread pool
- BEE-45015: Switch shared agent provisioner & OC credentials provider to HTTP by default
- BEE-46835: Memory leak in OperationsCenterRootAction
- BEE-49448: Java HTTP API thread leak in OperationsCenter server actions
- JDK-8288746: HttpClient resources could be reclaimed more eagerly
- JDK-8297030: Reduce Default Keep-Alive Timeout Value for httpclient
- JDK-8308364: HttpClientImpl SelectorManager thread never dies if response body not retrieved
- JDK-8267140: Support closing the HttpClient by making it auto-closable
Explanation
Starting with CloudBees CI version 2.401.1.3, the Operations Center Client plugin uses a new HTTP API to communicate with the Operations Center. This API builds on the Jenkins core high-level HTTP client API, which in turn relies on the native Java HTTP Client. Note that the rate of requests from a controller to the Operations Center varies with the environment and its activity.
The main features that generate requests from controller to Operations Center include - but are not limited to - Operations Center Single Sign-on, Shared Agent / Shared Cloud Agent provisioning and Credential lookups.
The nature and usage of the Java HTTP Client API in JDK < 21 present some challenges. The HTTP Client is not "closeable", so the Java HTTP Client internals are responsible for reclaiming client resources. The client also requires specific usage patterns to avoid leaks, such as fully consuming the response body.
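The usage constraints described above can be illustrated with a minimal sketch. It assumes a single HTTP Client shared by all callers, backed by a cached thread pool, and it always drains the response body. The class name `SharedHttpClientSketch`, the method `fetchStatus`, and the timeout value are hypothetical and are not taken from the Operations Center Client plugin.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration only; not the plugin's actual implementation.
final class SharedHttpClientSketch {

    // One cached thread pool for the whole JVM (daemon threads, so it never blocks shutdown).
    private static final ExecutorService EXECUTOR = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "shared-http-client-worker");
        t.setDaemon(true);
        return t;
    });

    // One client for the whole JVM: on JDK < 21 a client cannot be closed,
    // so creating a new client per request leaves its resources to be reclaimed later.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .executor(EXECUTOR)
            .connectTimeout(Duration.ofSeconds(10)) // assumed value
            .build();

    static HttpClient client() {
        return CLIENT;
    }

    // Sends a GET request and fully consumes the body, even though only the status is needed.
    static int fetchStatus(URI uri) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<InputStream> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.ofInputStream());
        try (InputStream body = response.body()) {
            body.transferTo(OutputStream.nullOutputStream()); // drain the body to release the connection
        }
        return response.statusCode();
    }
}
```

Reusing one client means callers do not each create their own SelectorManager thread. This is a design choice for the sketch, not a statement about how the plugin is structured.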
Chronology of Feature and Fixes
- Introduction of the Operations Center HTTP transport API
- Excessive HTTP client threads created: Prevent a Java HTTP Client leak by using a shared CachedThreadPool
- Shared Agents and Operations Center credentials now use the HTTP transport by default
- Fixed memory leak in Operations Center: Fixed incorrect usage of the Java HTTP Client that caused HTTP Clients to leak (this usage was undocumented until JDK-8327991)
Remaining Issues
By version 2.440.2.1, most of the known issues have been addressed, but some scalability problems remain that can only be resolved with Java 21. In particular, the following behaviors are critical:
- The Java HTTP Client has a default idle timeout of 20 minutes. In the context of the Operations Center Client feature, every time a call is made to Operations Center, a client and the connection it established are leaked for 20 minutes, even if the request has completed (successfully or not).
- Each Java HTTP Client uses a SelectorManager thread pool that leaks a thread until all references to the HTTP Client itself have been garbage collected.
This causes a pseudo thread leak: thread resources are not reclaimed immediately but after a non-deterministic amount of time, roughly the idle timeout. This in turn causes an open file leak and eventually general performance problems. The impact depends on the activity on the controller. If the controller calls the Operations Center at a rate so high that resources cannot be reclaimed in time, the thread count grows and the controller eventually suffers from critical performance issues.
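To observe whether the thread count is growing, one option is to count the SelectorManager threads by name. The following is a minimal sketch, assuming the thread naming shown in the stack trace above (`HttpClient-<id>-SelectorManager`). The class and method names are hypothetical, and the code only sees threads of the JVM it runs in, so it would have to run inside the controller JVM or be fed thread names taken from a thread dump.

```java
import java.util.Collection;
import java.util.stream.Collectors;

// Hypothetical diagnostic helper; names are illustrative only.
final class HttpClientThreadCounter {

    // Matches names such as "HttpClient-1234-SelectorManager".
    static boolean isSelectorManager(String name) {
        return name.startsWith("HttpClient-") && name.endsWith("-SelectorManager");
    }

    // Counts matching names in any collection, for example names extracted from a thread dump.
    static long countSelectorManagers(Collection<String> threadNames) {
        return threadNames.stream()
                .filter(HttpClientThreadCounter::isSelectorManager)
                .count();
    }

    // Counts the SelectorManager threads that are alive in the current JVM.
    static long countInThisJvm() {
        return countSelectorManagers(
                Thread.getAllStackTraces().keySet().stream()
                        .map(Thread::getName)
                        .collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        System.out.println("HttpClient SelectorManager threads: " + countInThisJvm());
    }
}
```

Sampling this count at regular intervals gives a rough indication of whether clients are being reclaimed faster than they are created.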
System properties can be used to help the Java HTTP Client API to reclaim resources faster:
- -Djdk.httpclient.keepalive.timeout=30: Reduces the default Keep-Alive timeout of the HTTP Client from 1200 to 30 seconds (as per the fix in JDK 21).
- -Djdk.internal.httpclient.selectorTimeout=1000: Reduces the interval at which the SelectorManager checks the HTTP Client references from 3 seconds to 1 second.
However, tuning the HttpClient timeouts is not enough in large and active environments.
System properties can be used to reduce the number of HTTP calls being made from Controller to Operations Center:
- -Dcom.cloudbees.opscenter.client.plugin.OperationsCenterCredentialsProvider.remotingImplementation=true: Switches the Operations Center Credentials provider back to remoting instead of the HTTP transport.
- -Dcom.cloudbees.opscenter.client.plugin.OperationsCenterRootAction.remotingSlaveManager=true: Switches the Shared Agent / Shared Cloud Agent provisioner back to remoting instead of the HTTP transport.
However, these properties are not supported for High Availability (active/active) controllers.
Workaround
These recommendations help mitigate the impact and slow down the thread leak, but they do not solve the problem.
To help mitigate these issues, it is recommended to upgrade to version 2.440.2.1 or later and add the following system properties to the Controller(s):
- -Djdk.httpclient.keepalive.timeout=30
- -Djdk.internal.httpclient.selectorTimeout=1000
- -Dcom.cloudbees.opscenter.client.plugin.OperationsCenterCredentialsProvider.remotingImplementation=true (requires CloudBees CI 2.440.1.3 or later; not supported for High Availability (active/active) controllers)
- -Dcom.cloudbees.opscenter.client.plugin.OperationsCenterRootAction.remotingSlaveManager=true (requires CloudBees CI 2.440.1.3 or later; not supported for High Availability (active/active) controllers)
See How to add Java arguments to Jenkins for how to do this.
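Once the arguments are in place, it can be useful to confirm that the JVM actually received them. The following is a minimal sketch that reports the four system properties listed above. The class name, the `report` method, and the `<not set>` placeholder are hypothetical. It only reads the properties of the JVM it runs in, so to check a controller it would need to run inside the controller JVM.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Hypothetical verification helper; names are illustrative only.
final class HttpClientPropertyCheck {

    // The system properties recommended in the workaround.
    static final List<String> KEYS = List.of(
            "jdk.httpclient.keepalive.timeout",
            "jdk.internal.httpclient.selectorTimeout",
            "com.cloudbees.opscenter.client.plugin.OperationsCenterCredentialsProvider.remotingImplementation",
            "com.cloudbees.opscenter.client.plugin.OperationsCenterRootAction.remotingSlaveManager");

    // Returns each property with its value, or "<not set>" when it is absent.
    static Map<String, String> report(Properties props) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String key : KEYS) {
            result.put(key, props.getProperty(key, "<not set>"));
        }
        return result;
    }

    public static void main(String[] args) {
        report(System.getProperties())
                .forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

A property reported as `<not set>` suggests the argument was not passed to the JVM being inspected.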