Issue
I am experiencing issues with my CloudBees CI on traditional platforms High Availability (active/active) controllers not forming a cluster. In the logs of the replicas, there may be messages similar to:
[cloudbees-replication] [5.3.6] [IP]:5701 is added to the blacklist.
...
INFO c.h.internal.cluster.ClusterService: [IP]:5701 [cloudbees-replication] [5.3.6]
Members {size:1, ver:1} [
    Member [IP]:5701 - ID this
]
...
[OperationsCenter2 connection to HOSTNAME/IP:50000] Local headers refused by remote: The controller 0-client-controller is already connected
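To check quickly whether a replica log contains these symptoms, a small grep wrapper can help. This is a minimal sketch: `scan_replica_log` is a hypothetical helper name, and the log file location depends on your installation, so it is passed in as an argument. The search strings are taken from the sample messages above.

```shell
# Sketch only: scan_replica_log is a hypothetical helper, not part of CloudBees CI.
# Usage: scan_replica_log <path-to-replica-log>
scan_replica_log() {
  # -n prefixes each hit with its line number.
  # -F treats the patterns as fixed strings, so "{" and "." are matched literally.
  grep -n -F \
    -e 'is added to the blacklist' \
    -e 'Members {size:1,' \
    -e 'Local headers refused by remote' \
    "$1"
}
```

A member list of size 1 on every replica suggests that each replica only sees itself. The function returns a non-zero exit status when none of the messages is found.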
Resolution
If your CloudBees CI on traditional platforms controllers are not forming a cluster, the troubleshooting steps are:
- Review the installation steps in "Installation for CloudBees CI on traditional platforms (active/active)" to ensure that the required JVM arguments have been added.
- Log in to one of the replicas and check which IP and port are listed in JENKINS_HOME/cloudbees-replication-discovery/*. There should be one file for each replica.
  - If the IP and port of the other replica are what you expect, the issue is likely with the network configuration. To troubleshoot this, use nc -z (or telnet) to check whether the replicas can communicate with each other over the IP and port from JENKINS_HOME/cloudbees-replication-discovery/*:
      nc -z other-replica-ip other-replica-port
      # or
      # telnet other-replica-ip other-replica-port
    If the connection is successful, the expected output is:
      # expected output for nc -z
      Connection to other-replica-ip port other-replica-port succeeded!
      # expected output for telnet
      Trying other-replica-ip...
      Connected to other-replica-ip.
      Escape character is '^]'.
    Some nc builds print nothing unless -v is added; an exit status of 0 also means the connection succeeded.
  - If the IP and port of the other replica are not what you expect, Hazelcast likely chose a different network interface on the machine to bind to. You can configure Hazelcast to bind to a specific network interface by adding the following JVM arguments to each replica. Update IP_PATTERN below to match the IP pattern of the network interface you want to use. For example, setting IP_PATTERN to 10.3.10.* searches for a network interface on the machine with an IP between 10.3.10.0 and 10.3.10.255 and uses it:
      -Dhz.network.interfaces.enabled=true \
      -Dhz.network.interfaces.interfaces.interface1='IP_PATTERN' \
      -Dhz.network.port.port=5701 \
      -Dhz.network.port.autoincrement=true \
      -Dhazelcast.jmx=false \
      -Dhazelcast.metrics.jmx.enabled=false \
      -Dhazelcast.health.monitoring.delay.seconds=180 \
      -Dhazelcast.health.monitoring.threshold.memory.percentage=99 \
      -Dhazelcast.health.monitoring.threshold.cpu.percentage=99 \
    Be sure to set all of the above options on every replica (not only a subset), and restart all replicas after making the changes.
- If you are still encountering issues, add -Dhazelcast.diagnostics.enabled=true to your JVM startup options and add a custom logger for the class com.cloudbees.jenkins.plugins.replication.hazelcast.FilesystemDiscoveryStrategy with the level FINE to get more information about the discovery process.
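The two network checks from the steps above can be sketched as shell helpers. All function names here are hypothetical and are not part of CloudBees CI or Hazelcast. `matches_pattern` approximates an IP_PATTERN such as 10.3.10.* with shell globbing, so it handles only the `*` wildcard and may not match Hazelcast's own pattern handling in every case. `local_ips_matching` assumes a Linux host with the iproute2 `ip` command. `check_port` is an alternative for machines where neither nc nor telnet is installed; it assumes bash (with /dev/tcp support) and the coreutils `timeout` command are available.

```shell
# Sketch only: helper names are hypothetical, not part of CloudBees CI or Hazelcast.

# Succeeds if an IPv4 address matches a pattern such as '10.3.10.*'.
# Approximation: relies on shell globbing, so only '*' wildcards are handled.
matches_pattern() {
  local addr=$1 pattern=$2
  case "$addr" in
    $pattern) return 0 ;;   # unquoted on purpose so that '*' acts as a wildcard
    *) return 1 ;;
  esac
}

# Prints the local IPv4 addresses that match the pattern.
# Assumes Linux with iproute2; field 4 of "ip -o addr" is address/prefix.
local_ips_matching() {
  local pattern=$1 addr
  ip -4 -o addr show | awk '{print $4}' | cut -d/ -f1 |
    while read -r addr; do
      if matches_pattern "$addr" "$pattern"; then
        echo "$addr"
      fi
    done
}

# Reports whether a TCP connection to host:port opens within 5 seconds.
# Uses bash's /dev/tcp redirection instead of nc or telnet.
check_port() {
  local host=$1 port=$2
  if timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
    echo "$host:$port reachable"
  else
    echo "$host:$port unreachable"
    return 1
  fi
}
```

For example, `local_ips_matching '10.3.10.*'` should print exactly one address if the pattern selects a single interface on the replica, and `check_port other-replica-ip other-replica-port` can be run from each replica against the other.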