CloudBees CI on traditional platforms High Availability (active/active) not forming a cluster


Issue

I am experiencing issues with my CloudBees CI on traditional platforms High Availability (active/active) controllers not forming a cluster. In the logs of the replicas, there may be messages similar to:

[cloudbees-replication] [5.3.6] [IP]:5701 is added to the blacklist.
...
INFO c.h.internal.cluster.ClusterService: [IP]:5701 [cloudbees-replication] [5.3.6] Members {size:1, ver:1} [ Member [IP]:5701 - ID this ]
...
[OperationsCenter2 connection to HOSTNAME/IP:50000] Local headers refused by remote: The controller 0-client-controller is already connected

Resolution

If your CloudBees CI on traditional platforms controllers are not forming a cluster, the troubleshooting steps are:

  1. Review the installation steps in "Installation for CloudBees CI on traditional platforms (active/active)" to ensure that the required JVM arguments have been added.

  2. Log in to one of the replicas and check which IP and port are listed in the files under JENKINS_HOME/cloudbees-replication-discovery/*. There should be one file for each replica.

    1. If the IP and port of the other replica are what you expect, the issue is likely with the network configuration. To troubleshoot this, use nc -z (or telnet) to check whether the replicas can reach each other on the IP and port listed in JENKINS_HOME/cloudbees-replication-discovery/*:

      nc -z other-replica-ip other-replica-port
      # or
      # telnet other-replica-ip other-replica-port

      If the connection is successful, the expected output is:

      # expected output for nc -z
      Connection to other-replica-ip port other-replica-port succeeded!
      # expected output for telnet
      Trying other-replica-ip...
      Connected to other-replica-ip.
      Escape character is '^]'.
    2. If the IP and port of the other replica are not what you expect, Hazelcast likely chose a different network interface on the machine to bind to. You can configure Hazelcast to bind to a specific network interface by adding the following additional JVM arguments to each replica. Update IP_PATTERN below to match the IP pattern of the network interface you want to use. For example, setting IP_PATTERN to 10.3.10.* will search for a network interface on the machine with an IP between 10.3.10.0 and 10.3.10.255 and use it:

      -Dhz.network.interfaces.enabled=true \
      -Dhz.network.interfaces.interfaces.interface1='IP_PATTERN' \
      -Dhz.network.port.port=5701 \
      -Dhz.network.port.autoincrement=true \
      -Dhazelcast.jmx=false \
      -Dhazelcast.metrics.jmx.enabled=false \
      -Dhazelcast.health.monitoring.delay.seconds=180 \
      -Dhazelcast.health.monitoring.threshold.memory.percentage=99 \
      -Dhazelcast.health.monitoring.threshold.cpu.percentage=99 \
      Make sure to enable all of the above options on every replica (not only a subset), and restart all replicas after making the changes.
  3. If you are still encountering issues, add -Dhazelcast.diagnostics.enabled=true to your JVM startup options and add a custom logger for the class com.cloudbees.jenkins.plugins.replication.hazelcast.FilesystemDiscoveryStrategy at level FINE to get more information about the discovery process.
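
The connectivity check from step 2.1 can be repeated for several replicas in one go. The following is a minimal sketch and not part of the CloudBees tooling: check_replicas is a hypothetical helper name, the endpoints are passed by hand as IP:PORT arguments (taken from the files in JENKINS_HOME/cloudbees-replication-discovery/), and the 5-second timeout given with -w is an arbitrary choice. It relies on the exit status of nc -z rather than on its output, since some nc variants print nothing on success unless -v is also given.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a CloudBees tool): test TCP reachability of each
# replica endpoint given as an IP:PORT argument, using the exit status of nc -z.
check_replicas() {
  local failed=0 endpoint host port
  for endpoint in "$@"; do
    host="${endpoint%:*}"    # everything before the last colon
    port="${endpoint##*:}"   # everything after the last colon
    # -w 5 is an assumed 5-second timeout; adjust as needed.
    if nc -z -w 5 "$host" "$port" >/dev/null 2>&1; then
      echo "OK   ${host}:${port}"
    else
      echo "FAIL ${host}:${port}"
      failed=1
    fi
  done
  # Non-zero exit status if at least one endpoint was unreachable.
  return "$failed"
}

# Example usage (placeholders, replace with the values from the discovery files):
# check_replicas other-replica-ip:other-replica-port
```

Running the check from every replica toward every other replica helps, because a firewall rule may block traffic in only one direction.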
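
To preview which local addresses a wildcard IP_PATTERN from step 2.2 would select, a rough approximation is to treat the pattern as a shell glob. This is only a sketch of the wildcard case described above, not Hazelcast's actual matching logic; filter_ips is a hypothetical helper name, and it reads candidate addresses from standard input, one per line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the addresses from stdin (one per line) that match
# a wildcard pattern such as 10.3.10.*. The pattern is approximated with a
# shell glob; this is not Hazelcast's own implementation.
filter_ips() {
  local pattern="$1" ip
  while IFS= read -r ip; do
    case "$ip" in
      $pattern) echo "$ip" ;;   # unquoted on purpose so * acts as a wildcard
    esac
  done
}

# Example usage with hand-written addresses:
# printf '%s\n' 10.3.10.7 192.168.1.5 | filter_ips '10.3.10.*'
```

If exactly one address of the replica matches, the pattern is specific enough to single out the intended interface. If several match, a narrower pattern is probably needed.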

Tested product/plugin versions

  • CloudBees CI on traditional platforms - client controller - 2.440.1.4