High Availability (active/passive) Installations Troubleshooting

Article ID: 236100908

Symptoms

  • High Availability (active/passive) is not working as expected.

Diagnosis/Treatment

This article gives simple steps for troubleshooting High Availability (active/passive) installations. If these troubleshooting steps do not resolve your issue, follow the steps outlined in Required Data: HA (active/passive) issues to gather the most relevant information for the Support Team to resolve your issue as quickly as possible.

Troubleshooting

Versions

Ensure that:

  • All instances run the same Jenkins version

  • All instances run the same JDK version

operations center and controller nodes do not form a cluster

See the JGroups troubleshooting guide for typical problems. When nodes don’t form a cluster, it is normally either because the protocol needs additional configuration, or there’s a problem in the network configuration of the operating system or the network equipment (e.g., nodes cannot “see” one another via TCP).
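As a quick connectivity check, you can probe the peer's JGroups port from each node. This is a minimal sketch using bash's built-in /dev/tcp pseudo-device; the host name jenkins-failover and port 56736 below are placeholders for your environment:

```shell
# Probe a TCP port; prints "open" if a connection succeeds, "closed" otherwise.
check_tcp() {
  if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# Hypothetical peer host and JGroups port; adjust to your cluster.
check_tcp jenkins-failover 56736
```

If the probe prints closed in either direction, fix the firewall or network routing before digging further into the JGroups configuration itself.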

Ensure that all the instances are using the same Unix user:group

Ensure that the user ID used to run the Jenkins process is the same on all three servers: the NFS server, the Jenkins primary, and the Jenkins failover.

Run the following on all three instances and ensure that the returned ID is the same for the user. In this example jenkins:jenkins is used, which is the user:group created by default if you installed Jenkins via a Unix package.

id -u jenkins
id -g jenkins

If the ID is not the same on the three instances, the following Unix commands can be used to create the jenkins user and group (needed on the NFS instance) and to modify the IDs of the user and group.

sudo groupadd jenkins
sudo useradd -g jenkins jenkins
sudo usermod -u [jenkins UID] jenkins
sudo groupmod -g [jenkins GID] jenkins

Ensure that the owner of $JENKINS_HOME is the user that runs the Jenkins process

The user that runs the Jenkins process should be the owner of $JENKINS_HOME. Run ls -la $JENKINS_HOME to check this. If you need to change the owner, the following Unix command can be used.

sudo chown -R jenkins:jenkins $JENKINS_HOME

Customize JGroups in case your instances are running behind a firewall

By default, the CloudBees HA plugin uses a random port for communication. If the instances are running behind a firewall, you have two options for configuring fixed ports.

With the examples below, you need to open the following ports on the firewall: 56736, 56737, 56738, 56739, 56740, and 35483.
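As a sketch, the list of firewall openings can be derived from the configured values. The firewall-cmd invocations are only an example for firewalld; substitute your own firewall tooling:

```shell
BIND_PORT=56736
PORT_RANGE=5
DIAG_PORT=35483

# bind_port plus the following port_range-1 ports, plus the diagnostics port:
# 56736 56737 56738 56739 56740 and 35483
for p in $(seq "$BIND_PORT" $((BIND_PORT + PORT_RANGE - 1))) "$DIAG_PORT"; do
  echo "sudo firewall-cmd --permanent --add-port=${p}/tcp"
done
```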

Starting from CloudBees High Availability 4.8, it is possible to configure the ports via the UI. Open Manage Jenkins > Configure System and locate the High Availability Configuration section. Enable the customization, specify the ports, and restart both nodes of the HA cluster.

[Screenshot: High Availability configuration GUI]
  • NOTE: Configuring High Availability through the GUI is the preferred way of customizing JGroups if you only need to customize the JGroups ports used for communication between the members of the cluster. The GUI configuration is expected to keep working because the product takes care of the customization upon JGroups upgrades. On the other hand, customization through $JENKINS_HOME/jgroups.xml might require manual intervention to keep jgroups.xml up to date with the JGroups version running in the product.

Manual customization of the JGroups config for version 2.249.2.3 or higher

If you need to customize more than the JGroups ports, the customization cannot be performed through the GUI. It must be done by placing a snippet like the one below in $JENKINS_HOME/jgroups.xml.

The example below shows how to customize bind_port, port_range and diagnostics_port for version 2.249.2.3 or higher. The new JGroups configuration is only applied after a restart of all the instances that form the cluster. In this example, JGroups uses the ports 35483, 56736, 56737, 56738, 56739 and 56740.

<!-- HA setup; may be overridden with $JENKINS_HOME/jgroups.xml -->
<config xmlns="urn:org:jgroups"
        xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups https://www.jgroups.org/schema/jgroups-4.0.xsd">
    <TCP_NIO2
         recv_buf_size="${tcp.recv_buf_size:128K}"
         send_buf_size="${tcp.send_buf_size:128K}"
         max_bundle_size="64K"
         sock_conn_timeout="1000"

         bind_port="56736"
         port_range="5"
         diagnostics_port="35483"

         thread_pool.enabled="true"
         thread_pool.min_threads="1"
         thread_pool.max_threads="10"
         thread_pool.keep_alive_time="5000"/>

    <CENTRAL_LOCK />

    <com.cloudbees.jenkins.ha.singleton.CHMOD_FILE_PING
             location="${HA_JGROUPS_DIR}"
             remove_old_coords_on_view_change="true"/>
    <MERGE3 max_interval="30000"
            min_interval="10000"/>
    <FD_SOCK/>
    <FD timeout="3000" max_tries="3" />
    <VERIFY_SUSPECT timeout="1500"  />
    <BARRIER />
    <pbcast.NAKACK2 use_mcast_xmit="false"
                    discard_delivered_msgs="true"/>
    <UNICAST3 />
    <!--
      When a new node joins a cluster, initial message broadcast doesn't necessarily seem
      to arrive. Using a shorter cycles in the STABLE protocol makes the cluster recognize
      this dropped transmission and cause a retransmission.
    -->
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="4M"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000"
                view_bundling="true"
                max_join_attempts="5"/>
    <MFC max_credits="2M"
         min_threshold="0.4"/>
    <FRAG2 frag_size="60K"  />
    <pbcast.STATE_TRANSFER />
    <!-- pbcast.FLUSH  /-->
</config>
Manual customization of the JGroups config prior to 2.249.2.3

If you need to customize more than the JGroups ports, the customization cannot be performed through the GUI. It must be done by placing a snippet like the one below in $JENKINS_HOME/jgroups.xml.

The example below shows how to customize bind_port, port_range and diagnostics_port for versions of the product prior to 2.249.2.3. The new JGroups configuration is only applied after a restart of all the instances that form the cluster. In this example, JGroups uses the ports 35483, 56736, 56737, 56738, 56739 and 56740.

<!-- HA setup; may be overridden with $JENKINS_HOME/jgroups.xml -->
<config xmlns="urn:org:jgroups"
        xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups https://www.jgroups.org/schema/JGroups-3.1.xsd">
    <TCP_NIO2
         recv_buf_size="${tcp.recv_buf_size:128K}"
         send_buf_size="${tcp.send_buf_size:128K}"
         max_bundle_size="64K"
         max_bundle_timeout="30"
         sock_conn_timeout="1000"

         bind_port="56736"
         port_range="5"
         diagnostics_port="35483"

         timer_type="new"
         timer.min_threads="4"
         timer.max_threads="10"
         timer.keep_alive_time="3000"
         timer.queue_max_size="500"

         thread_pool.enabled="true"
         thread_pool.min_threads="1"
         thread_pool.max_threads="10"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="false"
         thread_pool.queue_max_size="100"
         thread_pool.rejection_policy="discard"

         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="1"
         oob_thread_pool.max_threads="8"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="100"
         oob_thread_pool.rejection_policy="discard"/>

    <CENTRAL_LOCK />

    <com.cloudbees.jenkins.ha.singleton.CHMOD_FILE_PING
             location="${HA_JGROUPS_DIR}"
             remove_old_coords_on_view_change="true"
             remove_all_files_on_view_change="true"/>
    <MERGE2 max_interval="30000"
            min_interval="10000"/>
    <FD_SOCK/>
    <FD timeout="3000" max_tries="3" />
    <VERIFY_SUSPECT timeout="1500"  />
    <BARRIER />
    <pbcast.NAKACK2 use_mcast_xmit="false"
                   discard_delivered_msgs="true"/>
    <UNICAST />
    <!--
      When a new node joins a cluster, initial message broadcast doesn't necessarily seem
      to arrive. Using a shorter cycles in the STABLE protocol makes the cluster recognize
      this dropped transmission and cause a retransmission.
    -->
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="4M"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000"
                view_bundling="true"
                max_join_attempts="5"/>
    <MFC max_credits="2M"
         min_threshold="0.4"/>
    <FRAG2 frag_size="60K"  />
    <pbcast.STATE_TRANSFER />
    <!-- pbcast.FLUSH  /-->
</config>

Bind JGroups with the right network interface

If either of the two instances that try to form a cluster has more than one network interface, you must tell JGroups which one should be used to listen for packets. The following two Java arguments are used for this purpose:

-Djgroups.bind_addr=<IP_ADDRESS>
-Djava.net.preferIPv4Stack=true

where <IP_ADDRESS> is the IP address of the network interface on the instance that should be able to reach the other node in the cluster.

NOTE: If you run into any issue with the CloudBees HA plugin, set these arguments to ensure JGroups is bound to the right interface.
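On packaged installations, these arguments are typically added to the Jenkins service configuration file rather than passed by hand. A hedged sketch — the file location varies by distribution (for example /etc/sysconfig/jenkins on RHEL derivatives or /etc/default/jenkins on Debian derivatives), and 10.0.0.5 is a placeholder for your interface address:

```shell
# 10.0.0.5 is a placeholder for the interface that can reach the peer node
JENKINS_JAVA_OPTIONS="$JENKINS_JAVA_OPTIONS -Djgroups.bind_addr=10.0.0.5 -Djava.net.preferIPv4Stack=true"
```

Restart the Jenkins service after the change so the JVM picks up the new arguments.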

Typical stacktraces under failure

If both nodes fail to form a cluster and each runs in singleton mode

There is a communication failure between the nodes of the JGroups cluster, so a trace similar to the following is shown on both nodes:

-------------------------------------------------------------------
GMS: address=stg-rbl-jnk-mas-w2a-a-43979, cluster=Jenkins, physical address=10.65.45.107:38926
-------------------------------------------------------------------
Oct 10, 2018 5:09:08 AM org.jgroups.protocols.pbcast.ClientGmsImpl joinInternal
WARNING: stg-rbl-jnk-mas-w2a-a-43979: JOIN(stg-rbl-jnk-mas-w2a-a-43979) sent to stg-rbl-jnk-mas-w2a-a-18558 timed out (after 3000 ms), on try 1
Oct 10, 2018 5:09:11 AM org.jgroups.protocols.pbcast.ClientGmsImpl joinInternal
WARNING: stg-rbl-jnk-mas-w2a-a-43979: JOIN(stg-rbl-jnk-mas-w2a-a-43979) sent to stg-rbl-jnk-mas-w2a-a-18558 timed out (after 3000 ms), on try 2
Oct 10, 2018 5:09:14 AM org.jgroups.protocols.pbcast.ClientGmsImpl joinInternal
WARNING: stg-rbl-jnk-mas-w2a-a-43979: JOIN(stg-rbl-jnk-mas-w2a-a-43979) sent to stg-rbl-jnk-mas-w2a-a-18558 timed out (after 3000 ms), on try 3
Oct 10, 2018 5:09:17 AM org.jgroups.protocols.pbcast.ClientGmsImpl joinInternal
WARNING: stg-rbl-jnk-mas-w2a-a-43979: JOIN(stg-rbl-jnk-mas-w2a-a-43979) sent to stg-rbl-jnk-mas-w2a-a-18558 timed out (after 3000 ms), on try 4
Oct 10, 2018 5:09:20 AM org.jgroups.protocols.pbcast.ClientGmsImpl joinInternal
WARNING: stg-rbl-jnk-mas-w2a-a-43979: JOIN(stg-rbl-jnk-mas-w2a-a-43979) sent to stg-rbl-jnk-mas-w2a-a-18558 timed out (after 3000 ms), on try 5
Oct 10, 2018 5:09:20 AM org.jgroups.protocols.pbcast.ClientGmsImpl joinInternal
WARNING: stg-rbl-jnk-mas-w2a-a-43979: too many JOIN attempts (5): becoming singleton
Oct 10, 2018 5:09:20 AM com.cloudbees.jenkins.ha.singleton.HASingleton$3 viewAccepted
INFO: Cluster membership has changed to: [stg-rbl-jnk-mas-w2a-a-43979|0] (1) [stg-rbl-jnk-mas-w2a-a-43979]
Oct 10, 2018 5:09:20 AM com.cloudbees.jenkins.ha.singleton.HASingleton$3 viewAccepted
INFO: New primary node is JenkinsClusterMemberIdentity[member=stg-rbl-jnk-mas-w2a-a-43979,weight=0,min=0]
Oct 10, 2018 5:09:20 AM com.cloudbees.jenkins.ha.singleton.HASingleton reactToPrimarySwitch
INFO: Elected as the primary node

Action: Delete all the content inside $JENKINS_HOME/jgroups to reset the JGroups cluster. The content will be regenerated after the next restart of the active and passive nodes.
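A sketch of that reset; stop Jenkins on both nodes first. The default $JENKINS_HOME of a Unix package installation is assumed here:

```shell
# Stop Jenkins on both nodes before deleting the cluster state.
JENKINS_HOME=${JENKINS_HOME:-/var/lib/jenkins}   # adjust if your home differs
rm -rf "${JENKINS_HOME:?}/jgroups/"*             # :? aborts if the variable is empty
```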

Typical stack trace when CloudBees CI fails to start because of a non-updated $JENKINS_HOME/jgroups.xml file

The following stack trace appears when your instances were updated to 2.249.2.3 or higher, but your $JENKINS_HOME/jgroups.xml was not migrated to the new JGroups format. Since 2.249.2.3, JGroups has been updated and the previous syntax might not work. The KB Upgrade guide for instances running High Availability previous to 2.249.2.3 explains how to address this issue.

2020-10-14 10:30:14.066+0000 [id=1] SEVERE  c.c.jenkins.ha.HASwitcher#reportFallback: CloudBees CI Operations Center appears to have failed to boot. If this is a problem in the HA feature, you can disable HA by specifying JENKINS_HA=false as environment variable
java.lang.IllegalArgumentException: JGRP000001: configuration error: the following properties in TCP_NIO2 are not recognized: {oob_thread_pool.enabled=true, timer.keep_alive_time=3000, thread_pool.queue_enabled=false, thread_pool.queue_max_size=100, oob_thread_pool.queue_max_size=100, oob_thread_pool.keep_alive_time=5000, oob_thread_pool.min_threads=1, oob_thread_pool.queue_enabled=false, oob_thread_pool.max_threads=8, oob_thread_pool.rejection_policy=discard, thread_pool.rejection_policy=discard, timer.queue_max_size=500, timer.min_threads=4, max_bundle_timeout=30, timer.max_threads=10, timer_type=new}
        at org.jgroups.stack.Configurator.createLayer(Configurator.java:278)
        at org.jgroups.stack.Configurator.createProtocols(Configurator.java:215)
        at org.jgroups.stack.Configurator.setupProtocolStack(Configurator.java:82)
        at org.jgroups.stack.Configurator.setupProtocolStack(Configurator.java:49)
        at org.jgroups.stack.ProtocolStack.setup(ProtocolStack.java:475)
        at org.jgroups.JChannel.init(JChannel.java:965)
        at org.jgroups.JChannel.<init>(JChannel.java:148)
        at org.jgroups.JChannel.<init>(JChannel.java:106)
        at com.cloudbees.jenkins.ha.AbstractJenkinsSingleton.createChannel(AbstractJenkinsSingleton.java:143)
        at com.cloudbees.jenkins.ha.singleton.HASingleton.start(HASingleton.java:86)
Caused: java.lang.Error: Failed to form a cluster
	at com.cloudbees.jenkins.ha.singleton.HASingleton.start(HASingleton.java:179)

Typical stack trace when CloudBees CI fails to start because JGroups was customized through the GUI while running 2.249.2.3

The following stack trace appears when your instances were updated to 2.249.2.3 and the JGroups customization was performed through the GUI. The issue can be fixed either by upgrading the product to 2.249.2.4 or by placing the following snippet under $JENKINS_HOME/jgroups.xml.

2020-10-12 21:25:06.280+0000 [id=1] SEVERE winstone.Logger#logInternal: Container startup failed
java.lang.IllegalArgumentException: JGRP000001: configuration error: the following properties in com.cloudbees.jenkins.ha.singleton.CHMOD_FILE_PING are not recognized: {remove_old_files_on_view_change=true}
	at org.jgroups.stack.Configurator.createLayer(Configurator.java:278)
	at org.jgroups.stack.Configurator.createProtocols(Configurator.java:215)
	at org.jgroups.stack.Configurator.setupProtocolStack(Configurator.java:82)
	at org.jgroups.stack.Configurator.setupProtocolStack(Configurator.java:49)
	at org.jgroups.stack.ProtocolStack.setup(ProtocolStack.java:475)
	at org.jgroups.JChannel.init(JChannel.java:965)
	at org.jgroups.JChannel.<init>(JChannel.java:148)
	at org.jgroups.JChannel.<init>(JChannel.java:122)
	at com.cloudbees.jenkins.ha.AbstractJenkinsSingleton.createChannel(AbstractJenkinsSingleton.java:176)
	at com.cloudbees.jenkins.ha.singleton.HASingleton.start(HASingleton.java:86)
Caused: java.lang.Error: Failed to form a cluster
	at com.cloudbees.jenkins.ha.singleton.HASingleton.start(HASingleton.java:179)
<!-- HA setup; may be overridden with $JENKINS_HOME/jgroups.xml -->
<config xmlns="urn:org:jgroups"
        xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups https://www.jgroups.org/schema/jgroups-4.0.xsd">
    <TCP_NIO2
         recv_buf_size="${tcp.recv_buf_size:128K}"
         send_buf_size="${tcp.send_buf_size:128K}"
         max_bundle_size="64K"
         sock_conn_timeout="1000"

         bind_port="${HA_BIND_PORT}"
         port_range="${HA_PORT_RANGE}"
         diagnostics_port="${HA_DIAGNOSTIC_PORT}"

         thread_pool.enabled="true"
         thread_pool.min_threads="1"
         thread_pool.max_threads="10"
         thread_pool.keep_alive_time="5000"/>

    <CENTRAL_LOCK />

    <com.cloudbees.jenkins.ha.singleton.CHMOD_FILE_PING
             location="${HA_JGROUPS_DIR}"
             remove_old_coords_on_view_change="true"/>
    <MERGE3 max_interval="30000"
            min_interval="10000"/>
    <FD_SOCK/>
    <FD timeout="3000" max_tries="3" />
    <VERIFY_SUSPECT timeout="1500"  />
    <BARRIER />
    <pbcast.NAKACK2 use_mcast_xmit="false"
                    discard_delivered_msgs="true"/>
    <UNICAST3 />
    <!--
      When a new node joins a cluster, initial message broadcast doesn't necessarily seem
      to arrive. Using a shorter cycles in the STABLE protocol makes the cluster recognize
      this dropped transmission and cause a retransmission.
    -->
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="4M"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000"
                view_bundling="true"
                max_join_attempts="5"/>
    <MFC max_credits="2M"
         min_threshold="0.4"/>
    <FRAG2 frag_size="60K"  />
    <pbcast.STATE_TRANSFER />
    <!-- pbcast.FLUSH  /-->
</config>

The KB Upgrade guide for instances running High Availability previous to 2.249.2.3 can give you more details about the source of the issue and how to address it.

If the promotion works in the troubleshooter but not on the instances

If the promotion works in the troubleshooter but not on the instances, the following experiment is recommended:

  1. Ensure that the latest version of the cloudbees-ha plugin is installed on both instances.

  2. Stop the Jenkins service on both instances.

  3. Add the Java argument -Dcom.cloudbees.jenkins.ha.level=ALL on both instances so that all the JGroups and HA logs are exposed.

  4. Start one instance

  5. Wait 10 seconds or so

  6. Start the second instance

  7. Check the full logs of the Jenkins instances, which you can usually find under /var/log/jenkins.
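The log check in the last step can be sketched as follows. The log path and the search patterns are typical for a Unix package installation, not guaranteed for every setup:

```shell
LOG=${JENKINS_LOG:-/var/log/jenkins/jenkins.log}
# Pull the JGroups/HA related entries from the end of the log
grep -E 'jgroups|HASingleton|Cluster membership' "$LOG" 2>/dev/null | tail -n 50 || true
```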

After the experiment, remove the -Dcom.cloudbees.jenkins.ha.level=ALL argument you just added on both instances.