Diagnosis/Treatment
This article gives simple steps for troubleshooting High Availability (active/passive) installations. If these troubleshooting steps do not resolve your issue, follow the steps outlined in Required Data: HA (active/passive) issues to gather the most relevant information for the Support Team to resolve your issue as quickly as possible.
Troubleshooting
Versions
Ensure that:
- All instances run the same Jenkins version.
- All instances run the same JDK version.
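As a minimal sketch of this check, the helper below compares version strings collected from each instance. The function name all_equal and the host names in the comments are hypothetical; how you reach each instance (ssh here) is an assumption.

```shell
# Succeeds only when every argument is the same string.
all_equal() {
  first=$1
  for value in "$@"; do
    [ "$value" = "$first" ] || return 1
  done
  return 0
}

# Example usage (hypothetical host names; java -version prints to stderr):
#   v1=$(ssh primary-host 'java -version' 2>&1 | head -n 1)
#   v2=$(ssh failover-host 'java -version' 2>&1 | head -n 1)
#   all_equal "$v1" "$v2" || echo "JDK versions differ: $v1 / $v2"
```

The same helper can be fed the Jenkins version read from each instance.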
Operations center and controller nodes do not form a cluster
See the JGroups troubleshooting guide for typical problems. When nodes don’t form a cluster, it is normally either because the protocol needs additional configuration, or there’s a problem in the network configuration of the operating system or the network equipment (e.g., nodes cannot “see” one another via TCP).
Ensure that all the instances are using the same Unix user:group
Ensure that the user ID used to run the Jenkins process is the same on all three servers: the NFS server, the Jenkins primary, and the Jenkins failover.
Run the following command on each of the three instances and check that it returns the same ID for the user. This example uses jenkins:jenkins, which is the user:group created by default when Jenkins is installed from a Unix package.
id -u jenkins
If the ID is not the same on the three instances, the following Unix commands can be used to create the jenkins:jenkins user and group (needed on the NFS instance) and to modify the IDs of the user and group.
sudo useradd jenkins
sudo usermod -u [jenkins UID] jenkins
sudo groupadd jenkins
sudo groupmod -g [jenkins GID] jenkins
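To compare both the user ID and the group ID in one step, a small helper along these lines may help. The function name account_ids is hypothetical; it only wraps the standard id command.

```shell
# Prints "uid:gid" (numeric user ID and primary group ID) for a local account.
account_ids() {
  printf '%s:%s\n' "$(id -u "$1")" "$(id -g "$1")"
}

# Example usage, run on each of the three servers and compared by eye:
#   account_ids jenkins
```

If the three servers print the same pair, the user and group IDs are aligned.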
Ensure that the owner of $JENKINS_HOME is the user that runs the Jenkins process
The user that owns the Jenkins process should be the owner of $JENKINS_HOME. Run ls -la $JENKINS_HOME to check this. If you need to change the owner, the following Unix command can be used.
sudo chown -R jenkins:jenkins $JENKINS_HOME
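Because ls -la only shows the top level, a recursive check can be sketched with find. The function name not_owned_by is hypothetical, and the limit of 10 entries is an arbitrary choice to keep the output short.

```shell
# Lists up to 10 entries under a directory that are NOT owned by the given user.
# Empty output means everything under the directory belongs to that user.
not_owned_by() {
  dir=$1
  owner=$2
  find "$dir" ! -user "$owner" -print | head -n 10
}

# Example usage:
#   not_owned_by "$JENKINS_HOME" jenkins
```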
Customize JGroups in case your instances are running behind a firewall
By default, the CloudBees HA plugin uses random ports for communication. If the instances are running behind a firewall, you have two options for configuring fixed ports.
The examples below mean that you need to open the following ports on the firewall: 56736, 56737, 56738, 56739, 56740 and 35483.
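The port list above can be derived from the three settings used in the examples. The sketch below assumes, as the article's example does, that port_range="5" covers five consecutive ports starting at bind_port; the function name jgroups_ports is hypothetical.

```shell
# Prints one port per line: the consecutive range starting at bind_port,
# followed by the diagnostics port.
jgroups_ports() {
  bind_port=$1
  port_range=$2
  diagnostics_port=$3
  i=0
  while [ "$i" -lt "$port_range" ]; do
    printf '%s\n' "$((bind_port + i))"
    i=$((i + 1))
  done
  printf '%s\n' "$diagnostics_port"
}

# Example usage with the values from this article:
#   jgroups_ports 56736 5 35483
```

The resulting list can then be handed to whatever tool manages your firewall rules.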
High Availability Configuration GUI [Recommended way for JGroups customization]
Starting with CloudBees High Availability 4.8, it is possible to configure the ports via the UI. Open Manage Jenkins > Configure System and locate the High Availability Configuration section. Enable the customization, specify the ports, and restart both nodes in the HA singleton.
NOTE: High Availability configuration through the GUI is the preferred way of customizing JGroups if you only need to customize the JGroups ports used for communication between the instances that are members of the cluster. The reason is that HA configuration through the GUI is always expected to work, because the product takes care of the customization across JGroups upgrades. Customization through $JENKINS_HOME/jgroups.xml, on the other hand, might require manual intervention to keep jgroups.xml up to date with the JGroups version running in the product.
Manual customization of the JGroups config for version 2.249.2.3 or higher
If you need to customize more than the JGroups ports, the customization cannot be performed through the GUI. Instead, place a snippet like the one below in $JENKINS_HOME/jgroups.xml.
The example below shows how to customize bind_port, port_range and diagnostics_port for version 2.249.2.3 or higher. The new JGroups configuration is only applied after a restart of all the instances that form the cluster. In the example below, JGroups will use the ports 35483, 56736, 56737, 56738, 56739 and 56740.
bind_port="56736" port_range="5" diagnostics_port="35483"
<!-- HA setup; may be overridden with $JENKINS_HOME/jgroups.xml -->
<config xmlns="urn:org:jgroups"
        xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups https://www.jgroups.org/schema/jgroups-4.0.xsd">
    <TCP_NIO2
        recv_buf_size="${tcp.recv_buf_size:128K}"
        send_buf_size="${tcp.send_buf_size:128K}"
        max_bundle_size="64K"
        sock_conn_timeout="1000"
        bind_port="56736"
        port_range="5"
        diagnostics_port="35483"
        thread_pool.enabled="true"
        thread_pool.min_threads="1"
        thread_pool.max_threads="10"
        thread_pool.keep_alive_time="5000"/>
    <CENTRAL_LOCK />
    <com.cloudbees.jenkins.ha.singleton.CHMOD_FILE_PING location="${HA_JGROUPS_DIR}" remove_old_coords_on_view_change="true"/>
    <MERGE3 max_interval="30000" min_interval="10000"/>
    <FD_SOCK/>
    <FD timeout="3000" max_tries="3" />
    <VERIFY_SUSPECT timeout="1500" />
    <BARRIER />
    <pbcast.NAKACK2 use_mcast_xmit="false" discard_delivered_msgs="true"/>
    <UNICAST3 />
    <!--
      When a new node joins a cluster, initial message broadcast doesn't necessarily seem
      to arrive. Using a shorter cycles in the STABLE protocol makes the cluster recognize
      this dropped transmission and cause a retransmission.
    -->
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="4M"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000" view_bundling="true" max_join_attempts="5"/>
    <MFC max_credits="2M" min_threshold="0.4"/>
    <FRAG2 frag_size="60K" />
    <pbcast.STATE_TRANSFER />
    <!-- pbcast.FLUSH /-->
</config>
Manual customization of the JGroups config for versions prior to 2.249.2.3
If you need to customize more than the JGroups ports, the customization cannot be performed through the GUI. Instead, place a snippet like the one below in $JENKINS_HOME/jgroups.xml.
The example below shows how to customize bind_port, port_range and diagnostics_port for versions of the product prior to 2.249.2.3. The new JGroups configuration is only applied after a restart of all the instances that form the cluster. In the example below, JGroups will use the ports 35483, 56736, 56737, 56738, 56739 and 56740.
bind_port="56736" port_range="5" diagnostics_port="35483"
<!-- HA setup; may be overridden with $JENKINS_HOME/jgroups.xml -->
<config xmlns="urn:org:jgroups"
        xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups https://www.jgroups.org/schema/JGroups-3.1.xsd">
    <TCP_NIO2
        recv_buf_size="${tcp.recv_buf_size:128K}"
        send_buf_size="${tcp.send_buf_size:128K}"
        max_bundle_size="64K"
        max_bundle_timeout="30"
        sock_conn_timeout="1000"
        bind_port="56736"
        port_range="5"
        diagnostics_port="35483"
        timer_type="new"
        timer.min_threads="4"
        timer.max_threads="10"
        timer.keep_alive_time="3000"
        timer.queue_max_size="500"
        thread_pool.enabled="true"
        thread_pool.min_threads="1"
        thread_pool.max_threads="10"
        thread_pool.keep_alive_time="5000"
        thread_pool.queue_enabled="false"
        thread_pool.queue_max_size="100"
        thread_pool.rejection_policy="discard"
        oob_thread_pool.enabled="true"
        oob_thread_pool.min_threads="1"
        oob_thread_pool.max_threads="8"
        oob_thread_pool.keep_alive_time="5000"
        oob_thread_pool.queue_enabled="false"
        oob_thread_pool.queue_max_size="100"
        oob_thread_pool.rejection_policy="discard"/>
    <CENTRAL_LOCK />
    <com.cloudbees.jenkins.ha.singleton.CHMOD_FILE_PING location="${HA_JGROUPS_DIR}" remove_old_coords_on_view_change="true" remove_all_files_on_view_change="true"/>
    <MERGE2 max_interval="30000" min_interval="10000"/>
    <FD_SOCK/>
    <FD timeout="3000" max_tries="3" />
    <VERIFY_SUSPECT timeout="1500" />
    <BARRIER />
    <pbcast.NAKACK2 use_mcast_xmit="false" discard_delivered_msgs="true"/>
    <UNICAST />
    <!--
      When a new node joins a cluster, initial message broadcast doesn't necessarily seem
      to arrive. Using a shorter cycles in the STABLE protocol makes the cluster recognize
      this dropped transmission and cause a retransmission.
    -->
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="4M"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000" view_bundling="true" max_join_attempts="5"/>
    <MFC max_credits="2M" min_threshold="0.4"/>
    <FRAG2 frag_size="60K" />
    <pbcast.STATE_TRANSFER />
    <!-- pbcast.FLUSH /-->
</config>
Bind JGroups with the right network interface
If either of the two instances that try to form a cluster has more than one network interface, you must tell JGroups which one to use to listen for packets. The following two Java arguments are used for this purpose:
-Djgroups.bind_addr=<IP_ADDRESS> -Djava.net.preferIPv4Stack=true
where <IP_ADDRESS> is the IP address of the network interface on the instance that is able to reach the other node in the cluster.
NOTE: In case of any issue with the CloudBees HA plugin, these arguments should be set to ensure JGroups is bound to the right interface.
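As a small sketch, the two arguments can be generated from the chosen address so that the same string is applied on each node. The function name jgroups_bind_args is hypothetical, and the character check is only a loose guard against pasting a host name or an IPv6 address, not a full IPv4 validation.

```shell
# Prints the two JVM arguments for a given IPv4 address.
# Rejects input that contains anything other than digits and dots.
jgroups_bind_args() {
  ip=$1
  case "$ip" in
    "" | *[!0-9.]*)
      echo "not an IPv4 address: $ip" >&2
      return 1
      ;;
  esac
  printf '%s\n' "-Djgroups.bind_addr=$ip -Djava.net.preferIPv4Stack=true"
}

# Example usage (address taken from the sample log in this article):
#   jgroups_bind_args 10.65.45.107
```

Where the resulting arguments are added depends on how the Jenkins JVM options are configured in your installation.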
Typical stacktraces under failure
If both nodes are running in singleton mode
There is a miscommunication between the nodes of the JGroups cluster, so a trace similar to the following is shown on both nodes:
-------------------------------------------------------------------
GMS: address=stg-rbl-jnk-mas-w2a-a-43979, cluster=Jenkins, physical address=10.65.45.107:38926
-------------------------------------------------------------------
Oct 10, 2018 5:09:08 AM org.jgroups.protocols.pbcast.ClientGmsImpl joinInternal
WARNING: stg-rbl-jnk-mas-w2a-a-43979: JOIN(stg-rbl-jnk-mas-w2a-a-43979) sent to stg-rbl-jnk-mas-w2a-a-18558 timed out (after 3000 ms), on try 1
Oct 10, 2018 5:09:11 AM org.jgroups.protocols.pbcast.ClientGmsImpl joinInternal
WARNING: stg-rbl-jnk-mas-w2a-a-43979: JOIN(stg-rbl-jnk-mas-w2a-a-43979) sent to stg-rbl-jnk-mas-w2a-a-18558 timed out (after 3000 ms), on try 2
Oct 10, 2018 5:09:14 AM org.jgroups.protocols.pbcast.ClientGmsImpl joinInternal
WARNING: stg-rbl-jnk-mas-w2a-a-43979: JOIN(stg-rbl-jnk-mas-w2a-a-43979) sent to stg-rbl-jnk-mas-w2a-a-18558 timed out (after 3000 ms), on try 3
Oct 10, 2018 5:09:17 AM org.jgroups.protocols.pbcast.ClientGmsImpl joinInternal
WARNING: stg-rbl-jnk-mas-w2a-a-43979: JOIN(stg-rbl-jnk-mas-w2a-a-43979) sent to stg-rbl-jnk-mas-w2a-a-18558 timed out (after 3000 ms), on try 4
Oct 10, 2018 5:09:20 AM org.jgroups.protocols.pbcast.ClientGmsImpl joinInternal
WARNING: stg-rbl-jnk-mas-w2a-a-43979: JOIN(stg-rbl-jnk-mas-w2a-a-43979) sent to stg-rbl-jnk-mas-w2a-a-18558 timed out (after 3000 ms), on try 5
Oct 10, 2018 5:09:20 AM org.jgroups.protocols.pbcast.ClientGmsImpl joinInternal
WARNING: stg-rbl-jnk-mas-w2a-a-43979: too many JOIN attempts (5): becoming singleton
Oct 10, 2018 5:09:20 AM com.cloudbees.jenkins.ha.singleton.HASingleton$3 viewAccepted
INFO: Cluster membership has changed to: [stg-rbl-jnk-mas-w2a-a-43979|0] (1) [stg-rbl-jnk-mas-w2a-a-43979]
Oct 10, 2018 5:09:20 AM com.cloudbees.jenkins.ha.singleton.HASingleton$3 viewAccepted
INFO: New primary node is JenkinsClusterMemberIdentity[member=stg-rbl-jnk-mas-w2a-a-43979,weight=0,min=0]
Oct 10, 2018 5:09:20 AM com.cloudbees.jenkins.ha.singleton.HASingleton reactToPrimarySwitch
INFO: Elected as the primary node
Action: Delete all the content inside $JENKINS_HOME/jgroups to reset the JGroups cluster. The content will be regenerated after the next restart of the active and passive nodes.
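The deletion can be sketched as a guarded helper that empties the directory while leaving the directory itself in place. The function name reset_jgroups_dir is hypothetical; the -mindepth and -delete options are assumed to be available, as they are in GNU and BSD find.

```shell
# Removes everything inside the given directory but keeps the directory itself.
# Refuses to run if the path is not an existing directory.
reset_jgroups_dir() {
  jgroups_dir=$1
  if [ ! -d "$jgroups_dir" ]; then
    echo "no such directory: $jgroups_dir" >&2
    return 1
  fi
  find "$jgroups_dir" -mindepth 1 -delete
}

# Example usage:
#   reset_jgroups_dir "$JENKINS_HOME/jgroups"
```

The directory check also protects against an unset $JENKINS_HOME expanding to an unexpected path.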
Typical stack trace when CloudBees CI fails to start because of a non-updated $JENKINS_HOME/jgroups.xml file
The following stack trace occurs when your instances were updated to 2.249.2.3 or higher, but your $JENKINS_HOME/jgroups.xml was not migrated to the new JGroups format. JGroups was updated in 2.249.2.3, and the previous syntax might no longer work. The KB Upgrade guide for instances running High Availability previous to 2.249.2.3 explains how to address this issue.
2020-10-14 10:30:14.066+0000 [id=1] SEVERE c.c.jenkins.ha.HASwitcher#reportFallback: CloudBees CI Operations Center appears to have failed to boot. If this is a problem in the HA feature, you can disable HA by specifying JENKINS_HA=false as environment variable
java.lang.IllegalArgumentException: JGRP000001: configuration error: the following properties in TCP_NIO2 are not recognized: {oob_thread_pool.enabled=true, timer.keep_alive_time=3000, thread_pool.queue_enabled=false, thread_pool.queue_max_size=100, oob_thread_pool.queue_max_size=100, oob_thread_pool.keep_alive_time=5000, oob_thread_pool.min_threads=1, oob_thread_pool.queue_enabled=false, oob_thread_pool.max_threads=8, oob_thread_pool.rejection_policy=discard, thread_pool.rejection_policy=discard, timer.queue_max_size=500, timer.min_threads=4, max_bundle_timeout=30, timer.max_threads=10, timer_type=new}
    at org.jgroups.stack.Configurator.createLayer(Configurator.java:278)
    at org.jgroups.stack.Configurator.createProtocols(Configurator.java:215)
    at org.jgroups.stack.Configurator.setupProtocolStack(Configurator.java:82)
    at org.jgroups.stack.Configurator.setupProtocolStack(Configurator.java:49)
    at org.jgroups.stack.ProtocolStack.setup(ProtocolStack.java:475)
    at org.jgroups.JChannel.init(JChannel.java:965)
    at org.jgroups.JChannel.<init>(JChannel.java:148)
    at org.jgroups.JChannel.<init>(JChannel.java:106)
    at com.cloudbees.jenkins.ha.AbstractJenkinsSingleton.createChannel(AbstractJenkinsSingleton.java:143)
    at com.cloudbees.jenkins.ha.singleton.HASingleton.start(HASingleton.java:86)
Caused: java.lang.Error: Failed to form a cluster
    at com.cloudbees.jenkins.ha.singleton.HASingleton.start(HASingleton.java:179)
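Before upgrading, a custom jgroups.xml can be scanned for the TCP_NIO2 properties that the trace above reports as unrecognized. This is only a sketch: the function name legacy_jgroups_settings is hypothetical, and the pattern covers just the property names listed in that trace, not every possible incompatibility.

```shell
# Prints (with line numbers) the lines of a jgroups.xml that still use
# properties reported as unrecognized after the JGroups update.
# Exit status is 0 when at least one such line is found, 1 when none is found.
legacy_jgroups_settings() {
  grep -n -E 'oob_thread_pool\.|timer\.|timer_type|max_bundle_timeout|thread_pool\.queue_|thread_pool\.rejection_policy' "$1"
}

# Example usage:
#   legacy_jgroups_settings "$JENKINS_HOME/jgroups.xml" && echo "migration needed"
```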
Typical stack trace when CloudBees CI fails to start because JGroups customized through GUI + running 2.249.2.3
The following stack trace occurs when your instances were updated to 2.249.2.3 and the JGroups customization was performed through the GUI. The issue can be fixed either by upgrading the product to 2.249.2.4 or by placing the snippet shown after the stack trace in $JENKINS_HOME/jgroups.xml.
2020-10-12 21:25:06.280+0000 [id=1] SEVERE winstone.Logger#logInternal: Container startup failed
java.lang.IllegalArgumentException: JGRP000001: configuration error: the following properties in com.cloudbees.jenkins.ha.singleton.CHMOD_FILE_PING are not recognized: {remove_old_files_on_view_change=true}
    at org.jgroups.stack.Configurator.createLayer(Configurator.java:278)
    at org.jgroups.stack.Configurator.createProtocols(Configurator.java:215)
    at org.jgroups.stack.Configurator.setupProtocolStack(Configurator.java:82)
    at org.jgroups.stack.Configurator.setupProtocolStack(Configurator.java:49)
    at org.jgroups.stack.ProtocolStack.setup(ProtocolStack.java:475)
    at org.jgroups.JChannel.init(JChannel.java:965)
    at org.jgroups.JChannel.<init>(JChannel.java:148)
    at org.jgroups.JChannel.<init>(JChannel.java:122)
    at com.cloudbees.jenkins.ha.AbstractJenkinsSingleton.createChannel(AbstractJenkinsSingleton.java:176)
    at com.cloudbees.jenkins.ha.singleton.HASingleton.start(HASingleton.java:86)
Caused: java.lang.Error: Failed to form a cluster
    at com.cloudbees.jenkins.ha.singleton.HASingleton.start(HASingleton.java:179)
<!-- HA setup; may be overridden with $JENKINS_HOME/jgroups.xml -->
<config xmlns="urn:org:jgroups"
        xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups https://www.jgroups.org/schema/jgroups-4.0.xsd">
    <TCP_NIO2
        recv_buf_size="${tcp.recv_buf_size:128K}"
        send_buf_size="${tcp.send_buf_size:128K}"
        max_bundle_size="64K"
        sock_conn_timeout="1000"
        bind_port="${HA_BIND_PORT}"
        port_range="${HA_PORT_RANGE}"
        diagnostics_port="${HA_DIAGNOSTIC_PORT}"
        thread_pool.enabled="true"
        thread_pool.min_threads="1"
        thread_pool.max_threads="10"
        thread_pool.keep_alive_time="5000"/>
    <CENTRAL_LOCK />
    <com.cloudbees.jenkins.ha.singleton.CHMOD_FILE_PING location="${HA_JGROUPS_DIR}" remove_old_coords_on_view_change="true"/>
    <MERGE3 max_interval="30000" min_interval="10000"/>
    <FD_SOCK/>
    <FD timeout="3000" max_tries="3" />
    <VERIFY_SUSPECT timeout="1500" />
    <BARRIER />
    <pbcast.NAKACK2 use_mcast_xmit="false" discard_delivered_msgs="true"/>
    <UNICAST3 />
    <!--
      When a new node joins a cluster, initial message broadcast doesn't necessarily seem
      to arrive. Using a shorter cycles in the STABLE protocol makes the cluster recognize
      this dropped transmission and cause a retransmission.
    -->
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="4M"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000" view_bundling="true" max_join_attempts="5"/>
    <MFC max_credits="2M" min_threshold="0.4"/>
    <FRAG2 frag_size="60K" />
    <pbcast.STATE_TRANSFER />
    <!-- pbcast.FLUSH /-->
</config>
The KB Upgrade guide for instances running High Availability previous to 2.249.2.3 can give you more details about the source of the issue and how to address it.
If the promotion works on the troubleshooter and not in the instances
If the promotion works on the troubleshooter but not on the instances, the following experiment is recommended:
- Ensure that the latest version of the cloudbees-ha plugin is installed on both instances.
- Stop the service of the Jenkins instances.
- Add the Java argument -Dcom.cloudbees.jenkins.ha.level=ALL on both instances so that all the JGroups and HA logs are exposed.
- Start one instance.
- Wait 10 seconds or so.
- Start the second instance.
- Check the full logs of the Jenkins instances, which you can usually find under /var/log/jenkins.
After the experiment, remove the argument -Dcom.cloudbees.jenkins.ha.level=ALL that you just added on both instances.
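When reading the resulting logs, it can help to filter for the cluster-related messages that appear in the sample traces in this article. The sketch below does that with grep; the function name ha_events is hypothetical, and the log file name in the usage comment is an assumption, since only the /var/log/jenkins directory is given above.

```shell
# Prints the log lines that match messages seen in this article's sample traces:
# JOIN timeouts, singleton fallback, membership changes, and cluster failures.
ha_events() {
  grep -E 'too many JOIN attempts|timed out \(after [0-9]+ ms\)|Cluster membership has changed|New primary node|Elected as the primary node|Failed to form a cluster' "$1"
}

# Example usage (assumed log file name):
#   ha_events /var/log/jenkins/jenkins.log
```

Comparing the output from both instances shows whether each node saw the other or fell back to singleton mode.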