High Availability installation troubleshooting

7 minute read

If the CloudBees High Availability (HA) Plugin is not working as expected, use this page to troubleshoot.

Versions

Ensure that:

  • All instances must run the same CloudBees CI (or Jenkins) version

  • All instances must run using the same JDK version

Prerequisites

Verify that both yours nodes have the following configuration:

  • JAVA_OPTS contains -Dhudson.TcpSlaveAgentListener.hostName=<HOSTNAME>. Note that if the nodes and/or JNLP agents are in different networks, then the FQDN is needed (If there is not Public Domain Name, use a static IP instead).

  • JENKINS_ARGS contains --webroot=$LOCAL_FILESYSTEM/war and --pluginroot=$LOCAL_FILESYSTEM/plugins.

Required data for troubleshooting

This section describes how to collect the following minimum required information for troubleshooting HA issues.

  • Network infrastructure description

  • Support bundle of the instance in HA

  • JGroups customization (optional)

  • Output from Troubleshooting application

  • Logs from Active and Secondary node

If the required data is bigger than 20 MB, refer to Generating a support bundle - Sending a support bundle too large for Zendesk.

Network infrastructure description

A brief description of your network infrastructure including:

Support bundle

A support bundle from the Jenkins instance while the issue is exposed. Refer to Generating a support bundle for instructions.

JGroups customization (optional)

Default JGroups will pick up random ports unless you configure jgroups.xml to use fixed ports.

If you have configure jgroups.xml, please attach to the ticket.

Output from Troubleshooting application

To simplify the troubleshooting process of the network issues, CloudBees has published the troubleshooter program. This program runs the same lower level stack as Jenkins HA, and thus exercises the network in the exact same fashion. When you type in a text from stdin and hit Enter, you should see the text echoed on all nodes of the cluster (including the node in which you typed the text).

A good first step to diagnose the network problem is to run two instances of the troubleshooter program on the same host and see if they can communicate with each other. Then do the same on the other host. In this way, you can further isolate the problem.

Two options:

Non-Logging mode

Run the following command on both instances to determinate if primary and backup nodes are selected correctly. You need to go to both instances and run the following command (change $JENKINS_HOME to the corresponding value):

java -DJENKINS_HOME=$JENKINS_HOME -DHA_JGROUPS_DIR=$JENKINS_HOME/jgroups/ -Djgroups.bind_addr=<IP_ADDRESS>  -Djava.net.preferIPv4Stack=true -jar troubleshooter-<VERSION>-jar-with-dependencies.jar

Logging mode

In case the promotion process does not work correctly, such as both nodes run as primary node, run the troubleshooter application on logging mode to expose the problem. (Change $JENKINS_HOME to the corresponding value.)

java -DJENKINS_HOME=$JENKINS_HOME -DHA_JGROUPS_DIR=$JENKINS_HOME/jgroups/ -Djgroups.bind_addr=<IP_ADDRESS>  -Dlogging.org.jgroups=ALL -Dlogging.com.cloudbees.jenkins.ha=ALL -Djava.net.preferIPv4Stack=true -Dha-troubleshooter.filelogging -jar troubleshooter-<VERSION>-jar-with-dependencies.jar

Note about file logging

The -Dha-troubleshooter.filelogging enables file logging with log rotation. By default, it rotates at 100 MB. The use case is to be able to let it run in background while waiting for the issue to reoccur.

If you need to cover a bigger period of time, you may want to also use -Dha-troubleshooter.filelogging.count=NN to raise the default value of 10. For example, to cover a whole weekend duration, you may want to use -Dha-troubleshooter.filelogging.count=100 and rotate on 100 files of 10 MB, to consume a maximum of 1 GB of disk space.

When the tool starts, it will display the values for all those configuration so that you can make sure it was taken in account. Something like:

Logs File Rotation enabled: # of files: 10, max size per file: 10000000, pattern: ha-troubleshooting.abcd.%u.log

The tool generates a random four hexa digits in the file name to avoid clashing with existing one, when for example running the tool on many nodes of a HA cluster.

Logs from Active and Passive node

Provide the both, active and passive nodes, Jenkins logs such as /var/log/jenkins/jenkins.log in Ubuntu/Debian.

operations center and Client Master nodes do not form a cluster

See the JGroups troubleshooting guide for typical problems. When nodes don’t form a cluster, it is normally either because the protocol needs additional configuration or there’s a problem in the network configuration of the operating system or the network equipment (for example, nodes cannot “see” one another via TCP).

Ensure that all the instances are using the same Unix user:group

Ensure that the userID used to run the Jenkins process is the same on the 3 of the servers: NFS, Jenkins primary, and Jenkins failover.

Run on the three instances the following commands and ensure that the ID used is the same for the user. For this example, jenkins:jenkins is used.

id -u jenkins

In case the ID is not the same on the three instances, the following Unix commands can be used to create a jenkins:jenkins - which is needed in the NFS instance and to modify the ID of the user and group.

sudo useradd jenkins
sudo usermod -u [jenkins UID] jenkins
sudo groupadd jenkins
sudo groupmod -g [jenkins GID] jenkins

Ensure that the owner of the $JENKINS_HOME is the user which run the Jenkins process

The user which owns the Jenkins process should be the owner of the $JENKINS_HOME. Run ls -la $JENKINS_HOME to check this. If you need to change the owner, the following command can be used.

sudo chown -R jenkins:jenkins $JENKINS_HOME

Customize JGroups to use fixed ports if running behind a firewall

By default, the CloudBees HA plugin uses a random port to communication. If the instances are running behind a firewall, you have two options to configure the fixed ports.

The following examples mean that you need to open the following ports on the firewall: 56736, 56737, 56738, 56739, 56740 and 35483.

From the UI

Starting from CloudBees High Availability 4.8 it is possible to configure the ports via the UI. Go to Manage Jenkins Configure System, and locate the High Availability Configuration section. Enable the customization, specify the ports and restart both nodes in HA singleton.

Figure 1. Enable customized JGroups configuration

From the configuration file

Another option is to place the following snippet in $JENKINS_HOME/jgroups.xml.

The following elements need to be customized:

  • bind_port

  • port_range

  • diagnostics_port

bind_port="56736"
port_range="5"
diagnostics_port="35483"
<!-- HA setup; may be overridden with $JENKINS_HOME/jgroups.xml -->
<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.1.xsd">
    <TCP_NIO2
         recv_buf_size="${tcp.recv_buf_size:128K}"
         send_buf_size="${tcp.send_buf_size:128K}"
         max_bundle_size="64K"
         max_bundle_timeout="30"
         sock_conn_timeout="1000"

         bind_port="56736"
         port_range="5"
         diagnostics_port="35483"

         timer_type="new"
         timer.min_threads="4"
         timer.max_threads="10"
         timer.keep_alive_time="3000"
         timer.queue_max_size="500"

         thread_pool.enabled="true"
         thread_pool.min_threads="1"
         thread_pool.max_threads="10"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="false"
         thread_pool.queue_max_size="100"
         thread_pool.rejection_policy="discard"

         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="1"
         oob_thread_pool.max_threads="8"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="100"
         oob_thread_pool.rejection_policy="discard"/>

    <CENTRAL_LOCK />

    <com.cloudbees.jenkins.ha.singleton.CHMOD_FILE_PING
             location="${HA_JGROUPS_DIR}"
             remove_old_coords_on_view_change="true"
             remove_all_files_on_view_change="true"/>
    <MERGE2 max_interval="30000"
            min_interval="10000"/>
    <FD_SOCK/>
    <FD timeout="3000" max_tries="3" />
    <VERIFY_SUSPECT timeout="1500"  />
    <BARRIER />
    <pbcast.NAKACK2 use_mcast_xmit="false"
                   discard_delivered_msgs="true"/>
    <UNICAST />
    <!--
      When a new node joins a cluster, initial message broadcast doesn't necessarily seem
      to arrive. Using a shorter cycles in the STABLE protocol makes the cluster recognize
      this dropped transmission and cause a retransmission.
    -->
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="4M"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000"
                view_bundling="true"
                max_join_attempts="5"/>
    <MFC max_credits="2M"
         min_threshold="0.4"/>
    <FRAG2 frag_size="60K"  />
    <pbcast.STATE_TRANSFER />
    <!-- pbcast.FLUSH  /-->
</config>

Bind JGroups with the right network interface

If any of the two instances used to form a cluster has more than one network interface, you must tell JGroups which one should be used to listen for packets. The following two Java arguments are used for this purpose:

-Djgroups.bind_addr=<IP_ADDRESS>
-Djava.net.preferIPv4Stack=true

where <IP_ADDRESS> is the IP address of the network interface in the instance which should be able to reach out the other node in the cluster.

In case of any issue with the CloudBees High Availability plugin, these arguments should be set up to ensure JGroups is bound to the right interface.

If both nodes are failing are being running as a singleton mode

There is a miscommunication between the nodes of the JGroup cluster, thus a similar trace like the following is being shown in both nodes

-------------------------------------------------------------------
GMS: address=stg-rbl-jnk-mas-w2a-a-43979, cluster=Jenkins, physical address=10.65.45.107:38926
-------------------------------------------------------------------
Oct 10, 2018 5:09:08 AM org.jgroups.protocols.pbcast.ClientGmsImpl joinInternal
WARNING: stg-rbl-jnk-mas-w2a-a-43979: JOIN(stg-rbl-jnk-mas-w2a-a-43979) sent to stg-rbl-jnk-mas-w2a-a-18558 timed out (after 3000 ms), on try 1
Oct 10, 2018 5:09:11 AM org.jgroups.protocols.pbcast.ClientGmsImpl joinInternal
WARNING: stg-rbl-jnk-mas-w2a-a-43979: JOIN(stg-rbl-jnk-mas-w2a-a-43979) sent to stg-rbl-jnk-mas-w2a-a-18558 timed out (after 3000 ms), on try 2
Oct 10, 2018 5:09:14 AM org.jgroups.protocols.pbcast.ClientGmsImpl joinInternal
WARNING: stg-rbl-jnk-mas-w2a-a-43979: JOIN(stg-rbl-jnk-mas-w2a-a-43979) sent to stg-rbl-jnk-mas-w2a-a-18558 timed out (after 3000 ms), on try 3
Oct 10, 2018 5:09:17 AM org.jgroups.protocols.pbcast.ClientGmsImpl joinInternal
WARNING: stg-rbl-jnk-mas-w2a-a-43979: JOIN(stg-rbl-jnk-mas-w2a-a-43979) sent to stg-rbl-jnk-mas-w2a-a-18558 timed out (after 3000 ms), on try 4
Oct 10, 2018 5:09:20 AM org.jgroups.protocols.pbcast.ClientGmsImpl joinInternal
WARNING: stg-rbl-jnk-mas-w2a-a-43979: JOIN(stg-rbl-jnk-mas-w2a-a-43979) sent to stg-rbl-jnk-mas-w2a-a-18558 timed out (after 3000 ms), on try 5
Oct 10, 2018 5:09:20 AM org.jgroups.protocols.pbcast.ClientGmsImpl joinInternal
WARNING: stg-rbl-jnk-mas-w2a-a-43979: too many JOIN attempts (5): becoming singleton
Oct 10, 2018 5:09:20 AM com.cloudbees.jenkins.ha.singleton.HASingleton$3 viewAccepted
INFO: Cluster membership has changed to: [stg-rbl-jnk-mas-w2a-a-43979|0] (1) [stg-rbl-jnk-mas-w2a-a-43979]
Oct 10, 2018 5:09:20 AM com.cloudbees.jenkins.ha.singleton.HASingleton$3 viewAccepted
INFO: New primary node is JenkinsClusterMemberIdentity[member=stg-rbl-jnk-mas-w2a-a-43979,weight=0,min=0]
Oct 10, 2018 5:09:20 AM com.cloudbees.jenkins.ha.singleton.HASingleton reactToPrimarySwitch
INFO: Elected as the primary node

Action: Delete all the content inside $JENKINS_HOME/jgroups to reset the JGroups cluster. The content will be regenerated after the next restart of the active and passive nodes.

If the promotion works on the troubleshooter and not in the instances

If the promotion works on the troubleshooter and not in the instances the following experiment is recommended:

  1. Ensure that the latest version of the cloudbees-ha plugin is installed on both instances.

  2. Stop the service of the Jenkins instances.

  3. Add the following Java argument -Dcom.cloudbees.jenkins.ha.level=ALL in both instances so all the JGroups and HA logs are exposed.

  4. Start one instance.

  5. Wait 10 seconds or so.

  6. Start the second instance.

  7. Check the full logs of Jenkins instances that you can usually find under /var/log/jenkins.

After the experiment you should remove the argument -Dcom.cloudbees.jenkins.ha.level=ALL you just added on both instances.

Test HA cluster membership

When two nodes that form a cluster lose contact, each node will assume that the other had died, and will take the responsibility as the primary node. This is called a "split brain" problem. This is problematic as you end up having two independently acting operations centers. A similar problem can occur if one node in the cluster is severely stressed under load. The HA Monitor Tool includes a sanity check script which provides users an opportunity to apply some heuristics to reduce the likelihood of this problem.

To test for cluster membership roles, the executable the sanity-check.sh script in $JENKINS_HOME is run before a node assumes the primary role, as well as when there’s a change in the cluster membership. The use case is for you to make sure that the node should really proceed to act as the primary. If the script exits with 0, the node will boot up as the primary node, and if it exists with non-zero, the node will not act as the primary node. In addition to those checks, the sanity script allows the user to check the availability of $JENKINS_HOME, if the load balancer is routable, or if the system load is reasonably low.

HAProxy connectivity issues

If you are having issues with HAProxy connectivity, modify the defaults section of the HAProxy config to troubleshoot with these additional options:

defaults
        log     global
        # The following log settings are useful for debugging
        # Tune these for production use
        option  logasap
        option  http-server-close
        option  redispatch
        option  abortonclose
        option  log-health-checks
        mode    http
        option  dontlognull
        retries 3
        maxconn         2000
        timeout         http-request    10s
        timeout         queue           1m
        timeout         connect         5000
        timeout         client          50000
        timeout         server          50000
        timeout         http-keep-alive 10s
        timeout         check           500
        default-server  inter 5s downinter 500 rise 1 fall 1