Required Data: HA (active/passive) issues

Article ID:115002341252

Issue

High Availability (active/passive) is not working as expected.

Prerequisite

Verify that both of your nodes have the following configuration:

  • JAVA_OPTS contains -Dhudson.TcpSlaveAgentListener.hostName=<HOSTNAME>. Note that if the nodes and/or JNLP agents are in different networks, the FQDN is required (if there is no public domain name, use a static IP address instead).

  • JENKINS_ARGS contains --webroot=$LOCAL_FILESYSTEM/war and --pluginroot=$LOCAL_FILESYSTEM/plugins
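The check above can be sketched as a small shell function. This is only an illustration: check_ha_opts is a hypothetical helper name, and it assumes you pass in the JAVA_OPTS and JENKINS_ARGS values yourself, since where they are defined depends on your installation.

```shell
#!/bin/sh
# Sketch only: check_ha_opts is a hypothetical helper, not part of any product.
# $1 = value of JAVA_OPTS, $2 = value of JENKINS_ARGS (as plain strings).
# Prints one line per missing option and returns non-zero if any is missing.
check_ha_opts() {
    java_opts=$1
    jenkins_args=$2
    rc=0
    case "$java_opts" in
        *-Dhudson.TcpSlaveAgentListener.hostName=*) ;;
        *) echo "missing: -Dhudson.TcpSlaveAgentListener.hostName"; rc=1 ;;
    esac
    case "$jenkins_args" in
        *--webroot=*) ;;
        *) echo "missing: --webroot"; rc=1 ;;
    esac
    case "$jenkins_args" in
        *--pluginroot=*) ;;
        *) echo "missing: --pluginroot"; rc=1 ;;
    esac
    return $rc
}

# Example call, assuming both variables are set in the current shell:
#   check_ha_opts "$JAVA_OPTS" "$JENKINS_ARGS" && echo "HA options present"
```

The function only checks that the options are present; it does not validate that the hostname resolves or that the directories are on a local file system.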

Required Data HA issues

This article describes how to collect the minimum required information for troubleshooting HA (active/passive) issues.

If the required data is larger than 50 MB, you will not be able to upload it through ZenDesk. In this case, please use our upload service to attach all the required information.

Required Data check list

  • Network infrastructure description

  • Support bundle of the instance in HA

  • JGroups customization (optional)

  • Output from Troubleshooting application

  • Logs from Active and Passive nodes

Network infrastructure description

A brief description of your network infrastructure including:

  • Is there a firewall (or any other interposed network device) between the two nodes? Note: a firewall requires a customized jgroups.xml; see High Availability Installations Troubleshooting.

  • Do you have several network interfaces on those instances?
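To answer the question about network interfaces, one option on Linux is to summarize the output of `ip -o -4 addr show`. The sketch below assumes the iproute2 `ip` command is available on the node; list_ipv4_addrs is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: reads `ip -o -4 addr show` output on stdin and prints one
# "interface address" pair per line (the prefix length is stripped).
list_ipv4_addrs() {
    awk '{ split($4, a, "/"); print $2, a[1] }'
}

# On a Linux node with iproute2 installed:
#   ip -o -4 addr show | list_ipv4_addrs
```

If more than one non-loopback address is listed, include the list in the ticket and state which address each node should use to reach the other.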

Support bundle

A support bundle generated from the Jenkins instance while the issue is occurring. If you don’t know how to generate a support bundle, please follow the KB below.

JGroups customization [optional]

By default, JGroups picks random ports unless jgroups.xml is configured.

If you have configured jgroups.xml, please attach it to the ticket.

Output from Troubleshooting application

To simplify troubleshooting of network issues, we have published the troubleshooter program. This program runs the same lower-level stack as Jenkins HA, and thus exercises the network in exactly the same fashion. When you type text on stdin and press Enter, you should see the text echoed on all nodes of the cluster (including the node on which you typed it).

A good first step to diagnose the network problem is to run two instances of the troubleshooter program on the same host and see if they can communicate with each other. Then do the same on the other host. In this way, you can further isolate the problem.

Two options:

Non-Logging mode

Run the following command on both instances to determine whether the primary and backup nodes are selected correctly (replace $JENKINS_HOME with the corresponding value):

java -DJENKINS_HOME=$JENKINS_HOME -DHA_JGROUPS_DIR=$JENKINS_HOME/jgroups/ -Djgroups.bind_addr=<IP_ADDRESS>  -Djava.net.preferIPv4Stack=true -jar troubleshooter-<VERSION>-jar-with-dependencies.jar

Logging mode

If the promotion process does not work correctly, i.e. both nodes run as the primary node, run the troubleshooter application in logging mode to expose the problem.

java -DJENKINS_HOME=$JENKINS_HOME -DHA_JGROUPS_DIR=$JENKINS_HOME/jgroups/ -Djgroups.bind_addr=<IP_ADDRESS>  -Dlogging.org.jgroups=ALL -Dlogging.com.cloudbees.jenkins.ha=ALL -Djava.net.preferIPv4Stack=true -Dha-troubleshooter.filelogging -jar troubleshooter-<VERSION>-jar-with-dependencies.jar
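The two invocations differ only in the logging options, so they can be wrapped in one function that checks its inputs before starting Java. This is a sketch under assumptions: run_troubleshooter is a hypothetical name, the jar path is whatever location you downloaded the troubleshooter to, and the Java options are copied from the commands above.

```shell
#!/bin/sh
# Sketch only: run_troubleshooter is a hypothetical wrapper.
# $1 = bind address, $2 = path to the troubleshooter jar,
# remaining arguments = extra -D options (for example the logging ones).
run_troubleshooter() {
    bind_addr=$1
    jar=$2
    shift 2
    if [ -z "${JENKINS_HOME:-}" ] || [ ! -d "$JENKINS_HOME" ]; then
        echo "JENKINS_HOME is not set or is not a directory" >&2
        return 2
    fi
    if [ ! -f "$jar" ]; then
        echo "troubleshooter jar not found: $jar" >&2
        return 2
    fi
    # Same options as the non-logging command; extras are passed through.
    java -DJENKINS_HOME="$JENKINS_HOME" \
         -DHA_JGROUPS_DIR="$JENKINS_HOME/jgroups/" \
         -Djgroups.bind_addr="$bind_addr" \
         -Djava.net.preferIPv4Stack=true \
         "$@" \
         -jar "$jar"
}

# Non-logging mode:
#   run_troubleshooter <IP_ADDRESS> troubleshooter-<VERSION>-jar-with-dependencies.jar
# Logging mode:
#   run_troubleshooter <IP_ADDRESS> troubleshooter-<VERSION>-jar-with-dependencies.jar \
#       -Dlogging.org.jgroups=ALL -Dlogging.com.cloudbees.jenkins.ha=ALL \
#       -Dha-troubleshooter.filelogging
```

For the same-host check, start the function in two separate terminals on the same node and type a line in one of them.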

Attach the output of the Troubleshooting application, whether it was run with or without logging.

Note about file logging

The -Dha-troubleshooter.filelogging option enables file logging with log rotation. By default, the logs rotate across 10 files of 10 MB each, for a maximum of 100 MB. The intended use is to let the tool run in the background while waiting for the issue to recur.

If you need to cover a longer period of time, you can also use -Dha-troubleshooter.filelogging.count=NN to raise the number of files from its default of 10. For example, to cover a whole weekend, you could use -Dha-troubleshooter.filelogging.count=100 to rotate across 100 files of 10 MB each, consuming at most 1 GB of disk space.

When the tool starts, it displays the values of all these settings so that you can verify they were taken into account. For example:
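The disk usage for a given file count is simply the count multiplied by the per-file size. A minimal sketch of that arithmetic, assuming 10,000,000 bytes per file (the 10 MB figure used in the example); max_log_bytes is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: upper bound on disk space used by the rotated log files.
# $1 = value passed to -Dha-troubleshooter.filelogging.count
# $2 = bytes per file (assumed 10000000, i.e. 10 MB, when omitted)
max_log_bytes() {
    count=$1
    bytes_per_file=${2:-10000000}
    echo $((count * bytes_per_file))
}

max_log_bytes 10     # prints 100000000  (100 MB, the default)
max_log_bytes 100    # prints 1000000000 (1 GB)
```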

Logs File Rotation enabled: # of files: 10, max size per file: 10000000, pattern: ha-troubleshooting.abcd.%u.log

The tool puts four random hexadecimal digits in the file name to avoid clashing with existing files, for example when running the tool on several nodes of an HA cluster.

Logs from Active and Passive node

Provide the Jenkins logs from both the active and passive nodes, e.g. /var/log/jenkins/jenkins.log on Ubuntu/Debian. For other types of installation, see the default log locations here.
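Once the files are located, they can be packed into a single archive per node and compared against the 50 MB attachment limit. The following is a sketch, not an official collection tool: collect_ha_data is a hypothetical name, the paths are whatever applies to your installation, and 50 MB is assumed to mean 51200 KB.

```shell
#!/bin/sh
# Sketch only: collect_ha_data is a hypothetical helper.
# $1 = archive to create, remaining arguments = files or directories to include.
# Missing paths are skipped with a warning on stderr.
collect_ha_data() {
    archive=$1
    shift
    staging=$(mktemp -d) || return 1
    for f in "$@"; do
        if [ -e "$f" ]; then
            cp -R "$f" "$staging/"
        else
            echo "skipping missing path: $f" >&2
        fi
    done
    tar -czf "$archive" -C "$staging" . || { rm -rf "$staging"; return 1; }
    rm -rf "$staging"
    # Assumption: the 50 MB limit is treated as 51200 KB.
    size_kb=$(du -k "$archive" | cut -f1)
    if [ "$size_kb" -gt 51200 ]; then
        echo "$archive is ${size_kb} KB: above 50 MB, use the upload service"
    else
        echo "$archive is ${size_kb} KB: small enough to attach to the ticket"
    fi
}

# Example on an Ubuntu/Debian node (adjust the paths to your setup):
#   collect_ha_data /tmp/ha-data-$(hostname).tar.gz /var/log/jenkins/jenkins.log
```

Run it once on the active node and once on the passive node, so that each archive clearly belongs to one node.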