High Availability

High Availability (HA) on a Windows environment is not supported.

CloudBees CI on traditional platforms comes with the capability to run in a high-availability setup, where two or more JVMs form a so-called "HA singleton" cluster, to ensure business continuation for Jenkins. This improves the availability of the service against unexpected problems in the JVM, the hardware that it runs on, etc. When a Jenkins JVM becomes unavailable (for example, when it stops responding, or when it dies), other nodes in the cluster automatically take over the role of the master, thereby restoring service with minimum interruption. It is also important for users to understand what this feature does not do in its current form. Namely, it is not a symmetric cluster, where participating nodes will share workloads together. At any given point only one of the nodes is performing the master role (hence "HA singleton"). Because of this, when a failover takes place, users will experience a brief downtime, comparable to someone rebooting a Jenkins master in a non-HA setup. Builds that were in progress would be lost, too. This guide will walk through the High Level Overview, detail implementation and configuration steps, and finish with troubleshooting tips.

The described configuration assumes you are using Operations Center. You can also configure HA with CloudBees CI on traditional platforms standalone masters. Configuration differences are noted when applicable.

High-level overview

The diagram, below, outlines the principal design details with routing for each communication protocol.

ha network diagram

Alpha and Beta

To create the "HA singleton", it is necessary to have two copies of Operations Center: Alpha and Beta (active and standby, respectively). In HA mode, the two Jenkins instances cooperatively elect the "primary" JVM, and depending on the outcome of this election, members of a cluster starts/stops the Client Master in the same JVM. From the viewpoint of the code inside Operations Center, this is as if it is started/stopped programmatically. Operations Center relies on JGroups for the underlying group membership service.

By default, Operations Center uses TCP to communicate between members, with IP addresses and ports registered in a directory $JENKINS_HOME/jgroups (which all members must be able to write to). This can be changed by creating $JENKINS_HOME/jgroups.xml that describes the JGroups protocol stack configuration XML. See the JBoss Clustering documentation for the format of this file, as well as typical configuration tips and troubleshooting.

HA Failover

A failover is effectively (1) shutting down the current Jenkins JVM, followed by (2) starting it up in another location. Sometimes step 1 doesn’t happen, for example when the current master crashes. Because these masters work with the same $JENKINS_HOME, this failover process has the following characteristics:

  • Jenkins global settings, configuration of jobs/users, fingerprints, record of completed builds (including archived artifacts, test reports, etc.), will all survive a failover.

  • User sessions are lost. If your Jenkins installation requires users to log in, they’ll be asked to log in again.

During the startup phase of the failover, Jenkins will not be able to serve inbound requests or builds. Therefore, a failover typically takes a few minutes, not a few seconds.

For HA Client Masters, builds that were in progress will normally not survive a failover, although their records will survive. The builds based on Jenkins Pipeline Jobs will continue to execute. No attempt will be made to re-execute interrupted builds, though the Restart Aborted Builds plugin will list the aborted builds.

Load balancer

In general, all Jenkins traffic should be routed through the load balancer with the exception of the JNLP transport. The JNLP transport is used by Client Masters to connect to Operations Center and by Build Agents configured to use the "Java Web Start" launcher.

Operations Center accepts incoming JNLP connections from multiple components: Client Masters and Agents (usually Shared Agents). On a failover (a new Operations Center instance gets primary role in the cluster), all those JNLP connections are automatically restored.

An alternative to the proxying of TCP sockets is to use the system property hudson.TcpSlaveAgentListener.hostName to ensure that each Jenkins instance presents the correct X-Jenkins-CLI-Host header in requests. Either method has implications as to the mutual visibility of the agent nodes and the Jenkins nodes.

This hudson.TcpSlaveAgentListener.hostName system property is available to clients as an HTTP header (X-Jenkins-CLI-Host) which included in all HTTP responses. JNLP clients will use this header to determine which hostname to use for the JNLP protocol. Clients will continue attempting to connect to the JNLP port until a connection is successfully established after failover. If the system property is not provided, the host name of the Jenkins root URL is used. The hudson.TcpSlaveAgentListener.hostName must be properly set on all instances for load-balancing to work correctly in an HA configuration

There are many software and hardware load balancer solutions. CloudBees recommends the Open Source solution haproxy which doubles as a reverse proxy. For truly highly-available setup, haproxy itself needs to be made highly available. Support for HAProxy is available from HAProxy Technologies Inc.

In general, the Load Balancers in HA will be configured to:

  • Route HTTP and HTTPS traffic

  • Route SSHD TCP traffic

  • Listen for the heartbeat on /ha/health-check

  • (Optional if using HTTPS) Set location of SSL key for HTTPS

This guide will detail the setup and configuration for HAProxy. For guidance on configuring other load balancers, please see:

HTTPS and SSL

In addition to acting as a load balancer and reverse proxy, haproxy acts as a SSL Termination point for Operations Center. This allows internal traffic to remain in the default http configuration while providing a secured endpoint for users.

If you do not have access to your environments' SSL key files, please reach out to your operations teams.

If the SSL certificate used by haproxy is not trusted by default by the Operations Center JVM, it must be added to the JVM keystores of both Operations Centers and all Client Masters.

If the SSL certificate is used to secure the connection to elasticsearch, it must be added to both Operations Center JVM keystores.

Note: When making Client Masters highly available, if the SSL certificate used by haproxy is not trusted by default by the Operations Center JVM, it is recommended that it is to the keystores of both Operations Centers.

For further information, please refer to How to install a new SSL certificate

Shared storage

Each member node of a Operations Center HA cluster needs to see a single coherent shared file system that can be read and written simultaneously. That is to say, for node Alpha and Beta in the cluster, if node Alpha creates a file in $JENKINS_HOME, node Beta needs to be able to see it within a reasonable amount of time. A "reasonable amount of time" here means the time window during which you are willing to lose data in case of a failure. This is commonly accomplished with a NFS Shared File System.

To set up the NFS Server please see: NFS Guide

Note: For truly highly-available Jenkins, NFS storage itself needs to be made highly-available. There are many resources on the web describing how to do this. See the Linux HA-NFS Wiki as a starting point.

If you are using the Amazon’s Elastic File System Management service for your NFS Server, please ensure that the server’s Performance Mode is set to "Max I/O Performance Mode".

CloudBees Jenkins HA monitor tool

The CloudBees Jenkins Enterprise HA monitor tool is an optional tool that can be set up, but is not required for setting up an HA cluster. It is a small background application that executes code when the primary Jenkins JVM becomes unresponsive. Such setup/teardown scripts can only be reliably triggered from outside Jenkins. Those scripts also normally require root privileges to run. It can be downloaded from the jenkins-ha-monitor section of the download site.

Its 3 defining options are as follows:

  • The -home option that specifies the location of $JENKINS_HOME. The monitor tool picks up network configuration and other important parameters from here, so it needs to know this location.

  • The -host-promotion option that specifies the location of the promotion script, which gets executed when the primary Jenkins JVM moves in from another system into this system as a result of election. In native packages, this file is placed at /etc/jenkins-ha-monitor/promotion.sh.

  • The -host-demotion option that specifies the demotion script, which is the opposite of the promotion script and gets executed when the primary Jenkins JVM moves from this system into another system. In native packages, this file is placed at /etc/jenkins-ha-monitor/demotion.sh. Promotion and demotion scripts need to be idempotent, in the sense that the monitor tool may run the promotion script on an already promoted node, and the demotion script on an already demoted node. This can happen, for example, when a power outage hits a stand-by node and when it comes back up. The monitor tool runs the demotion script again on this node, since it cannot be certain about the state of the node before the power outage.

Run the tool with the -help option to see the complete list of available options. Configuration file that is read during start of the service is located at /etc/sysconfig/jenkins-ha-monitor.

Setup Procedure

The HA setup in Operations Centers provides the means for multiple JVMs to coordinate and ensure that the Jenkins master is running somewhere, but it does so by relying on the availability of the storage that houses $JENKINS_HOME and the HTTP reverse proxy mechanism that hides a failover from users who are accessing Jenkins. Aside from NFS as a storage and the reverse proxy mechanism, Operations Center can run on a wide range of environments. In the following sections, we’ll describe the parameters required from them and discuss more examples of the deployment mode.

The setup procedure will follow this basic format:

  • Before you begin

  • Configure Shared Storage

  • Configure Operations Center and Client Masters

  • Install Operations Center HA monitor tool

  • Configure HAProxy

After these steps are followed, you should follow the Testing and Troubleshooting section.

Configure shared storage on Alpha and Beta

All the member nodes of a CloudBees Jenkins Enterprise HA cluster need to see a single coherent file system that can be read and written to simultaneously. Creating a shared storage mount is outside the scope of this guide. Please see the Shared Storage section, above.

Mount the Shared Storage to Alpha and Beta:

$ mount -t nfs -o rw,hard,intr sierra:/jenkins /var/lib/jenkins

/var/lib/jenkins is chosen to match what the CloudBees Jenkins Enterprise packages use as $JENKINS_HOME. If you change them, update /etc/default/jenkins (on Debian) or /etc/sysconfig/jenkins (on RedHat and SUSE) to have $JENKINS_HOME point to the correct directory.

Before continuing, test that the mountpoint is accessible and permissions are correct on both machines by executing touch /<mount_point>/test.txt on Alpha and then try editing the file on Beta.

Install and configure Operations Center and Client Masters

Choose the appropriate Debian/Red Hat/openSUSE package format depending on the type of your distribution. Install CJOC on Alpha and Beta.

Upon installation, both instances of Jenkins will start running. Stop them by issuing /etc/init.d/jenkins stop, while we work on the HA setup configuration. If you don’t know how to set-up a Java Argument on Jenkins you can follow this KB article.

When Operations Center is being used in an HA configuration, both Alpha and Beta instances must modify the $JENKINS_HOME Jenkins argument to point to the shared NFS mount.

It is an important performance optimization that the WAR file is not extracted to the $JENKINS_HOME/war directory in the shared filesystem. In a rolling upgrade scenario, an upgrade of the secondary instance followed by a (standby) boot can corrupt this directory. Some configurations may do this by default, but WAR extraction can easily be redirected to a local cache (ideally SSD for better Jenkins core I/O) on the container/VM’s local filesystem with the JENKINS_ARGS properties --webroot=$LOCAL_FILESYSTEM/war --pluginroot=$LOCAL_FILESYSTEM/plugins. For example, on debian installations, where $NAME refers to the name of the jenkins instance: --webroot=/var/cache/$NAME/war --pluginroot=/var/cache/$NAME/plugins

$JENKINS_HOME is read intensively during the start-up. If bandwidth to your shared storage is limited, you’ll see the most impact in startup performance. Large latency causes a similar issue, but this can be mitigated somewhat by using a higher value in the bootup concurrency by a system property -Djenkins.InitReactorRunner.concurrency=8.

To ensure that JNLP connecting in a HA configuration, each instance must set the system property JENKINS_OPTS=-Dhudson.TcpSlaveAgentListener.hostName= equal to a hostname (preferred and required when using HTTPS) or IP address ( unsupported for HTTPS configurations) of that instance that can be resolved and connected to by all Client Masters and build agents which use the JNLP connection protocol. This requires more exposure of Operations Center servers, as all instances must be addressable by all Client Masters.

Because we are configuring HTTPS traffic, it is necessary to update the property JAVA_ARGS and add -Djavax.net.ssl.trustStore=path/to/jenkins-truststore.jks -Djavax.net.ssl.trustStorePassword=changeit". To debug ssl issues, add "-Djavax.net.debug=all" Oracle Debugging SSL/TLS Connections and How to install a new SSL certificate

When using the redhat package (RPM) it is highly recommended after the installation to set the option JENKINS_INSTALL_SKIP_CHOWN to false in /etc/sysconfig/jenkins. This option will prevent in future upgrades to apply a chown on your JENKINS_HOME folder which may take a lot of time (especially for an HA setup where JENKINS_HOME is on remote file system using NFS and will be updated by the upgrade of the two nodes).

Install Operations Center HA monitor tool

Next, we set up a monitoring service to ensure the HA failover completes. To do this, log on to Alpha and install the jenkins-ha-monitor package. This monitoring program watches Jenkins as root, and when the role transition occurs, it’ll execute the promotion script or the demotion script.

The monitor tool is packaged into a single jar file that can be executed as java -jar jenkins-ha-monitor.jar. It is also packaged as the jenkins-ha-monitor RPM/DEB packages for an easier installation on the jenkins-ha-monitor section of the download site

To install it you need to do it as any RPM file, for example:

sudo rpm -i jenkins-ha-monitor-<version>.noarch.rpm

And for uninstalling you can do it by calling:

sudo rpm -e jenkins-ha-monitor

Configuration file that is read during start of the service is located at /etc/sysconfig/jenkins-ha-monitor. CloudBees Jenkins Enterprise HA monitor tool logs by default at /var/log/jenkins/jenkins-ha-monitor.log.

Configure HAProxy

Let’s expand on this setup further by introducing an external load balancer and reverse proxy that receives traffic from users, determines the primary JVM based on heartbeat, then direct them to the active primary JVM. CloudBees recommends configuring HTTPS and this guide is tailored for HTTPS. HAProxy can be installed on most Linux systems via native packages, such as apt-get install haproxy or yum install haproxy, however CloudBees recommends that haproxy version 1.6 (or newer) be installed. For Operations Center HA, the configuration file (normally /etc/haproxy/haproxy.cfg) should look like the following:

global
        log 127.0.0.1   local0
        log 127.0.0.1   local1 notice
        maxconn 4096
        user haproxy
        group haproxy

        # Default SSL material locations
        ca-base /etc/ssl/certs
        crt-base /etc/ssl/private

        # Default ciphers to use on SSL-enabled listening sockets.
        # For more information, see ciphers(1SSL). This list is from:
        #  https://hynek.me/articles/hardening-your-web-servers-ssl-ciphers/
        ssl-default-bind-ciphers FFF+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+3DES:!aNULL:!MD5:!DSS
        ssl-default-bind-options no-sslv3

        tune.ssl.default-dh-param 2048

defaults
        log    global
        option    http-server-close
        option    log-health-checks
        option    dontlognull
        timeout    http-request    10s
        timeout    queue           1m
        timeout    connect         5000
        timeout    client          50000
        timeout    server          50000
        timeout    http-keep-alive 10s
        timeout    check           500
        default-server inter 5s downinter 500 rise 1 fall 1

#redirect HTTP to HTTPS
listen http-in
        bind    *:80
        mode    http
        redirect scheme https code 301 if !{ ssl_fc }
listen https-in
        # change the location of the pem file
        bind    *:443 ssl crt /etc/ssl/certs/your.pem
        mode    http
        option    httplog
        option    httpchk HEAD /ha/health-check
        option    forwardfor
        option    http-server-close
        # alpha and beta should be replaced with hostname (or ip) and port
        # 8888 is the default for CJOC, 8080 is the default for {CMs}
        server    alpha alpha:8888 check
        server    beta beta:8888 check
        reqadd    X-Forwarded-Proto:\ https

listen ssh
    bind 0.0.0.0:2022
        mode      tcp
        option    tcplog
        option    httpchk HEAD /ha/health-check
        # alpha and beta should be replaced with hostname (or ip) and port
        # 8888 is the default for CJOC, 8080 is the default for {CMs}
        server    alpha alpha:2022 check port 8888
        server    beta beta:2022 check port 8888

The global section contains stock settings and defaults has been configured with typical timeout settings. You must configure the listen blocks to your particular environment. To determine the port configuration of an existing installation, please reference the How to add Java arguments to Jenkins guide.

The "Default SSL material locations" and bind *:443 ssl crt /etc/ssl/certs/server.bundle.pem sections define the paths to the necessary ssl keyfiles.

Together, the https-in and http-in sections determine the bulk of the configuration necessary for Operations Center routing. These listen blocks tell haproxy to forward traffic to two servers alpha and beta, and periodically check their health by sending a GET request to /ha/health-check. Unlike active nodes, standby nodes do not respond positively to this health check, and that’s how haproxy determines traffic routing.

The ssh services section is configured to forward tcp requests. For these services, haproxy uses the same health check on the application port. This ensures that all services fail over together when the health check fails.

Testing and troubleshooting your HA installation

HA cluster membership

When two nodes that form a cluster lose contact, each node will assume that the other had died, and will take the responsibility as the primary node. This is called a "split brain" problem. This is problematic as you end up having two independently acting Operations Centers. A similar problem can occur if one node in the cluster is severely stressed under load. The HA Monitor Tool includes a sanity check script which provides users an opportunity to apply some heuristics to reduce the likelihood of this problem.

To test for cluster membership roles, the executable the sanity-check.sh script in $JENKINS_HOME is run before a node assumes the primary role, as well as when there’s a change in the cluster membership. The use case is for you to make sure that the node should really proceed to act as the primary. If the script exits with 0, the node will boot up as the primary node, and if it exists with non-zero, the node will not act as the primary node. In addition to those checks, the sanity script allows the user to check the availability of $JENKINS_HOME, if the load balancer is routable, or if the system load is reasonably low.

HAProxy

If you are having issues with haproxy connectivity, modify the defaults section of the haproxy config to troubleshoot with these additional options:

defaults
        log     global
        # The following log settings are useful for debugging
        # Tune these for production use
        option  logasap
        option  http-server-close
        option  redispatch
        option  abortonclose
        option  log-health-checks
        mode    http
        option  dontlognull
        retries 3
        maxconn         2000
        timeout         http-request    10s
        timeout         queue           1m
        timeout         connect         5000
        timeout         client          50000
        timeout         server          50000
        timeout         http-keep-alive 10s
        timeout         check           500
        default-server  inter 5s downinter 500 rise 1 fall 1

Everything else

For additional troubleshooting tips please see How to Troubleshoot HA Installations.