High Availability (active/active)

This guide provides an overview of the CloudBees CI High Availability feature and shows you how to install CloudBees CI High Availability.

High Availability capabilities and architecture

The CloudBees CI High Availability feature provides:

Controller Failover: If a controller fails, Pipeline builds normally run on that controller are automatically triggered or continued by another replica.
Load balancing: One logical controller can spread its workload across multiple replicas and keep them in sync. Refer to Build scheduling and explicit load balancing for more information.
Rolling restart with zero downtime for CloudBees CI on modern cloud platforms: If a controller replica is restarted, all the replicas keep running, and the user experiences no downtime.
Rolling upgrades with zero downtime for CloudBees CI on modern cloud platforms: If a managed controller running in HA mode must be updated to a new version, replicas are incrementally updated without restarting the managed controller and with zero downtime.
Auto-scaling for CloudBees CI on modern cloud platforms: You can set up managed controllers to increase the number of replicas, depending on the workload. They can upscale when CPU usage exceeds a threshold and downscale when conditions return to normal.

From a high-level perspective, these capabilities are provided using the architecture described in the image below:

Figure 1. High availability architecture

Controller replicas make High Availability possible.
The load balancer spreads the workload between the different controller replicas. Refer to Build scheduling and explicit load balancing for more information.
A shared file system persists controller content.
Hazelcast keeps the controllers’ live state in sync.

High Availability (active/active) vs. High Availability (active/passive)

Unlike the older active-passive HA system, the mode discussed here is symmetrically active-active.

In the previous High Availability (active/passive) mode, the cluster is not symmetric: controllers do not share the workload. At any given point, only one of the replicas works as a controller. When a failover occurs, one of the other replicas takes over the controller role, and users experience downtime comparable to rebooting a Jenkins controller in a non-HA setup.

With CloudBees CI High Availability (active/active) described in this guide, all controller replicas are always working, and the controller’s workload is spread between them. When one of the replicas fails, other replicas adopt all of its builds, and the user does not have any downtime.

Horizontal auto-scaling

Once you are running controllers with multiple replicas, you can use Kubernetes Horizontal Pod Autoscaling.

The horizontal pod autoscaling controller monitors resource utilization, and adjusts the scale of its target to match your configuration settings. For example, if utilization exceeds your defined threshold, the autoscaler increases the number of replicas.
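The scaling decision can be sketched with the formula published in the Kubernetes documentation, desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), including its default 10% tolerance band. This is a minimal illustration, not the autoscaler's actual code; the function name and the min_replicas and max_replicas defaults are assumptions chosen for the example.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=1, max_replicas=4, tolerance=0.1):
    """Return the replica count the autoscaler would aim for.

    min_replicas, max_replicas, and the function name are illustrative
    assumptions; the formula and the 0.1 tolerance follow the Kubernetes docs.
    """
    ratio = current_utilization / target_utilization
    # Inside the tolerance band, the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


# Two replicas at 90% CPU with a 60% target: scale up.
print(desired_replicas(2, 90, 60))
# Three replicas at 30% CPU with a 60% target: scale down.
print(desired_replicas(3, 30, 60))
```

Because the result is always clamped to the configured maximum, the threshold and the upper bound should be chosen together during performance testing.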

CloudBees recommends performance testing to determine appropriate thresholds that do not affect response time.

Details for the upscale events:

No rebalance of builds is performed.
Builds continue to run on existing replicas.
New builds are dispatched between replicas. Refer to Build scheduling and explicit load balancing for more information.
Due to sticky session usage, any existing session remains on the same replica.
New sessions are distributed randomly between replicas.

Details for downscale events:

Builds from removed replicas are adopted by the remaining replicas. Refer to Build scheduling and explicit load balancing for more information.
Web sessions associated with a removed replica are redirected to a remaining replica.
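The session behavior during upscale and downscale events can be modeled with a small toy router. This is only a sketch of the rules listed above, not the load balancer used by CloudBees CI; the StickyRouter class, its method names, and the replica names are hypothetical.

```python
import random


class StickyRouter:
    """Toy model of sticky-session routing across controller replicas."""

    def __init__(self, replicas, seed=None):
        self.replicas = list(replicas)
        self.sessions = {}  # session id -> replica name
        self._rng = random.Random(seed)

    def route(self, session_id):
        # Existing sessions stay on their replica; new ones are placed randomly.
        if session_id not in self.sessions:
            self.sessions[session_id] = self._rng.choice(self.replicas)
        return self.sessions[session_id]

    def add_replica(self, name):
        # Upscale: no existing session is moved.
        self.replicas.append(name)

    def remove_replica(self, name):
        # Downscale: sessions on the removed replica are redirected
        # to one of the remaining replicas.
        self.replicas.remove(name)
        for session_id, replica in self.sessions.items():
            if replica == name:
                self.sessions[session_id] = self._rng.choice(self.replicas)


router = StickyRouter(["replica-0", "replica-1"], seed=1)
home = router.route("alice")
router.add_replica("replica-2")
print(router.route("alice") == home)  # the session did not move on upscale
router.remove_replica(home)
print(router.route("alice") in router.replicas)  # redirected on downscale
```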

Upscaling implies that there are more builds to serve, which consume more resources. The scheduling of new controller replicas is blocked when the cluster reaches capacity.

To ensure that you have enough resources, CloudBees recommends the following:

Use the Kubernetes cluster autoscaler to ensure that the cluster has enough resources to accommodate the new replicas.
Consider the use of dedicated node pools for controllers.
Assign a lower priority class to agent pods, so that controller pods are scheduled first. This allows controller pods to evict agent pods, if necessary.
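The priority class recommendation could look like the following sketch, which builds two Kubernetes PriorityClass manifests as plain dictionaries and prints one as JSON (kubectl accepts JSON as well as YAML). The class names and numeric values are hypothetical; only the manifest fields come from the Kubernetes scheduling.k8s.io/v1 API. Pods would then reference a class through spec.priorityClassName.

```python
import json


def priority_class(name, value, description):
    """Build a Kubernetes PriorityClass manifest as a plain dict."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,  # a higher value means a higher scheduling priority
        "globalDefault": False,
        "description": description,
    }


# Hypothetical names and values, chosen only for this example.
controller_pc = priority_class("ci-controller", 1000, "Controller replicas")
agent_pc = priority_class("ci-agent", 100, "Build agent pods")

print(json.dumps(agent_pc, indent=2))
```

With the agent class valued below the controller class, the scheduler can preempt agent pods to make room for a pending controller replica.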

Install High Availability

You can install High Availability on both CloudBees CI on traditional platforms and CloudBees CI on modern cloud platforms.

In addition to other Kubernetes environments, you can install High Availability on any of the following: Azure Kubernetes Service (AKS), Amazon Elastic Kubernetes Service (EKS), or Google Kubernetes Engine.

Considerations about High Availability (HA)

High Availability (HA) and Configuration as Code (CasC)

When using CasC in controllers running in HA mode, the CloudBees Configuration as Code export and update screen may display inconsistent information about the bundle, along with two buttons: Restart and Reload. This happens because information is not properly synchronized between replicas. Furthermore, users may experience the following problems when trying to use one of the two buttons on that page.

Automatic reload bundle: clicking this button will show an error message.
Skip new bundle version: clicking this button will force a restart and the instance will not start again.

While the fix for this issue is being worked on, CloudBees recommends the following if you are using CasC in controllers running in HA mode.

Controllers configured with automatic reload: Users must disable it and configure automatic restart instead.
Controllers without any automation (Bundle Update Timing): Users must stop using the Reload button and use the Restart button instead.

Troubleshooting HA

On the Manage Jenkins > CloudBees CI High Availability screen, CloudBees CI High Availability provides tools to troubleshoot problems on controller replicas running in HA mode:

The HA Developer Mode, which you enable by selecting Status in the left navigation pane of the screen. A controller running in developer mode provides additional information, such as the replica serving the user or the replica executing a build.
The HA Script Console, which allows users to run scripts across all the controller replicas and gather information from all of them.

The HA Script Console is available starting with version 2.426.3.3.

Figure 2. CloudBees CI High Availability screen

Nodes and agents

CloudBees supports a range of agent connection modes in High Availability, but each agent must have only one executor. As an agent can connect to only a single replica at a time, agents with multiple executors cannot be properly shared and are not supported by CloudBees CI High Availability.

You can share a high-capacity computer among several concurrent builds, if desired, by connecting multiple agents to the replicated controller (ensure that you use a unique remote root directory so each agent has its own workspace).
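As a minimal sketch of this layout, the following function plans several single-executor agents for one machine, each with its own remote root directory. The dictionary keys, the naming scheme, and the default base directory are hypothetical and do not correspond to any CloudBees CI API; they only illustrate the uniqueness requirement.

```python
def agent_definitions(host, count, base_dir="/home/jenkins"):
    """Return one single-executor agent definition per concurrent build slot.

    The keys and naming scheme are illustrative assumptions.
    """
    return [
        {
            "name": f"{host}-agent-{i}",
            "executors": 1,  # HA requires exactly one executor per agent
            "remote_root": f"{base_dir}/agent-{i}",  # unique per agent
        }
        for i in range(1, count + 1)
    ]


for agent in agent_definitions("bigbox", 3):
    print(agent["name"], agent["remote_root"])
```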

Shared libraries

To use Groovy libraries, CloudBees recommends that you set them to a new “clone” mode and configure Git to use a shallow clone.

Concurrent access can cause problems if you update library checkouts in a common directory (such as $JENKINS_HOME/workspace/) or use the caching system. An administrative monitor guides you to make these changes.

Non-Pipeline projects

Problems can arise in High Availability mode with non-Pipeline project types, such as freestyle, matrix, or Maven. When these project types are run, other replicas can load completed build metadata, but cannot take over the build successfully. As a result, if a replica terminates, any builds running on it are immediately aborted.
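The difference in failover behavior can be illustrated with a toy model: when a replica exits, its Pipeline builds are handed to surviving replicas, while its non-Pipeline builds are aborted. The function name, the build dictionaries, and the round-robin choice of adopting replica are assumptions made for the example; the source does not describe how adopted builds are distributed.

```python
from itertools import cycle


def handle_replica_exit(builds, exited_replica, survivors):
    """Return (adopted, aborted) for the builds that ran on exited_replica."""
    targets = cycle(survivors)  # round-robin is an assumption of this sketch
    adopted, aborted = {}, []
    for build in builds:
        if build["replica"] != exited_replica:
            continue  # builds on other replicas are unaffected
        if build["type"] == "pipeline":
            adopted[build["name"]] = next(targets)
        else:
            # Freestyle, matrix, and Maven builds cannot be taken over.
            aborted.append(build["name"])
    return adopted, aborted


builds = [
    {"name": "deploy", "type": "pipeline", "replica": "replica-0"},
    {"name": "legacy-freestyle", "type": "freestyle", "replica": "replica-0"},
    {"name": "tests", "type": "pipeline", "replica": "replica-1"},
]
adopted, aborted = handle_replica_exit(builds, "replica-0", ["replica-1"])
print(adopted, aborted)
```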

Horizontal scalability

The benefit of having multiple replicas should always be balanced against the associated cost according to your business case. Scaling horizontally with many replicas will have diminishing returns as the number of replicas increases.

Plugin installation and HA

Plugins can be managed and installed from the Manage Jenkins > Plugins screen. When using HA with multiple replicas, dynamic loading of plugins (plugin installation without restarting CloudBees CI) is not supported. Therefore, you must restart each replica of the controller to install or upgrade plugins.

Figure 3. Dynamic loading of plugins not supported

On CloudBees CI on modern cloud platforms with a managed controller running in HA mode, selecting Restart Jenkins when installation is complete and no jobs are running performs a rolling restart. When it completes, the new plugin versions are available on all replicas.

On CloudBees CI on traditional platforms running in HA mode with multiple replicas, you must restart all controller replicas either manually or using your own automation.

When the controller is running in HA mode with only one replica, the behaviour is the same as a non-HA controller.

HA and REST-API endpoints

When running a controller in HA mode, any API pull-based endpoint only returns information about the controller replica that responds to the API request.

Examples of these endpoints are:

The /metrics endpoint of the Metrics plugin.
The /monitoring endpoint provided by the Monitoring plugin.

In this situation, the response may be inaccurate or incomplete, depending on the metrics or information requested.

For example, an HTTP API query for JVM heap usage returns a value that corresponds only to the replica that processed the request and provides no insight into the other replicas. However, other information, such as the number of projects, is accurate because it is automatically synchronized among all the controller replicas.
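One way to reason about this is to keep replica-local values keyed by replica and to treat only synchronized values as global. The sketch below assumes that one sample per replica has already been collected; how each replica is reached individually is deployment-specific and not covered here. The function name and the heap_used_bytes and project_count keys are hypothetical and are not the field names of the Metrics or Monitoring plugins.

```python
def summarize(samples):
    """Combine per-replica samples into a controller-wide view.

    samples maps a replica name to a dict with the hypothetical keys
    "heap_used_bytes" (replica-local) and "project_count" (synchronized).
    """
    heap = {replica: m["heap_used_bytes"] for replica, m in samples.items()}
    counts = {m["project_count"] for m in samples.values()}
    return {
        "heap_used_bytes_by_replica": heap,
        "heap_used_bytes_total": sum(heap.values()),
        # Synchronized values should agree; report None if they do not.
        "project_count": counts.pop() if len(counts) == 1 else None,
    }


samples = {
    "replica-0": {"heap_used_bytes": 512, "project_count": 42},
    "replica-1": {"heap_used_bytes": 768, "project_count": 42},
}
print(summarize(samples))
```

A single response from the load-balanced endpoint corresponds to just one entry of samples, which is why heap figures taken from it alone understate the controller as a whole.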

Stage View and Blue Ocean plugins

The Pipeline: Stage View and Blue Ocean plugins may not accurately display running builds that are owned by other replicas.

CloudBees recommends the CloudBees Pipeline Explorer plugin.