High Availability capabilities and architecture
The CloudBees CI High Availability feature provides:
- Controller Failover: If a controller fails, Pipeline builds that normally run on that controller are automatically triggered or continued by another replica.
- Horizontal scaling: One logical controller spreads its workload evenly across multiple replicas. Refer to Build scheduling and explicit load balancing for more information.
- Rolling restart with zero downtime for CloudBees CI on modern cloud platforms: If a controller is restarted, replicas are replaced one by one, and users experience no downtime.
- Rolling upgrades with zero downtime for CloudBees CI on modern cloud platforms: If a managed controller running in HA mode must be updated to a new version, replicas are incrementally updated without restarting the managed controller and with zero downtime.
- Auto-scaling for CloudBees CI on modern cloud platforms: You can set up managed controllers to increase the number of replicas depending on the workload. They scale up when CPU usage exceeds a threshold and scale down when conditions return to normal.
From a high-level perspective, these capabilities are provided using the architecture described in the image below:
- Controller replicas make High Availability possible.
- The load balancer spreads the workload between the different controller replicas. Refer to Build scheduling and explicit load balancing for more information.
- A shared file system persists controller content.
- Hazelcast keeps the controllers’ live state in sync.
- If needed, CloudBees CI HA uses a reverse-proxied HTTP or TCP connection so that one replica can access resources belonging or connected to other controller replicas, such as:
  - Running builds.
  - WebSocket inbound agents.
  - TCP inbound agents from outside the CloudBees CI network.
As described above, HA requires that controller replicas can reach each other over TCP for the Hazelcast nodes and over HTTP by IP or hostname.
If an HTTP Proxy Configuration is used, the replica IPs and hostnames must be added to the non-proxy host list.
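As a minimal sketch, assuming the proxy is managed through Configuration as Code (the `proxy` root element of the JCasC plugin), the exclusion can be expressed with the `noProxyHost` field. The proxy host, port, and the host patterns below are placeholders and depend on your proxy and on the namespace and service names of your controller replicas:

```yaml
# Sketch only: JCasC proxy configuration with replica hosts excluded from the proxy.
# The proxy host, port, and the cluster-internal patterns are placeholders;
# adjust them to your proxy and to the namespace/service names of your replicas.
proxy:
  name: "proxy.example.com"
  port: 3128
  noProxyHost: |-
    localhost
    127.0.0.1
    *.cloudbees-core.svc.cluster.local
```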
High Availability (active/active) vs. High Availability (active/passive)
Unlike the older active/passive HA system, the mode described here is a symmetric active/active cluster.
In the previous High Availability (active/passive) mode, the cluster is not symmetric and the controllers do not share the workload. At any given point, only one of the replicas works as the controller. When a failover occurs, one of the other replicas takes over the controller role, and users experience downtime comparable to restarting a Jenkins controller in a non-HA setup.
With the CloudBees CI High Availability (active/active) mode described in this guide, all controller replicas are always active, and the controller’s workload is spread between them. When one of the replicas fails, the other replicas adopt all of its builds, and users experience no downtime.
Horizontal auto-scaling
Horizontal auto-scaling is available for managed controllers running in HA mode. It uses Kubernetes Horizontal Pod Autoscaling.
The Horizontal Pod Autoscaling (HPA) controller monitors resource utilization and adjusts the scale of its target to match your configuration settings. For example, if utilization exceeds your defined threshold, the autoscaler increases the number of replicas. You can learn how to set up auto-scaling for your managed controller here.
Kubernetes Horizontal Pod Autoscaling (HPA) is part of the Helm charts provided by CloudBees. However, to use auto-scaling with CloudBees CI High Availability (HA), you must install the required Metrics Server in your cluster. CloudBees CI High Availability (HA) has been tested using the Kubernetes Metrics Server.
CloudBees recommends performance testing to determine appropriate thresholds that do not affect response time.
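For illustration only, the following is a minimal sketch of an autoscaling/v2 HorizontalPodAutoscaler targeting a controller StatefulSet. The name `my-controller`, the namespace, and the thresholds are assumptions and must be adapted to your installation; the CloudBees Helm charts can generate an equivalent object for you:

```yaml
# Sketch only: a CPU-based HPA for an HA managed controller.
# The StatefulSet name, namespace, replica bounds, and threshold are placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-controller
  namespace: cloudbees-core
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: my-controller
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```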
Details for the upscale events:
- No rebalance of builds is performed.
- Builds continue to run on existing replicas.
- New builds are dispatched between replicas. Refer to Build scheduling and explicit load balancing for more information.
- Due to sticky session usage, any existing session remains on the same replica.
- New sessions are distributed randomly between replicas.
Details for downscale events:
- Builds from removed replicas are adopted by the remaining replicas. Refer to Build scheduling and explicit load balancing for more information.
- Web sessions associated with a removed replica are redirected to a remaining replica.
Upscaling means there are more builds to serve, which consume more resources. Scheduling of new controller replicas is blocked when the cluster reaches capacity.
To ensure that you have enough resources, CloudBees recommends the following:
- Use the Kubernetes cluster autoscaler to ensure that the cluster has enough resources to accommodate the new replicas.
- Consider the use of dedicated node pools for controllers.
- Assign a lower priority class to agent pods so that controller pods are scheduled first. This allows controller pods to evict agent pods, if necessary, as shown in the sketch after this list.
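A minimal sketch of the last recommendation, assuming hypothetical class names `controller-priority` and `agent-priority`; agent pod templates would then reference the lower class through `priorityClassName`:

```yaml
# Sketch only: two PriorityClasses so that controller pods outrank agent pods.
# Names and values are placeholders; assign the lower class to agent pod templates
# via priorityClassName so the scheduler can preempt agents when controllers scale up.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: controller-priority
value: 1000000
globalDefault: false
description: "Higher priority for CloudBees CI controller pods."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: agent-priority
value: 1000
globalDefault: false
description: "Lower priority for build agent pods; may be preempted by controller pods."
```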