High Availability and CloudBees CD (CloudBees Flow) Performance

What type of performance gain should you expect from an HA/HS setup of CloudBees CD (CloudBees Flow)?

With the release of 5.0, Electric Cloud introduced High Availability (HA) and High Scalability (HS), and some customers have questions about the benefit of these features. This article attempts to address some of those questions.

As an HA solution, spinning up additional CloudBees CD (CloudBees Flow) servers provides redundancy, so that the unexpected loss of one server does not take the service down. This architecture change also provides the benefits of High Scalability, since jobs are distributed across a set of machines. However, scaling up your system could cause a "weakest-link" scenario. If your database is already a bottleneck, moving to HA alone will not fix that problem. So if you are setting up an HS solution to push more work through the CloudBees CD (CloudBees Flow) system, you must ensure that your database and other aspects of your environment (such as your network and agent pools) can handle a larger load.

While the HA architecture does provide management of work to be distributed to multiple servers, CloudBees CD (CloudBees Flow) still runs only a single scheduler on one of the CloudBees CD (CloudBees Flow) servers. If the CloudBees CD (CloudBees Flow) server that is managing the scheduler experiences an unexpected drop, there is a small delay of service while another server picks up that responsibility, but no jobs should be lost.

Because there is only one scheduler across all servers, once a job begins, any steps are assigned to be managed by a single server so that all work for that job goes through that server. But the scheduler must still manage the queue of steps and determine which steps should run at what time (based on priorities, order of submission, and so on).

Each CloudBees CD (CloudBees Flow) server collects its own history inside the commander.log files, so having all steps for a single job managed by a single server makes it convenient to trace through any log output tied to that job. The exception is if a downtime event occurs on an HA server such that the job is redistributed to be run through one of the other CloudBees CD (CloudBees Flow) servers.

So are jobs ever distributed across servers? Other than in a failover, I’m having trouble seeing the advantage of multiple CloudBees CD (CloudBees Flow) servers.

As stated, the implementation has a job allocated to a particular server so that the associated steps stay allocated to that server. The distribution of jobs is managed through a hashing mechanism, which will, over time, distribute jobs relatively evenly. But for a small number of large jobs, two such jobs could be allocated to the same machine. So there is no "active number of jobs" or "active number of steps" awareness in the decision to select a particular machine to use when a new job arrives.
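The effect of load-unaware, hash-based allocation can be illustrated with a minimal Python sketch. It is not the product's actual algorithm: the node names, the job ID format, and the use of SHA-256 with a simple modulo are all assumptions made for illustration. It shows only that a node chosen from the job ID alone evens out over many jobs but can double up when there are just a few.

```python
import hashlib
from collections import Counter

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def assign(job_id: str) -> str:
    """Pick a node from the job ID alone; current node load is never consulted."""
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


def distribution(job_count: int) -> Counter:
    """Count how many of the first job_count hypothetical jobs land on each node."""
    return Counter(assign(f"job-{n}") for n in range(job_count))


small = distribution(4)       # a handful of jobs: some node must take at least two
large = distribution(30_000)  # many jobs: counts approach an even split

print("4 jobs:     ", dict(small))
print("30,000 jobs:", dict(large))
```

With four jobs on three nodes, at least one node necessarily receives two jobs, and nothing in the selection accounts for how large those jobs are. With tens of thousands of jobs, the per-node counts should sit close to one third each.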

This same "random" distribution mechanism is also used when a server drops and the jobs must be moved over to run on other servers. So the work should be fairly evenly moved to the remaining machines. If you launch single jobs that contain a very large number of steps, this system might not be as suitable for High Scalability.

The HA distribution of work is effectively a coarse-grained approach. A more fine-grained (step-based) model would not only make tracing the flow of a job more complicated (for example, logs split between multiple nodes), it would also require additional status traffic between the servers when deciding which steps to move forward. These complications are avoided with the coarse-grained approach.

What is the mechanism that distributes jobs?

The server takes the ID from the job and pushes it through a hashing function, which ultimately allocates the job to an active server. Hence, it is relatively random, but it should result in fairly even distribution when measured against a larger volume of jobs. The mechanism is as follows:

A hashing algorithm pseudo-randomly allocates jobs to CloudBees CD (CloudBees Flow) server nodes. For a sufficiently large number of jobs, this usually causes a roughly equal allocation of jobs to nodes. CloudBees CD (CloudBees Flow) uses random-choice load balancing (rather than round-robin or least connections), because that algorithm provides efficient implementation on multiple parallel server nodes without an ongoing central coordination service.

If a node goes down, its running jobs are reallocated to remaining nodes, or if a new node comes up, a proportion of the jobs running on other nodes is reallocated to it. This uses a "consistent hashing" algorithm, which minimizes the number of jobs moved.

Message service messages for a particular job are processed by the copy of the message service running on the node responsible for that job. Server nodes ignore the message service messages for jobs for which they are not responsible.
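To make the "consistent hashing" idea concrete, here is a small Python sketch of a generic hash ring. It is an illustration of the general technique only, not CloudBees CD (CloudBees Flow) internals: the class name HashRing, the node names, the MD5 hash, and the replica count of 100 are all assumptions chosen for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable integer (MD5 is used for spread, not security)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring; names and replica count are hypothetical."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self._points = []  # sorted hash positions on the ring
        self._owner = {}   # hash position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node is placed at several positions to smooth the distribution.
        for i in range(self.replicas):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, job_id):
        # First ring position clockwise of the job's hash, wrapping at the end.
        idx = bisect.bisect(self._points, _hash(str(job_id))) % len(self._points)
        return self._owner[self._points[idx]]


job_ids = [f"job-{n}" for n in range(10_000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {j: ring.node_for(j) for j in job_ids}

ring.remove("node-b")  # simulate a node going down
after = {j: ring.node_for(j) for j in job_ids}

moved = [j for j in job_ids if before[j] != after[j]]
# Every job that moved was previously owned by the removed node.
print(len(moved), all(before[j] == "node-b" for j in moved))
```

In this sketch, jobs owned by the surviving nodes keep their owner when a node is removed, and only the removed node's jobs are spread across the others. A plain "hash modulo node count" scheme would instead reshuffle most jobs whenever the node count changed, which is the behavior consistent hashing is designed to avoid.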

Because this solution is coarse-grained, High Scalability is less effective for large jobs with many steps than for smaller jobs with fewer steps, but distributing jobs across multiple servers still allows more scalability. Note, however, that creating an HA/HS environment with CloudBees CD (CloudBees Flow) might expose limitations in other areas of your system, so be prepared to make changes where needed.

What can cause a different load distribution between nodes?

Large differences in distribution between nodes can be caused by a very brief node outage. For example, if there are two nodes and one of them is rebooted, every job the rebooted node was handling at the time is offloaded to the other node for those few minutes, which could increase the job count substantially for that node.

An unbalanced load can also be caused by a high average job runtime, or by someone making requests directly to a specific node.

How many server nodes should we be configuring?

This question will always be relative. Job loads, jobStep loads, object creation loads, UI Refreshes, and others can all be factors in determining a "best architecture" or "sweet spot" for any given environment.

While some scenarios may be able to survive running on 2 nodes when clustering, the general recommendation when shifting from a single-node system to a clustered environment, for either High Availability (HA) or High Scalability (HS) purposes, is to aim for 3 nodes as your baseline. If you are moving from a 1-node or 2-node system to 3 nodes, the added capacity from the additional node(s) will typically be enough to handle your load at the time of the move. Once you are running 3 nodes, usage may continue to grow, eventually reaching a point where additional server nodes need to be considered. If you are seeing behaviour patterns that suggest load limits are being reached and that more than 3 nodes may be required for your environment, it is typically best to contact the Support organization.