Why do Shared Agents / Cloud show a suspended status while jobs wait in the queue?

Issue

Operations center agents show as suspended while jobs wait in the client controller queue.
Why are my agents showing up as suspended?
How do I switch to a different strategy for my operations center agents?

Environment

Resolution

Starting points

In the described environment, controllers can share agents, which means that they compete for the same agents.

The typical pattern is that the operations center agents are picked up by the first controller, which has jobs being queued at a rate sufficient to keep those agents always in use.

Once operations center leases a shared agent to a controller, the other controllers do not have access to that agent until it is released. By default, each executor of the operations center agent can be used only once by the controller. Once the agent returns to operations center control, it is ready to be leased again, either to another controller or to the same one.

Each controller has its own queue. Operations center does not know the sizes of the controllers' queues. Most critically, each controller is unaware of the needs of the other controllers.

There are several factors that come into play:

The number of executors per agent
The rate of jobs arriving
The length of time that each job takes to complete
The strategy for retaining agents on a controller

Strategies

Four different strategies have been tested against various conditions.

Single shot - where the agent is retained until at least one build has completed.
Single shot (multi-executor) - where the agent is retained until at least one build has completed and at most one build per executor has completed.
Multi shot - where the agent is retained until at least one build has completed and at most a configurable fixed number of builds have completed.
Monopolizing - where the agent is retained until there are no jobs in the queue for the agent to execute builds on.

Operations center 1.0 was released with the Single shot retention strategy and the recommendation to configure shared agents with one executor only. Single shot gives good and consistent behavior: the key metric (average time between entering the queue and starting execution) behaves as you would expect from experience with a standalone controller. That is, it is purely a function of the size of the work pool, the rate of jobs arriving, and the length of time jobs spend executing (with some modification for the provisioning delay as the execution time nears 0).

Configuring many agents can be tricky (there are some techniques to make it easier), so Single shot (multi-executor) was introduced in response to requests to support more than one executor per agent. It gave the highest throughput of builds when measured as a cluster average, so the operations center 1.1 release switched to this strategy as the default.

With a single controller, the overhead of leasing and releasing means that the Monopolizing strategy is the best. As soon as you have more than one controller (and the whole point of operations center is that you have more than one controller), any contention for the agents that a controller needs causes the Monopolizing strategy to produce large build queues on the controllers that have not yet been leased agents. The number of jobs queued on those other controllers grows, and the average time between jobs entering the queue and starting execution increases across the cluster.

Switching to Multi shot counteracts this issue. Multi shot has been tested with the count at 100, 50 and 20 builds. The probability of the key metric (average time between entering the queue and starting execution) degrading across the cluster increased as the count increased. In other words, 20 was less likely than 50 to go bad, and 50 was less likely than 100.

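Non-monopolizing strategies keep a count of the number of builds that have started on the agent. As soon as that count goes above the strategy's limit, the agent is marked as suspended (which is how Jenkins is told not to assign any more work to that agent). An agent shown as suspended has therefore reached its build limit on that controller and accepts no further work from it, even if jobs are still waiting in that controller's queue.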
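The counting behavior described above can be illustrated with a minimal Java sketch. It is a hypothetical model and not the CloudBees implementation: the class name `ShotCountSketch` and its methods are invented for illustration, and it assumes the agent is suspended as soon as the configured number of builds has started (the exact boundary in the product may differ). A negative limit stands for the Monopolizing strategy, which never suspends on count; a limit of 0 is not described in this article and is treated here like a limit of 1.

```java
// Hypothetical model for illustration only; not the CloudBees implementation.
final class ShotCountSketch {
    private final int shotLimit;       // negative = Monopolizing (no count limit)
    private int buildsStarted = 0;
    private boolean suspended = false;

    ShotCountSketch(int shotLimit) {
        this.shotLimit = shotLimit;
    }

    /** Tries to start a build; returns false if the agent is already suspended. */
    boolean tryStartBuild() {
        if (suspended) {
            return false;
        }
        buildsStarted++;
        // Assumption: suspend once the limit is reached so no further build starts.
        if (shotLimit >= 0 && buildsStarted >= shotLimit) {
            suspended = true;
        }
        return true;
    }

    boolean isSuspended() {
        return suspended;
    }

    int buildsStarted() {
        return buildsStarted;
    }

    public static void main(String[] args) {
        ShotCountSketch multiShot = new ShotCountSketch(2);
        multiShot.tryStartBuild();
        multiShot.tryStartBuild();
        // A third build is refused because the agent is now suspended.
        System.out.println("third build accepted: " + multiShot.tryStartBuild());
        System.out.println("suspended: " + multiShot.isSuspended());
    }
}
```

In this model a suspended agent still counts as leased to the controller, which matches the symptom in the Issue section: the agent is visible but takes no new work while other jobs wait.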

How to switch to a different strategy?

This is configured via a Java system property on the controller. Use the corresponding Java argument from the list below (see How to add Java arguments to Jenkins?).

Single shot (default) - -Dcom.cloudbees.opscenter.client.cloud.CloudImpl.retentionStrategyShotCount=1
Single shot (multi-executor) - Single shot plus a number of executors greater than 1 (set on the operations center side, in the Shared Agent / Cloud configuration)
Multi shot - -Dcom.cloudbees.opscenter.client.cloud.CloudImpl.retentionStrategyShotCount=<NUMBER_GREATER_THAN_1> (e.g. -Dcom.cloudbees.opscenter.client.cloud.CloudImpl.retentionStrategyShotCount=2)
Monopolizing - -Dcom.cloudbees.opscenter.client.cloud.CloudImpl.retentionStrategyShotCount=<NUMBER_LESS_THAN_0> (e.g. -Dcom.cloudbees.opscenter.client.cloud.CloudImpl.retentionStrategyShotCount=-1)
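As a quick way to read the list above, the following Java sketch maps a shot count and an executor count to the strategy name it corresponds to. The class `RetentionStrategyNames` and its `describe` method are hypothetical helpers written for this article; they are not part of any CloudBees API. The value 0 is not covered by the list, so the sketch reports it as undefined rather than guessing. The `main` method reads the same system property with the standard `Integer.getInteger` call, falling back to the documented default of 1.

```java
// Hypothetical helper for illustration only; not part of any CloudBees API.
final class RetentionStrategyNames {
    static final String PROPERTY =
            "com.cloudbees.opscenter.client.cloud.CloudImpl.retentionStrategyShotCount";

    /** Maps the configured values to the strategy names used in this article. */
    static String describe(int shotCount, int executors) {
        if (shotCount < 0) {
            return "Monopolizing";
        }
        if (shotCount == 1) {
            return executors > 1 ? "Single shot (multi-executor)" : "Single shot";
        }
        if (shotCount > 1) {
            return "Multi shot";
        }
        return "undefined"; // 0 is not described in the list above
    }

    public static void main(String[] args) {
        // Reads -D<PROPERTY>=<n> if present; 1 is the documented default.
        int shotCount = Integer.getInteger(PROPERTY, 1);
        // Executor count is set on the operations center side; 1 is assumed here.
        System.out.println(describe(shotCount, 1));
    }
}
```

For example, running it with `-Dcom.cloudbees.opscenter.client.cloud.CloudImpl.retentionStrategyShotCount=20` selects the Multi shot branch.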

To test a value "on the fly", go to the Script Console and run the following script, replacing <NUMBER> with a value from the list above.

com.cloudbees.opscenter.client.cloud.CloudImpl.retentionStrategyShotCount=<NUMBER>