Why do Shared Agents / Clouds show as suspended while jobs wait in the queue?

Article ID:204690520

Issue

  • Operations Center (OC) agents show as suspended while jobs wait in the client controller queue.

  • Why are my agents showing up as suspended?

  • How do I switch to a different strategy for my OC agents?

Resolution

Starting points

In the environment described, controllers can share agents, which means that they compete for the same agents.

The typical pattern is that the OC agents are picked up by the first controller, whose jobs are queued at a rate high enough to keep those agents always in use.

Once an OC agent leases its executors to a controller, the other controllers do not have access to that OC agent until it is released. While it is leased, an OC agent can use each of its executors only once for the same controller. Once the OC agent goes back under OC control, it recovers all its defined executors, ready for the next (or the same) controller.

Each controller has its own queue and OC does not know the size of the queue that it is being requested to provision against. Most critically of all, each controller is unaware of the needs of the other controllers.

There are a number of different factors that come into play:

  • The number of executors per agent

  • The rate of jobs arriving

  • The length of time that each job takes to complete

  • The strategy for retaining agents on a controller

Strategies

Four different strategies have been tested against a variety of conditions.

  • Single shot - where the agent is retained until at least one build has completed.

  • Single shot (multi-executor) - where the agent is retained until at least one build has completed and at most one build per executor has completed.

  • Multi shot - where the agent is retained until at least one build has completed and at most a configurable fixed number of builds have completed.

  • Monopolising - where the agent is retained until there are no jobs in the queue for the agent to execute builds on.

OC 1.0 was released with the Single shot retention strategy and the recommendation to configure shared agents with one executor only. Single shot gives very good and consistent behaviour: the key metric (average time between entering the queue and starting execution) behaves as you would expect from experience with a standalone controller. That is, it is purely a function of the size of the work pool, the rate of jobs arriving and the length of time jobs spend executing (with some modification for the provisioning delay as the length of time executing nears 0).

It can be tricky to configure many agents (there are some techniques to make it easier), so in response to requests to support more than one executor per agent, Single shot (multi-executor) was introduced. Unsurprisingly, it gave the highest throughput of builds when measured as a cluster average, so the OC 1.1 release switched to this strategy as the default.

When you have a single controller, the overhead of lease and release means that the Monopolising strategy is the best. As soon as you have more than one controller (and the whole point of OC is that you have more than one controller), if there is any contention at all for the agents that the controllers need, the Monopolising strategy leads to queue explosions. This is especially the case where the agents have more than one executor. As mentioned, once an OC agent is leased to a controller, the other controllers never get a chance to use that agent (hence the name Monopolising). What you then see is that the number of jobs in the queue on the other controllers explodes and the key metric (average time between entering the queue and starting execution) across the cluster skyrockets.

Switching to Multi shot counteracts this issue. Multi shot has been tested with the count at 100, 50 and 20 builds. Testing showed that the probability of the key metric (average time between entering the queue and starting execution) skyrocketing across the cluster increased as the count increased. In other words, 20 was less likely than 50 to go rogue, and 50 was less likely than 100.
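As a rough illustration of how the work pool size, the arrival rate and the job duration interact, the following Java sketch computes the average fraction of executors kept busy. It is a back-of-envelope model only: the class and method names are hypothetical, it is not part of any CloudBees API, and it ignores provisioning delay and contention between controllers.

```java
// Back-of-envelope capacity check for a pool of shared executors.
// All names are hypothetical; this is not CloudBees code.
final class QueueEstimate {

    /** Offered load: average number of executors the arriving jobs keep busy. */
    static double offeredLoad(double jobsPerMinute, double minutesPerJob) {
        return jobsPerMinute * minutesPerJob;
    }

    /**
     * Average fraction of the pool that is busy.
     * At 1.0 or above, jobs arrive faster than they can be served and the queue keeps growing.
     */
    static double utilisation(double jobsPerMinute, double minutesPerJob, int executors) {
        return offeredLoad(jobsPerMinute, minutesPerJob) / executors;
    }

    public static void main(String[] args) {
        // Example figures (assumed): 2 jobs per minute, 4 minutes per job, 10 executors.
        double u = utilisation(2.0, 4.0, 10);
        System.out.println("utilisation = " + u);
    }
}
```

With the example figures, the jobs keep 8 of the 10 executors busy on average. The closer this fraction gets to 1, the longer jobs should be expected to wait in the queue.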

Non-monopolising strategies keep a count of the number of builds that have started on the agent. As soon as that count goes above the strategy's limit, the strategy marks the agent as suspended (which is how you tell Jenkins not to assign any more work to that agent).
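The counting behaviour can be pictured with a small Java model. This is a simplified sketch, not the plugin's actual implementation: the class LeasedAgent and its methods are hypothetical, it models a single executor, and it assumes the agent is suspended at the moment the number of started builds reaches the limit (the exact boundary in the real code may differ). A negative limit stands in for the Monopolising strategy, which never suspends the agent on a count.

```java
// Hypothetical model of the build counting described above; not the actual plugin code.
final class LeasedAgent {
    private final int shotLimit;   // assumed: builds allowed per lease; negative means no limit
    private int buildsStarted;     // builds started since the agent was leased
    private boolean suspended;     // true once no more work may be assigned

    LeasedAgent(int shotLimit) {
        this.shotLimit = shotLimit;
    }

    /** Tries to start a build; returns false if the agent is suspended and refuses the work. */
    boolean startBuild() {
        if (suspended) {
            return false;
        }
        buildsStarted++;
        // Assumption: suspend as soon as the limit is reached, so no further build is assigned.
        if (shotLimit > 0 && buildsStarted >= shotLimit) {
            suspended = true;
        }
        return true;
    }

    boolean isSuspended() {
        return suspended;
    }

    public static void main(String[] args) {
        LeasedAgent agent = new LeasedAgent(2);   // Multi shot with a count of 2
        System.out.println(agent.startBuild());   // first build is accepted
        System.out.println(agent.startBuild());   // second build is accepted, agent is now suspended
        System.out.println(agent.startBuild());   // third build is refused
    }
}
```

In this model, a suspended agent seen alongside a non-empty queue simply means the agent has used up its shots for the current lease and is waiting to be released back to OC.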

How to switch to a different strategy?

This is configured via a Java system property on the controller. Use the corresponding Java argument from the list below (see How to add Java arguments to Jenkins?).

  • Single shot (default) - -Dcom.cloudbees.opscenter.client.cloud.CloudImpl.retentionStrategyShotCount=1

  • Single shot (multi-executor) - Single shot + # of executors > 1 (set on the Operations Center side, in the Shared Agent / Cloud configuration).

  • Multi shot - -Dcom.cloudbees.opscenter.client.cloud.CloudImpl.retentionStrategyShotCount=<NUMBER_GREATER_THAN_1> (e.g. -Dcom.cloudbees.opscenter.client.cloud.CloudImpl.retentionStrategyShotCount=2).

  • Monopolising - -Dcom.cloudbees.opscenter.client.cloud.CloudImpl.retentionStrategyShotCount=<NUMBER_LESS_THAN_0> (e.g. -Dcom.cloudbees.opscenter.client.cloud.CloudImpl.retentionStrategyShotCount=-1).
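The mapping in the list above can be summarised as a small Java helper. This is an illustrative sketch only: the class RetentionStrategyName and its method are hypothetical and do not exist in the product, and the value 0 is reported as not covered because this article does not describe it.

```java
// Hypothetical helper that names the strategy selected by a given shot count.
// It only restates the list above; it is not CloudBees code.
final class RetentionStrategyName {

    static String describe(int shotCount, int executorsPerAgent) {
        if (shotCount < 0) {
            return "Monopolising";
        }
        if (shotCount == 1) {
            // Same property value; the executor count configured in OC makes the difference.
            return executorsPerAgent > 1 ? "Single shot (multi-executor)" : "Single shot";
        }
        if (shotCount > 1) {
            return "Multi shot";
        }
        return "not covered by this article"; // shotCount == 0
    }

    public static void main(String[] args) {
        System.out.println(describe(1, 1));
        System.out.println(describe(1, 4));
        System.out.println(describe(20, 2));
        System.out.println(describe(-1, 2));
    }
}
```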

To test a value "on the fly", go to the Script Console and run the following script, replacing <NUMBER> with a value as described in the list above.

com.cloudbees.opscenter.client.cloud.CloudImpl.retentionStrategyShotCount=<NUMBER>