KBEC-00480 - Why are my wait times so long?

Question

I’ve seen times where my step statistics for resource wait times are longer than usual, but it’s not consistent.
How can I find out what may be causing longer than normal wait times?

Answer

In a dynamic environment, there are numerous ways extended resource wait times may arise.
Most of these patterns stem from mismatched expectations, tied either to configuration or to the underlying code design.
Most commonly they are errors of omission: someone simply didn’t know that others had implied expectations or restrictions on how a resource or pool should be used.

It’s important to recognize that such patterns rarely arise when you start crafting your designs, but instead show up over time, usually only after procedures/processes start being launched with greater frequency.
In other words, the system may have worked fine for months or years before the problem is noticed. This may mean that the pattern arises after the original designers are no longer around to help explain their implied expectations.

Such problems also tend to arise inadvertently when people copy code. In particular, the selection of a resource or pool may get copied without concern for adding new loads to these existing resources.

Another common pattern is that people craft designs that simply use the local resource during their testing phase, and do not change that when moving to broader use in production. Eventually the local resource gets overloaded, which may impact system work like schedules.

Start with considering CPU bandwidth:

Focus first on the step inside the procedure or process you know is behaving slowly.
You first want to confirm whether the resource is being assigned explicitly, implicitly through a pool definition, or implicitly through some dynamic reference to a property that gets set in an earlier step.
It’s always worthwhile to ask: "Has this process/procedure been working fine for a long time, or did something recently change?"

If the step involved contains an explicit resource selection, then the problem may be due to too many jobs demanding this same resource.
You want to confirm what (if any) step limit is defined for this resource.
By default, a step limit is left unset for a resource, which means any number of jobSteps can be assigned to this resource.
The problem with having no defined limit is that too many jobs assigned to the same resource will bog down that resource’s responsiveness, potentially making the assigned jobs run very slowly, perhaps to the point of not even having enough CPU cycles to process an incoming start request.
So the solution here might be to define an appropriate step Limit for this resource, but also to temporarily monitor how this resource is being used. Perhaps adding additional resources to help distribute the workload to more machines will be necessary.
In other words, consider using a pool instead of an explicit resource.

If the step definition already uses a pool, the number of resources inside that pool should be reviewed, as well as the sum of step Limits that these resources offer.
As above, this could be a case of too much work being assigned to not enough CPUs, but this becomes less likely the larger the number of step slots you have available (and the faster such steps are completing).
If it’s that straightforward, you may simply want to consider adding more resources to this pool.
If adding machines will take time to requisition, then implementing some means of throttling your demand for these resources would be another way to avoid bottlenecks in the short term.
A throttling approach may simply mean making jobs wait before starting to demand any actual "work" steps. This technique can be implemented earlier in the process to limit how many of these jobs are demanding the resources in this pool during a particular time window.
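
The pool review described above can be sketched as a small calculation. This is only an illustration in Python, not CloudBees CD API code: the resource names and limits are hypothetical, and None stands in for a resource whose step limit is unset.

```python
from typing import Dict, Optional

def pool_step_capacity(step_limits: Dict[str, Optional[int]]) -> Optional[int]:
    """Sum the step limits of the resources in a pool.

    Returns None when any resource has no step limit set, because the
    pool then has no defined ceiling on concurrently assigned steps.
    """
    if any(limit is None for limit in step_limits.values()):
        return None
    return sum(limit for limit in step_limits.values() if limit is not None)

# Hypothetical pool contents, for illustration only.
bounded_pool = {"build01": 10, "build02": 10, "build03": 4}
unbounded_pool = {"build01": 10, "build02": None}

print(pool_step_capacity(bounded_pool))    # 24
print(pool_step_capacity(unbounded_pool))  # None
```

A None result is a hint that at least one resource in the pool can be overloaded without the scheduler ever making steps wait.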

If the resource is neither explicit nor a pool reference, but is instead defined through code or a property, then you may need to review that code to ensure that the property contains the resource you expected it to contain.
Otherwise, the review here should be the same as in the explicit resource case outlined above.

An example of throttling how many jobs compete for resources:

Assume you have a procedure that contains a "work" section with 100 steps running in parallel.
If 10 of these jobs are running concurrently, you need 1000 jobStep "slots" to complete your work. You have reviewed the pool involved, called "poolx", and determined that you have 50 machines, where you tested that 10 active steps will work safely on each machine.
This gives you a maximum of only 500 steps at any one time. Any number of running jobs less than or equal to 5 will fully allocate their work. Having 10 jobs run at the same time will result in competition for who gets assigned resources, slowing down all 10 of the jobs.
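
The arithmetic in this example can be written out as a short Python sketch. All numbers come from the scenario above; the variable and function names are made up for illustration.

```python
# Capacity arithmetic for the "poolx" example (values taken from the text).
machines = 50          # resources in "poolx"
step_limit = 10        # steps tested as safe on each machine
steps_per_job = 100    # parallel "work" steps in one job
concurrent_jobs = 10   # jobs launched at the same time

pool_slots = machines * step_limit                   # total step slots
demand = steps_per_job * concurrent_jobs             # slots requested at once
max_unthrottled_jobs = pool_slots // steps_per_job   # jobs that fit fully

def waiting_steps(jobs: int) -> int:
    """Steps that cannot be assigned immediately when `jobs` run at once."""
    return max(0, jobs * steps_per_job - pool_slots)

print(pool_slots, demand, max_unthrottled_jobs, waiting_steps(concurrent_jobs))
# prints: 500 1000 5 500
```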

To create a throttle, you can create a set of faux-resources. These won’t do any real work; they simply get used to limit the number of jobs that will move to the "work" stage of their process.
For this case, create 5 resources called "poolx_slot1", "poolx_slot2", etc., set each of these 5 resources to have a step limit of 1, and place them all into a pool, "poolx_jobcontrol".
Before the 100 parallel steps begin, add a step that takes from "poolx_jobcontrol" and mark that step to use job exclusivity. Job exclusivity will result in that resource being "owned" by this job until it is released.
Releasing can be handled by adding a step near the bottom of the procedure designated to release the resource, or else by letting it be freed implicitly when the entire job completes.
The result is that the 6th job will not be able to take a resource from the "poolx_jobcontrol" pool, and thus can’t reach the "work" steps until 1 of the first 5 jobs has completed.
This throttle ensures that at most 5 jobs will be vying to grab resources from the pool responsible for the "work" at any one time. This will allow the first jobs to complete faster as they will spend less time waiting for a worker-resource.
It’s true that the last 5 jobs will still face wait times here, but this approach ensures these wait times will be clearly logged against the jobcontrol pool and not the work pool. In addition, putting fewer steps onto the scheduler to find an open resource will allow it to run more efficiently.
Also notice that the number of faux-resources you set for this pool could depend on the type of work at hand. If the steps free up fairly quickly, perhaps 6 or 7 concurrent jobs with competition will be deemed acceptable. This becomes an iteration for the user to determine the sweet spot that best suits their environment.
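
The job-control pool behaves like a counting semaphore with 5 permits. The following Python sketch is only an analogy using threads, not CloudBees CD configuration or API code: each thread stands in for a job, the semaphore for "poolx_jobcontrol", and the short sleep for the 100 parallel "work" steps.

```python
import threading
import time

JOB_SLOTS = 5  # mirrors the 5 faux-resources in "poolx_jobcontrol"

job_control = threading.BoundedSemaphore(JOB_SLOTS)
counter_lock = threading.Lock()
active = 0   # jobs currently in the "work" stage
peak = 0     # highest number of jobs seen in the "work" stage at once

def run_job(job_id: int) -> None:
    global active, peak
    with job_control:              # take a job-control slot, held for the whole job
        with counter_lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)           # stand-in for the parallel "work" steps
        with counter_lock:
            active -= 1
    # Leaving the block frees the slot, like the release step or job completion.

def simulate(jobs: int = 10) -> int:
    """Launch `jobs` jobs at once and return the peak number working together."""
    threads = [threading.Thread(target=run_job, args=(i,)) for i in range(jobs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak

print(simulate(10) <= JOB_SLOTS)  # True: never more than 5 jobs in the work stage
```

Raising JOB_SLOTS to 6 or 7 corresponds to adding more faux-resources to the job-control pool when some competition is acceptable.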

It doesn’t seem like a CPU overload issue - what else should we look at?

StepLimit of 1

A resource with a stepLimit of 1 can act as a semaphore that ensures only 1 job assignment can use that resource at any 1 time.

This is essentially just a special case from the patterns above, where the resource selected is defined with only 1 step for use.
If too many jobSteps are vying for this resource, perhaps you need to revisit why this resource was set up to force only 1-step-at-a-time. If a unique application requires a 1-step-at-a-time process, this might be unavoidable.
If the work can be distributed, then perhaps adding more machines (via a pool) to perform the same work uniquely on an isolated resource will be necessary to reduce the bottlenecks.

It’s recommended that the Resource Description field of the Resource be used to explain why a resource has its assigned step limit, or that this be documented in some external architecture outline, so that others can be made aware of decisions intended to limit its use.

Resource Exclusivity

Using exclusivity rules for resources gives the power to ensure a resource is only working on a given task (or set of tasks in the same job).
However, exclusivity needs to be properly managed, as it will create temporary lock-out periods.
Other users need to understand this possibility when selecting which resource they want their jobs to be using.
As a best practice, a resource expected to be used with exclusivity should be singled out to limit risk of the resource being selected by others in some generic fashion.

Exclusivity should also be avoided for steps assigned a low priority.
A low-priority step will have its work placed at the bottom of the scheduler queue. If exclusivity is set against such steps, AND the system is running with very large loads, then the time for the scheduler to reach this assignment can become delayed, creating a potential short-term deadlock.
It’s "short term" until the normal or high-priority workloads are reduced. How long that will be is relative.
Some customers may schedule large loads of work for many hours, so the unintended lockout could take hours, and leave teams expecting to be assigned this resource scratching their heads on why things got stuck.

Another current reality in the UI is that exclusivity is not identified from the resource itself.
This means the UI won’t make it clear that a resource is being limited to only 1 step or job.
The resource page of the UI can be used to help see which steps are currently running on a resource, but in the case of exclusivity, you may perceive a disconnect between the number of running steps (only 1 step for the exclusive work) and the resource’s much larger step limit.
The impression is that there must be step slots available for other work to be assigned to this resource.
In such cases, the only way to confirm what’s causing the steps to wait is to review the commander.log files for lines associated with the scheduler which will show why the resource is unavailable for an otherwise "Runnable" step.
(See Step Status values).

Network troubles affecting agent access

If a resource can’t be reached for an extended time period, perhaps the problem lies inside the network.
If it’s a resource in the default zone, then you should be able to confirm if the resource is up, and if so, when it was last pinged.
You may also want to log on to that resource to review the jagent.log (or possibly agent.log) files for the agent to confirm whether there are any lines that may identify trouble communicating with the server.

If the resource is situated in a different zone, you will need to check the above, but you also must check the status of the intermediate gateway agents that are being used to facilitate this agent’s connection to the server.

Similarly, a proxy host will require that you confirm the proxy agent remains accessible.
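
As a first check from the server (or from a gateway or proxy host), a plain TCP connection test can show whether the agent’s port is reachable at all. The sketch below is a generic Python check, not a CloudBees CD tool; the default port of 7800 and the host name in the usage line are assumptions, so substitute the port and host your agents actually use. A successful connection only shows that something is listening, not that the agent is healthy.

```python
import socket

def agent_reachable(host: str, port: int = 7800, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`.

    The default port is an assumption; use the port your agent listens on.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False

# Hypothetical usage:
# print(agent_reachable("agent01.example.com"))
```

Running the same check from each hop (server, gateway agent, proxy host) helps narrow down which link in the chain is failing.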

Workspace status?

Another consideration for the destination resource is whether the associated workspace is available.
Perhaps a mounted filer has gone down, or disk space is full to the point that the resource can’t be assigned to this step.
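
A quick way to check both conditions on the resource machine is to ask the operating system for the disk usage of the workspace path. This is a minimal Python sketch using only the standard library; the path in the usage line and the 10% threshold are hypothetical, not CloudBees CD defaults.

```python
import shutil

def workspace_has_space(path: str, min_free_fraction: float = 0.10) -> bool:
    """Return True if `path` is accessible and has enough free space.

    Returns False when the path cannot be read (for example, a filer
    mount that has gone down) or when free space is below the threshold.
    """
    try:
        usage = shutil.disk_usage(path)
    except OSError:
        return False
    if usage.total == 0:
        return False
    return usage.free / usage.total >= min_free_fraction

# Hypothetical usage:
# print(workspace_has_space("/mnt/workspaces"))
```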

Environment

CloudBees CD (CloudBees Flow)