KBEC-00241 - Jobs are stuck after the CloudBees CD (CloudBees Flow) server is brought back up

Summary

The CloudBees CD (CloudBees Flow) server went down (for example, because its database crashed) and stayed down for more than 24 hours. After the server is brought back up, jobs that were running before the crash are stuck, and no timeouts were set that would end them.

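While the server is down, the agent on which the jobs were running retries its message to the server with successively longer pauses (up to 30 seconds) for up to 24 hours. After that, the agent presumes the server is dead and drops the message. The server therefore never receives the message, and the jobs stay stuck after the server comes back.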
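The retry behavior described above can be sketched as a small simulation. This is an illustration only, not the agent's actual implementation: the article states the 30-second cap and the 24-hour limit, while the starting pause of 1 second and the doubling factor are assumptions chosen for the example, and the function name is hypothetical.

```python
from typing import Iterator

MAX_PAUSE_SECONDS = 30                # cap on a single pause, from the article
GIVE_UP_AFTER_SECONDS = 24 * 60 * 60  # 24-hour limit, from the article


def retry_pauses(initial_pause: int = 1, factor: int = 2) -> Iterator[int]:
    """Yield the pause before each retry until the 24-hour budget is spent.

    initial_pause and factor are assumptions; the article only says the
    pauses grow successively longer and never exceed 30 seconds.
    """
    elapsed = 0
    pause = initial_pause
    # Stop as soon as the next pause would exceed the 24-hour budget.
    while elapsed + pause <= GIVE_UP_AFTER_SECONDS:
        yield pause
        elapsed += pause
        pause = min(pause * factor, MAX_PAUSE_SECONDS)


pauses = list(retry_pauses())
print(pauses[:7])   # [1, 2, 4, 8, 16, 30, 30]
print(sum(pauses))  # total seconds spent waiting before the message is dropped
```

Under these assumptions the agent reaches the 30-second cap within a few retries and then keeps retrying at that interval until the 24 hours are used up, which is why an outage longer than 24 hours leaves the message undelivered.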

Solution

You can do one of the following:

Manually abort the job
Restart the agent with the stuck jobs

If you restart the agent, the server detects the restart the next time it tries to run a command on the agent or pings the agent, and the CloudBees CD (CloudBees Flow) server then aborts the running steps. It can do so safely because an agent restart is conclusive evidence that steps started during the agent's previous life are no longer running.

Additionally, you can change the 24-hour limit for a specific API call by passing the --retryTimeout global option to ectool.

ectool --retryTimeout

Amount of time to continue retrying requests that fail due to communication errors. Defaults to --timeout value unless running in a job step, in which case the default is 24 hours.
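As a rough sketch of how the option might be passed from a script, the helper below builds an ectool command line with --retryTimeout placed before the API command. The helper name build_ectool_command is hypothetical, getServerStatus is used only as a placeholder API call, and treating the value as a number of seconds is an assumption; confirm the unit against the ectool help output for your version.

```python
import shlex
from typing import List


def build_ectool_command(command: List[str], retry_timeout_seconds: int) -> List[str]:
    """Return an argv list with --retryTimeout ahead of the API command.

    Assumption: the option value is expressed in seconds.
    """
    if retry_timeout_seconds <= 0:
        raise ValueError("retry_timeout_seconds must be positive")
    # Global options go before the API command name.
    return ["ectool", "--retryTimeout", str(retry_timeout_seconds), *command]


# Placeholder call: retry for up to one hour instead of the 24-hour default.
argv = build_ectool_command(["getServerStatus"], retry_timeout_seconds=3600)
print(shlex.join(argv))  # ectool --retryTimeout 3600 getServerStatus
```

The resulting list can be handed to subprocess.run(argv, check=True) on a host where ectool is installed and has a valid session. The sketch only prints the command so that it runs without a CloudBees CD installation.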