Issue
- Pipeline builds are intermittently interrupted with the message "cancelling shell steps running on <node name>", followed by "Agent was removed".
ERROR: also cancelling shell steps running on <agentName> [...] Agent was removed
Explanation
A periodic task that runs every 5 minutes looks for pipeline runs stuck within a node block (i.e. ExecutorStepExecution) and aborts them if they are found in that state on 2 consecutive runs of the task, which means they have been stuck for more than 5 minutes. In such cases, the aborted pipeline typically prints the following and then completes:
ERROR: node block appears to be neither running nor scheduled; will cancel if this condition persists
ERROR: node block still appears to be neither running nor scheduled; cancelling
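The two-strike behaviour described above can be illustrated with a small simulation. This is a minimal sketch of the idea only, not the plugin's actual code: the class name `AnomalyChecker`, the method `run_once`, and the build identifiers are hypothetical, and only the two messages are taken from the output shown above.

```python
from dataclasses import dataclass, field

WARN = "node block appears to be neither running nor scheduled; will cancel if this condition persists"
CANCEL = "node block still appears to be neither running nor scheduled; cancelling"


@dataclass
class AnomalyChecker:
    """Toy model of the periodic check (hypothetical names, not the plugin API)."""
    suspects: set = field(default_factory=set)  # node blocks flagged on the previous run

    def run_once(self, stuck_now):
        """Process one periodic run and return (build_id, message) pairs."""
        events = []
        still_suspect = set()
        for build in sorted(stuck_now):
            if build in self.suspects:
                # Second consecutive sighting: the node block is aborted.
                events.append((build, CANCEL))
            else:
                # First sighting: warn only, and remember it for the next run.
                events.append((build, WARN))
                still_suspect.add(build)
        # Builds that recovered drop out, so the count restarts for them.
        self.suspects = still_suspect
        return events


checker = AnomalyChecker()
first = checker.run_once({"folder/job#12"})   # first periodic run
second = checker.run_once({"folder/job#12"})  # next run, 5 minutes later
print(first)
print(second)
```

In this model a build that looks stuck on one run but healthy on the next is only warned about, never cancelled, which matches the "2 consecutive runs" condition.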
In the controller logs, the following warning is observed:
WARNING o.j.p.w.s.s.ExecutorStepExecution$AnomalousStatus#lambda$doRun$4: do not know about CpsStepContext[XX:node]:<jobFullName>#<buildNumber>
Since version 1378.v6a_3e903058a_3, this periodic task also interrupts durable task step executions (such as sh / bat steps) that were running on the affected node. This is an attempt to help terminate stuck pipelines that, in a few scenarios, are not successfully interrupted by aborting the node block.
This change can cause pipeline runs that are executing a sh / bat step on a node that was in use by a stuck pipeline to have their execution interrupted. Those unrelated pipeline builds would show:
ERROR: also cancelling shell steps running on <agentName> [...] Agent was removed
Due to the single-shot nature of ephemeral agents, step executions inside Cloud agents are not impacted. Rather, static agents and particularly multi-executor agents can be impacted.
Resolution
This issue is resolved in CloudBees CI version 2.516.1.28665:
“Also cancelling shell steps” could abort pipeline builds due to much older corrupt builds
It was reported that, under certain conditions, an old pipeline build (which the user didn’t realize was still running but whose metadata was corrupt) might persistently abort new and otherwise correct builds using the same permanent agent. This resulted in the message "Also cancelling shell steps running on…". Now, this action is limited to sh steps running in the same build where the corruption was observed.
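The difference in scope before and after the fix can be sketched as a simple filter. This is an illustrative model under stated assumptions, not the product's implementation: `ShellStep`, `steps_to_cancel`, the `same_build_only` flag, and all build and agent names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShellStep:
    build: str  # build running the sh / bat step
    agent: str  # agent the step is running on


def steps_to_cancel(steps, stuck_build, stuck_agent, same_build_only):
    """Return the shell steps interrupted when stuck_build's node block is aborted."""
    on_agent = [s for s in steps if s.agent == stuck_agent]
    if same_build_only:
        # Scope described for the fixed version: only the corrupt build's own steps.
        return [s for s in on_agent if s.build == stuck_build]
    # Earlier scope: every shell step running on the affected agent.
    return on_agent


running = [
    ShellStep("old-job#3", "static-agent-1"),
    ShellStep("new-job#41", "static-agent-1"),  # unrelated build sharing the agent
    ShellStep("other-job#7", "static-agent-2"),
]
before = steps_to_cancel(running, "old-job#3", "static-agent-1", same_build_only=False)
after = steps_to_cancel(running, "old-job#3", "static-agent-1", same_build_only=True)
print([s.build for s in before])
print([s.build for s in after])
```

With the wider scope, the unrelated build sharing the multi-executor agent is interrupted as well; with the narrower scope, only the build with the corrupt metadata is affected.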
Workaround
Until you can upgrade, a workaround is to identify the anomalous pipeline builds and abort them manually. Look for the following log entries in the controller to identify the pipeline that is stuck in that state:
WARNING o.j.p.w.s.s.ExecutorStepExecution$AnomalousStatus#lambda$doRun$4: do not know about CpsStepContext[XX:node]:<jobFullName>#<buildNumber>
Then go to the build page and abort it.
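When the controller log is large, the search for these warnings can be scripted. The following is a minimal sketch that assumes the log lines end with the `<jobFullName>#<buildNumber>` form shown above; real log lines may be formatted differently, so the pattern may need adjusting. The function name `find_anomalous_builds` and the sample job name are hypothetical.

```python
import re

# Pattern based on the placeholder form shown above; this is an assumption
# about the layout of the log line and may need adapting to real logs.
ANOMALY = re.compile(
    r"do not know about CpsStepContext\[[^\]]*\]:(?P<job>.+)#(?P<build>\d+)\s*$"
)


def find_anomalous_builds(log_lines):
    """Return unique (job_full_name, build_number) pairs in order of first appearance."""
    seen = []
    for line in log_lines:
        match = ANOMALY.search(line)
        if match:
            key = (match.group("job").strip(), int(match.group("build")))
            if key not in seen:
                seen.append(key)
    return seen


sample = [
    "WARNING o.j.p.w.s.s.ExecutorStepExecution$AnomalousStatus#lambda$doRun$4: "
    "do not know about CpsStepContext[12:node]:team/app/main#57",
    "INFO unrelated line",
]
print(find_anomalous_builds(sample))
```

Each pair returned identifies one build to open and abort from its build page.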