Issue
- Pipeline builds are intermittently interrupted with the message "cancelling shell steps running on <node name>", followed by "Agent was removed".
ERROR: also cancelling shell steps running on <agentName> [...] Agent was removed
Explanation
A periodic task that runs every 5 minutes looks for pipeline runs stuck within a node block (i.e. ExecutorStepExecution) and aborts them if they are found in that state on 2 consecutive runs, that is, for more than 5 minutes. In such cases, the aborted pipeline typically prints the following and then completes:
ERROR: node block appears to be neither running nor scheduled; will cancel if this condition persists
ERROR: node block still appears to be neither running nor scheduled; cancelling
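The two-pass rule described above can be sketched roughly as follows. This is a simplified model for illustration only, not the plugin's actual code: the function name `periodic_pass` and the use of plain strings to identify node blocks are assumptions.

```python
def periodic_pass(anomalous_now, flagged_before):
    """Model one run of the periodic check (hypothetical helper).

    anomalous_now:  set of node blocks that currently look neither
                    running nor scheduled.
    flagged_before: set of node blocks that were warned about on the
                    previous run.
    Returns (to_cancel, flagged_after).
    """
    # Seen as anomalous on two consecutive runs: cancel.
    to_cancel = anomalous_now & flagged_before
    # Seen for the first time: warn now, reconsider on the next run.
    flagged_after = anomalous_now - to_cancel
    return to_cancel, flagged_after


if __name__ == "__main__":
    flagged = set()
    # Three runs, five minutes apart in the real system.
    for observed in ({"team/app#12"}, {"team/app#12"}, set()):
        cancelled, flagged = periodic_pass(observed, flagged)
        print("cancel:", sorted(cancelled), "warned:", sorted(flagged))
```

A node block that recovers between two runs drops out of the flagged set, so it is only warned about and never cancelled.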
In the controller logs, the following warning is observed:
WARNING o.j.p.w.s.s.ExecutorStepExecution$AnomalousStatus#lambda$doRun$4: do not know about CpsStepContext[XX:node]:<jobFullName>#<buildNumber>
Since version 1378.v6a_3e903058a_3, this periodic task also interrupts durable task step executions (such as sh / bat steps) that were running on the affected node. This is an attempt to help terminate stuck pipelines that, in a few scenarios, are not successfully interrupted by aborting the node block.
This change can cause pipeline runs that are executing a sh / bat step on a node that was in use by a stuck pipeline to have their execution interrupted. Those "unrelated" pipeline builds would show:
ERROR: also cancelling shell steps running on <agentName> [...] Agent was removed
Due to the single-shot nature of ephemeral agents, step executions inside Cloud agents are not impacted. Rather, static agents and particularly multi-executor agents can be impacted.
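To illustrate why multi-executor static agents are the most exposed, here is a minimal sketch that lists the builds sharing an agent with a stuck build. The data model (a dict of agent name to running builds) and the function name `builds_at_risk` are hypothetical and do not come from Jenkins.

```python
def builds_at_risk(builds_by_agent, stuck_build):
    """Return the builds that share an agent with the stuck build.

    builds_by_agent: dict mapping an agent name to the list of builds
                     currently running on it (hypothetical model).
    """
    at_risk = set()
    for builds in builds_by_agent.values():
        if stuck_build in builds:
            # Every other build on the same agent may have its
            # sh / bat step interrupted.
            at_risk |= set(builds) - {stuck_build}
    return at_risk


if __name__ == "__main__":
    running = {
        "static-agent-1": ["team/stuck#3", "team/other#8"],  # 2 executors
        "cloud-agent-x": ["team/solo#1"],                    # single-shot
    }
    print(sorted(builds_at_risk(running, "team/stuck#3")))
```

In this model, an ephemeral agent that only ever hosts one build never contributes any unrelated build to the result.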
Resolution / Workaround
Until a fix is available, a workaround is to identify the "Anomalous" pipeline builds and abort them manually. Look for the following warning in the controller logs to identify the pipeline that is stuck in that state:
WARNING o.j.p.w.s.s.ExecutorStepExecution$AnomalousStatus#lambda$doRun$4: do not know about CpsStepContext[XX:node]:<jobFullName>#<buildNumber>
Then go to the build page and abort it.
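The log search above can be automated with a small script. The following is a minimal sketch that assumes the warning ends with <jobFullName>#<buildNumber> exactly as shown in this article. Real log lines may format the context differently, so the pattern may need adjusting. The function name `find_stuck_builds` is hypothetical.

```python
import re

# Assumed format, taken from the placeholder line in this article:
#   ... do not know about CpsStepContext[XX:node]:<jobFullName>#<buildNumber>
STUCK_PATTERN = re.compile(
    r"do not know about CpsStepContext\[[^:\]]+:node\]:(?P<job>.+)#(?P<build>\d+)"
)


def find_stuck_builds(log_lines):
    """Return unique (job_full_name, build_number) pairs, in log order."""
    found = []
    for line in log_lines:
        match = STUCK_PATTERN.search(line)
        if match:
            entry = (match.group("job"), int(match.group("build")))
            if entry not in found:
                found.append(entry)
    return found


if __name__ == "__main__":
    sample = [
        "WARNING o.j.p.w.s.s.ExecutorStepExecution$AnomalousStatus#lambda$doRun$4: "
        "do not know about CpsStepContext[7:node]:team/app-build#42",
        "INFO unrelated line",
    ]
    for job, build in find_stuck_builds(sample):
        print(f"stuck build: {job} #{build}")
```

Each pair returned points at a build page to open and abort manually.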