Pipeline run unexpectedly aborted with the message Agent was removed

Issue

  • Pipeline builds are intermittently interrupted with the message also cancelling shell steps running on <agentName>, followed by Agent was removed:

ERROR: also cancelling shell steps running on <agentName>
[...]
Agent was removed

Explanation

A periodic task that runs every 5 minutes looks for pipeline runs stuck within a node block (i.e. ExecutorStepExecution) and aborts them if they are found in that state on 2 consecutive runs, which means they have been stuck for more than 5 minutes. In such cases, the aborted pipeline typically prints the following and then completes:

ERROR: node block appears to be neither running nor scheduled; will cancel if this condition persists
ERROR: node block still appears to be neither running nor scheduled; cancelling
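
The two-pass rule described above can be illustrated with a minimal sketch. This is not the plugin's code: the class name AnomalyTracker, the method run_once, and the use of plain strings as node block identifiers are all hypothetical, chosen only to show why a block must look anomalous on two consecutive passes before it is cancelled.

```python
from typing import Iterable, List, Set, Tuple


class AnomalyTracker:
    """Hypothetical model of a two-pass check: warn first, cancel on the next pass."""

    def __init__(self) -> None:
        # Node blocks that looked anomalous on the previous pass.
        self._previous: Set[str] = set()

    def run_once(self, anomalous_now: Iterable[str]) -> Tuple[List[str], List[str]]:
        """Return (blocks to warn about, blocks to cancel) for this pass."""
        current = set(anomalous_now)
        # Seen on the previous pass and still anomalous: cancel.
        to_cancel = sorted(current & self._previous)
        # Seen for the first time: only warn, and remember for the next pass.
        to_warn = sorted(current - self._previous)
        self._previous = set(to_warn)
        return to_warn, to_cancel


tracker = AnomalyTracker()
first_pass = tracker.run_once(["job-a#1"])
second_pass = tracker.run_once(["job-a#1", "job-b#7"])
```

In this model, a block that recovers between two passes is never cancelled, which matches the "will cancel if this condition persists" wording of the first message.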

In the controller logs, the following warning is observed:

WARNING	o.j.p.w.s.s.ExecutorStepExecution$AnomalousStatus#lambda$doRun$4: do not know about CpsStepContext[XX:node]:<jobFullName>#<buildNumber>

Since version 1378.v6a_3e903058a_3, this periodic task also interrupts durable task step executions (such as sh / bat steps) that were running on the affected node. This is an attempt to help terminate stuck pipelines that, in a few scenarios, are not successfully interrupted by aborting the node block.

This change can cause pipeline runs that are executing a sh / bat step on a node that was in use by a stuck pipeline to have their execution interrupted. Those "unrelated" pipeline builds would show:

ERROR: also cancelling shell steps running on <agentName>
[...]
Agent was removed

Due to the single-shot nature of ephemeral agents, step executions inside Cloud agents are not impacted. Static agents, and multi-executor agents in particular, can be impacted.

Resolution / Workaround

Until a fix is available, a workaround is to identify the anomalous pipeline builds and abort them manually. Look for the following entries in the controller logs to identify the pipeline that is stuck in that state:

WARNING	o.j.p.w.s.s.ExecutorStepExecution$AnomalousStatus#lambda$doRun$4: do not know about CpsStepContext[XX:node]:<jobFullName>#<buildNumber>

Then go to the build page and abort it.
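
When the controller log is large, the search for these warnings can be scripted. The following is a minimal sketch. It assumes each warning ends with the job's full name, a # and the build number, exactly as in the placeholder form shown above. The real rendering of the step context in your logs may differ, so adjust the pattern if it does. The names StuckBuild and find_stuck_builds are hypothetical helpers, not part of any Jenkins tooling.

```python
import re
from typing import Iterable, List, NamedTuple


class StuckBuild(NamedTuple):
    job_full_name: str
    build_number: int


# Assumed line ending: "do not know about CpsStepContext[XX:node]:<jobFullName>#<buildNumber>"
_WARNING = re.compile(
    r"do not know about CpsStepContext\[[^\]]*:node\]:(?P<job>.+)#(?P<build>\d+)\s*$"
)


def find_stuck_builds(lines: Iterable[str]) -> List[StuckBuild]:
    """Return each distinct build named in an AnomalousStatus warning, in first-seen order."""
    seen = {}  # dict keys keep insertion order and drop repeated warnings
    for line in lines:
        if "ExecutorStepExecution$AnomalousStatus" not in line:
            continue
        match = _WARNING.search(line)
        if match:
            seen[StuckBuild(match.group("job"), int(match.group("build")))] = None
    return list(seen)


sample = [
    "WARNING\to.j.p.w.s.s.ExecutorStepExecution$AnomalousStatus#lambda$doRun$4: "
    "do not know about CpsStepContext[12:node]:folder/my-job#42",
    "INFO\tsome unrelated entry",
]
stuck = find_stuck_builds(sample)
```

The sample lines are made up for illustration. In practice the function would be fed the lines of the controller log file, and each returned entry points to a build page to open and abort.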