Issue
Managed controllers fail to provision or take too long to provision. Eventually, the worker cannot provision anything and has to be replaced by a new one.
When these managed controller provisioning issues occur, the syslog contains entries similar to the ones shown below:
com.cloudbees.dac.castle.VolumeDeviceUtils$BadDevices ban FINE: Banned device /dev/sdj
com.cloudbees.dac.castle.EbsBackend lambda$tagAndMountVolume$4 SEVERE: Error attaching volume
And
WARNING: Timed out waiting for volume vol-XXXXXXXXXX. Detaching and banning device /dev/sdj in instance i-XXXXXXXXXX
The device is banned by the Castle service and added to a list that can be found on the affected worker under the path /tmp/castle/baddevices/dev/.
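To check which devices Castle currently considers banned, you can list the contents of that directory on the worker. This is a minimal sketch, assuming you have SSH access to the affected worker instance; the directory is only present once at least one device has been banned:

# Run on the affected worker (hypothetical session); the directory only
# exists after Castle has banned at least one device.
ls -l /tmp/castle/baddevices/dev/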
To rule out Castle wrongly marking the volume as banned, you need to verify that the timeout is not due to a problem with the AWS API call frequency configured in Castle.
To verify this, review the CloudWatch events for the period corresponding to the log messages. If you see a series of consecutive events only seconds apart, like the ones shown below:
DetachVolume 2019-01-03T09:XX:39.000Z - Event ID xxx-yyy-dddd
DetachVolume 2019-01-03T09:XX:38.000Z - Event ID xxx-aaa-dddd
DetachVolume 2019-01-03T09:XX:37.000Z - Event ID xxx-ccc-ffff
Once you have confirmed that you have CloudWatch entries similar to the ones above, the most likely cause of the issue is that the Castle retry properties are not set correctly.
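These DetachVolume API calls are also recorded by AWS CloudTrail, so as an alternative to the console you can query them from a shell. This is only a sketch, assuming the AWS CLI is installed and configured with permissions to read CloudTrail; the time range and region below are placeholders that you should adjust to match your log messages:

# Look up DetachVolume events around the time of the Castle log messages.
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=DetachVolume \
  --start-time "2019-01-03T09:00:00Z" \
  --end-time "2019-01-03T10:00:00Z" \
  --region us-east-1

A burst of DetachVolume events only a second or two apart points to the retry settings described in the Resolution section below.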
Resolution
The banning itself is expected behavior: Castle tries to attach a given volume to the worker, and once the operation times out, the volume/device is marked as banned so that Castle can safely ignore it while provisioning new controllers/applications.
As mentioned in the Issue statement section, the most likely cause of the issue is that the Castle retry properties are not set correctly. These properties can be found in the file .dna/servers/castle/dna.config, or in .dna/project.config under the [castle] section.
Please find below the steps needed to correct the values:

- In your CloudBees Jenkins Enterprise project, edit the DNA properties file .dna/project.config.
- Find the [castle] section and the jvm_options property.
- Add each property that you would like to set to the jvm_options property as a Java system property, i.e. using the -D option. In this particular case, two properties need to be changed:
  - com.cloudbees.dac.castle.util.AWSUtils.retryMaximumTimeSeconds, with a recommended value of 30 seconds.
  - com.cloudbees.dac.castle.util.AWSUtils.retryAttempts, with a recommended value of 20 attempts.
  For example (see also the sketch after these steps):
  jvm_options=-Dcom.cloudbees.dac.castle.util.AWSUtils.retryMaximumTimeSeconds=30 -Dcom.cloudbees.dac.castle.util.AWSUtils.retryAttempts=20
- Force an update of the project by running the following command from your Bastion host console: cje upgrade --config-only --force
- Finally, reinitialize Castle by running: dna reinit castle
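For reference, this is roughly what the [castle] section of .dna/project.config could look like after the change. This is only a sketch: the placeholder comment stands in for whatever other properties your project already defines, and if jvm_options already has a value in your project, append the two -D options to the existing value rather than replacing it:

[castle]
# existing properties for your environment stay as they are (placeholder comment)
jvm_options=-Dcom.cloudbees.dac.castle.util.AWSUtils.retryMaximumTimeSeconds=30 -Dcom.cloudbees.dac.castle.util.AWSUtils.retryAttempts=20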
Workaround
If you don’t want to follow the steps above, there is a temporary workaround available that can help you overcome this kind of provisioning issue for a given worker:
- Add a worker as described in How to extend the size of a worker.
- Stop the applications running on the unhealthy worker.
- Remove the unhealthy worker as described in the document above.
- Purge the deleted worker as described in How to purge deleted workers.