Issue
The cluster is up and working, but the Marathon UI shows the castle service as continuously deploying and marked as unhealthy. This happens for every castle process in the environment.
Environment
- CloudBees Jenkins Enterprise Anywhere
Resolution
This usually points to a communication problem between the Marathon service running on the controllers and the castle processes running on the controller workers. To confirm this, look in the controller syslog for messages like the following:
marathon[XXX]: [2019-x-xx:xx] INFO Received health result for app [/jce/castle] version [2019-x-xx:xx]: [Unhealthy(jce_castle.xxx-x-xx-xx-x,2019-x-xx:xx,AskTimeoutException: Ask timed out on [Actor[akka://marathon/user/IO-HTTP#xx]] after [20000 ms],2019-x-xx:xx)] (mesosphere.marathon.health.HealthCheckActor:marathon-akka.actor.default-dispatcher-34)
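To pick these messages out of a large syslog, a small filter can help. The following is a minimal sketch in Python, assuming the log lines follow the format shown above; the function name `find_unhealthy_castle_lines` is hypothetical and not part of any CloudBees or Marathon tooling.

```python
import re

# Matches Marathon health-result lines that report an Unhealthy castle task.
# The pattern is derived from the sample message in this article (assumption:
# your log lines use the same wording and app id, /jce/castle).
UNHEALTHY_CASTLE = re.compile(
    r"Received health result for app \[/jce/castle\].*\[Unhealthy\("
)


def find_unhealthy_castle_lines(lines):
    """Return the log lines that report a failed castle health check."""
    return [line.rstrip("\n") for line in lines if UNHEALTHY_CASTLE.search(line)]
```

The function accepts any iterable of strings, so an open file handle for the controller syslog can be passed in directly. The location of that file depends on the operating system and is not specified here.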
The health check is not successful, which causes Marathon to kill the container hosting the castle service and schedule a new one.
To resolve this issue, ensure that the ports required by the health check are open, so that the controllers can reach the workers on port 31080, the port on which the castle service is accessible.
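One way to verify connectivity is to attempt a TCP connection from a controller to each worker on port 31080. The following is a minimal sketch in Python using only the standard library; the helper name `can_reach` and the 5-second timeout are assumptions chosen for illustration.

```python
import socket

CASTLE_PORT = 31080  # port named in this article for the castle service


def can_reach(host, port=CASTLE_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Run from a controller, `can_reach("<worker-address>")` returning `False` suggests that a firewall or security rule is blocking the port. A successful TCP connection only shows that the port is reachable; it does not by itself prove that the health check will pass.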
Tested product/plugin versions
- CloudBees Jenkins Enterprise 1.11.X Anywhere