Description
The eMake process did not notify the Cluster Manager within the heartbeat timeout period. eMake stops at that point.
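As an illustration only, a heartbeat timeout check of this kind can be modeled as follows. This is a minimal sketch, not the Cluster Manager's actual implementation; the function name `heartbeat_timed_out` and its parameters are hypothetical. The only value taken from this article is the 3-minute default.

```python
import time

# Default of 3 minutes, as stated in this article, expressed in seconds.
DEFAULT_TIMEOUT_SECONDS = 3 * 60


def heartbeat_timed_out(last_heartbeat, now, timeout=DEFAULT_TIMEOUT_SECONDS):
    """Return True when more than `timeout` seconds passed since the last heartbeat.

    Hypothetical helper: both timestamps are plain numbers of seconds
    taken from the same clock.
    """
    return (now - last_heartbeat) > timeout


if __name__ == "__main__":
    last_seen = time.monotonic()  # pretend a heartbeat just arrived
    # Immediately afterwards the build is still considered alive.
    print(heartbeat_timed_out(last_seen, time.monotonic()))
```

Anything that delays the sender (a slow thread, a network outage) widens the gap between `last_heartbeat` and `now` in this model, which is why the reasons listed below all lead to the same symptom.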
Reasons
There are no heartbeats from Electric Make (eMake) to the Cluster Manager. Possible reasons:
-
Network problems, where either the Cluster Manager or the eMake machine was unable to see the other machine.
-
The eMake heartbeat thread (which is the AgentManager thread) is excessively slow. One possible cause is excessive logging, which can stretch the interval between heartbeats beyond the default timeout of 3 minutes.
-
The Cluster Manager can run out of file descriptors when the system-level file descriptor limit is set very low (the default has since been increased). It can also run out when the LSF server stalls the LSF request queue, causing the Cluster Manager to accumulate more and more requests (fixed in an earlier release).
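To check whether a low file descriptor limit might be involved, the limit can be inspected with Python's standard `resource` module on Unix-like systems. This is a minimal sketch: the helper name `is_fd_limit_low` and the threshold of 4096 are assumptions chosen for illustration (the article only says the limit should be "in the many thousands"). The script reports the limit of the process that runs it, so it is only indicative if run as the same user and in the same environment as the Cluster Manager.

```python
import resource

# Hypothetical threshold; pick a value that suits your installation.
ASSUMED_MIN_FDS = 4096


def is_fd_limit_low(soft_limit, threshold=ASSUMED_MIN_FDS):
    """Return True when a finite soft limit is below the threshold."""
    if soft_limit == resource.RLIM_INFINITY:
        return False  # unlimited can never be "too low"
    return soft_limit < threshold


if __name__ == "__main__":
    # Soft and hard limits on open file descriptors for this process.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"soft limit: {soft}, hard limit: {hard}")
    if is_fd_limit_low(soft):
        print("The soft limit looks low; consider raising it.")
```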
Fixes
-
Increase the limit on open file descriptors (it should be in the many thousands). The procedure varies by operating system.
-
Increase the Cluster Manager’s timeout duration (default is 3 minutes). Be careful not to make the timeout too long because potentially dead builds may continue to hold on to agents.
Follow these steps:
1. Go to the <ECloud install>/<arch>/conf/ directory on the Cluster Manager and open the accelerator.properties file for editing.
2. Increase the value of
EMAKE_HEARTBEAT_TIMEOUT=
3. Save the accelerator.properties file.
4. Restart the Cluster Manager.
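If the edit needs to be scripted, the property change can be sketched as below. This is a hedged example: it assumes accelerator.properties uses plain `key=value` lines with `#` comments, the helper name `set_property` is hypothetical, and the numeric values are placeholders (this article does not state the unit of the setting, so check your existing value before choosing a new one). The script only rewrites text; saving the file and restarting the Cluster Manager remain manual steps.

```python
def set_property(text, key, value):
    """Return `text` with `key` set to `value`.

    Existing, non-commented `key=...` lines are replaced; if the key is
    absent, a new line is appended. All other lines are kept unchanged.
    """
    out = []
    replaced = False
    for line in text.splitlines():
        stripped = line.lstrip()
        is_comment = stripped.startswith("#")
        name = stripped.split("=", 1)[0].strip()
        if not is_comment and "=" in stripped and name == key:
            out.append(f"{key}={value}")
            replaced = True
        else:
            out.append(line)
    if not replaced:
        out.append(f"{key}={value}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Placeholder content and values, for illustration only.
    sample = "SOME_OTHER_SETTING=1\nEMAKE_HEARTBEAT_TIMEOUT=10\n"
    print(set_property(sample, "EMAKE_HEARTBEAT_TIMEOUT", "20"), end="")
```

Keeping the function text-in, text-out makes it easy to try on a copy of the file first and to keep a backup of the original.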