Summary
Your build stalls or has jobs that time out, and the build breaks as a result. The cause is a popup displayed on a cluster host, which blocks the build. Usually, the popup is invisible or unreadable.
Solution
The simplest case is that something goes wrong and a program displays a dialog box. You would see it easily on your own console, but ElectricAccelerator does not catch dialog boxes and propagate them back to the user. The usual symptom is that commands time out on the cluster host.
To diagnose:
- Turn off all hosts in the cluster but one.
- Connect to that host’s console. If it runs Windows XP, look at the output sent to the monitor; if it runs Windows 2003, connect over Remote Desktop using "rdesktop -0" (from Linux) or "mstsc /console" (from Windows).
- Run the build. Any popups appear on that console, where you can read them and determine what is going wrong.
Normally, because the agent is a service run as a named user, all of these commands would run under their own invisible window stations and desktops. This would make the above problem very difficult to diagnose, so ElectricAccelerator always points everything it starts at WinSta0\Default.
This can lead to another problem: jobs failing with status -1073741502, which is 0xC0000142, STATUS_DLL_INIT_FAILED. This is a symptom of starting a program that loads user32.dll under an agent that does not have access to the system console. Even if it only runs silently, user32 still wants to talk to the console as it loads. This is why all agents must run as a user with Administrator privileges. ecagentsvc.exe adds the agent user to the ACL for the system console when it starts, but every time someone logs in or out, the console is destroyed and recreated, and any permission changes made by the agent are lost. The console always grants access to Administrators, so running as an Administrator avoids the problem.
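The relationship between the two numbers above is simply a signed versus unsigned reading of the same 32-bit value. The following Python sketch shows the conversion; the function names and the small lookup table are hypothetical helpers for illustration and are not part of ElectricAccelerator.

```python
# Minimal sketch: reinterpret a signed 32-bit job status as an NTSTATUS code.
# The helper names and the lookup table are illustrative assumptions.

KNOWN_STATUSES = {
    # The only entry taken from this article.
    0xC0000142: "STATUS_DLL_INIT_FAILED",
}


def to_ntstatus(exit_code: int) -> int:
    """Return the unsigned 32-bit form of a (possibly negative) exit code."""
    return exit_code & 0xFFFFFFFF


def describe_exit_code(exit_code: int) -> str:
    """Format an exit code as hex, with a name when it is in the table."""
    status = to_ntstatus(exit_code)
    name = KNOWN_STATUSES.get(status, "unknown")
    return f"0x{status:08X} ({name})"


if __name__ == "__main__":
    print(describe_exit_code(-1073741502))  # 0xC0000142 (STATUS_DLL_INIT_FAILED)
```

Converting the status to hexadecimal this way makes it easier to look up in the Windows NTSTATUS listings when a job fails with a large negative number.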
If you can see the application's icon but cannot open its window, run Spy++. This tool lets you inspect any window structure on the machine, which can help you extract error messages from alert boxes and similar dialogs.
After you know the error, investigate what may cause it. If the application usually works, especially locally, a common cause is that specific .NET assemblies are missing on the agent machine. Make sure that the application in question can run standalone on the agent. The next thing to try is to run the agent as the build user from the emake machine. That should rule out other potential ACL issues.