KBEA-00031 - Configuring the Windows stalled job killer

Article ID: 360032828352

Summary

You see messages like the following and need to adjust the timeout that an agent uses to detect stalled commands:

Command "XXX" was not making progress so it was automatically aborted.

Solution

Run the following commands (replace <cm> with the name of your cluster manager):

cmtool --cm=<cm> login
cmtool --cm=<cm> runAgentCmd "agentexec timeout {{.* 120000 {disk}}}"
cmtool --cm=<cm> runAgentCmd "agentexec timeout {{.* 120000 {cpu disk}}}"

To change the settings permanently, create (or edit) c:\ECloud\i686_win32\bin\runagent.local on each agent and add the following text, adjusting the timeout as desired (this example uses 7 minutes, which is 420000 ms):

set commandTimeout\
 {\
 { {.*bin[/\\](ba)?sh.exe.*} 420000 {disk} }\
 { {.*} 420000 {cpu disk} }\
 }

After making this change, rebooting the machine is recommended.
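
The timeout values in runagent.local are given in milliseconds, so a desired timeout in minutes has to be converted first. The following is a minimal Python sketch that does the conversion and prints a block in the same shape as the example above. The function names minutes_to_ms and render_command_timeout are hypothetical helpers for illustration and are not part of the product; the regular expressions and attributes are copied from the example.

```python
def minutes_to_ms(minutes: float) -> int:
    """Convert a timeout in minutes to milliseconds."""
    return int(round(minutes * 60 * 1000))


def render_command_timeout(minutes: float) -> str:
    """Render a commandTimeout block shaped like the article's example.

    Hypothetical helper: it only formats text; it does not talk to an agent.
    """
    ms = minutes_to_ms(minutes)
    lines = [
        "set commandTimeout\\",
        " {\\",
        # Rule for sh.exe / bash.exe: only disk activity is checked.
        " { {.*bin[/\\\\](ba)?sh.exe.*} %d {disk} }\\" % ms,
        # Catch-all rule: both CPU and disk activity are checked.
        " { {.*} %d {cpu disk} }\\" % ms,
        " }",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Print the block for a 7-minute timeout.
    print(render_command_timeout(7))
```

The printed text can be pasted into runagent.local; review it against the example above before deploying it to an agent.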

To see the current timeout setting, run the following commands (replace <cm> with the name of your cluster manager):

cmtool --cm=<cm> login
cmtool --cm=<cm> runAgentCmd "agentexec timeout"

NOTE:

  • runagent.local is located in the same location for both 32-bit and 64-bit systems.

  • The default timeout on a Windows agent is 1 minute (60000 ms); that is, a job on the agent is timed out after 1 minute without any CPU or disk activity.

  • The timeout for a particular process can be configured by adding a tuple that matches that process name, for example "sh.exe" or "bash.exe". The format of each tuple is "regexp milliseconds attributes", where "regexp" is a regular expression matched against the process name, "milliseconds" is the number of milliseconds after which a timeout is declared, and "attributes" lists what to check for activity: some combination of "disk" and "cpu".
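
The tuple format described in the notes above can be modelled with a short Python sketch. This is only an illustration of how "regexp milliseconds attributes" tuples might be read, not the agent's actual implementation. It makes two assumptions: that the first tuple whose regexp matches the process name is the one applied (the article does not state the matching order), and that Python's re module is an acceptable stand-in for the agent's own regular expression engine. All names (TimeoutRule, find_rule, is_stalled) are hypothetical.

```python
import re
from typing import Dict, List, NamedTuple, Optional, Set


class TimeoutRule(NamedTuple):
    regexp: str            # matched against the process name
    milliseconds: int      # idle time after which a timeout is declared
    attributes: Set[str]   # activity to check: "cpu", "disk", or both


# The two tuples from the runagent.local example (7 minutes = 420000 ms).
RULES: List[TimeoutRule] = [
    TimeoutRule(r".*bin[/\\](ba)?sh.exe.*", 420000, {"disk"}),
    TimeoutRule(r".*", 420000, {"cpu", "disk"}),
]


def find_rule(process_name: str, rules: List[TimeoutRule]) -> Optional[TimeoutRule]:
    """Return the first rule whose regexp matches (ASSUMED first-match-wins)."""
    for rule in rules:
        if re.search(rule.regexp, process_name):
            return rule
    return None


def is_stalled(rule: TimeoutRule, idle_ms: Dict[str, int]) -> bool:
    """A process counts as stalled when every checked attribute has been
    idle for at least the rule's timeout."""
    return all(idle_ms[attr] >= rule.milliseconds for attr in rule.attributes)


if __name__ == "__main__":
    rule = find_rule(r"C:\cygwin\bin\bash.exe", RULES)
    print(rule)
    # CPU idle time is ignored here because the matched rule checks only disk.
    print(is_stalled(rule, {"cpu": 0, "disk": 500000}))
```

Under these assumptions, a bash.exe process is judged on disk activity alone, while any other process must show neither CPU nor disk activity for the full timeout before it is considered stalled.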

Applies to

  • Product versions: All

  • OS versions: Windows
