Troubleshooting Condor Job Problems

General suggestions

  • Make sure you're giving the full path to the executable in your submit file (unless the executable is in the same directory you run condor_submit from).
  • Make sure the directory your logfile is in exists.
  • Make sure your input and output files are correct.
  • If you're running a script as your executable, make sure its execute bit is set and the shebang line is correct. Try running it from the command line to make sure it works.
  • Check the job's logfile for useful error messages.
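The script checks above can be run by hand before submitting. A minimal sketch (the script name my_job.sh is hypothetical):

```shell
# Create a trivial stand-in script to illustrate the checks
cat > my_job.sh <<'EOF'
#!/bin/sh
echo "hello from my_job"
EOF

chmod +x my_job.sh            # the execute bit must be set
head -n 1 my_job.sh           # first line should be the shebang, e.g. #!/bin/sh
./my_job.sh                   # run it by hand before handing it to Condor
```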

Guidelines for specific situations

Job immediately goes into the Held state

Often this means Condor is having trouble executing the job. Try using condor_q -long and examining the HoldReason attribute. For example: condor_q -global -long 13281 | grep 'HoldReason'.

One particularly common error that some users may find puzzling is "Exec format error." This usually means you've forgotten to include the "shebang" line in a script. While a shebang line is not necessarily required when running a script interactively, Condor needs to see one so it knows what shell or interpreter to use to run the script.
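Since Condor executes the script directly rather than through your interactive shell, the first line must name the interpreter. Typical shebang lines (pick the one matching your script's language) look like:

```
#!/bin/sh
#!/bin/bash
#!/usr/bin/env python3
#!/usr/bin/env perl
```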

condor_submit fails with "no such directory"

Sometimes, usually when working on group projects outside your home directory, condor_submit will fail with an error like

ERROR: No such directory: /projects/foo/bar/biz

This happens when the directory is not owned by you, regardless of whether you have access. (This may be a bug in condor_submit.) If you encounter this, either move the submit script and log file to a directory you own, or contact linghelp@uw to have the directory chown'd to you.
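A quick way to check who owns a directory is stat. A runnable sketch (mktemp -d stands in for the project directory, purely so the example is self-contained; on a real cluster you would point stat at the directory from the error message):

```shell
# Stand-in for the project directory; mktemp -d creates one owned by you
dir=$(mktemp -d)
owner=$(stat -c '%U' "$dir")   # GNU stat: print the owning user
if [ "$owner" = "$(whoami)" ]; then
    echo "you own $dir"
else
    echo "owned by $owner; move your submit files or ask for a chown"
fi
```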

Job runs for a while, then bogs down or gets killed

This often means the job has exceeded its memory request without Condor noticing, and has gotten so large that the machine it's running on has run out of RAM. Check the SIZE column of condor_q and compare to what you've specified on your request_memory line. (The default if you don't specify is 1024 MB.) See BigMemoryCondor and the section below for more information.

Job runs for a while, then goes idle (or gets evicted)

This is usually because your job is consuming more memory than you requested for it on the request_memory line of your submit description file. You can verify this by looking at the SIZE column of condor_q and comparing it to your memory request. (If you didn't include a request_memory line, the default is 1024 MB.) See BigMemoryCondor for more information on request_memory and how to use it.
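For reference, the relevant submit-description lines might look like this (the executable name and the 4 GB figure are hypothetical; size your own request to your job):

```
# Hypothetical submit description fragment
executable     = run_model.sh
request_memory = 4096          # in MB, i.e. 4 GB
queue
```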

If you don't want to have to re-submit the job, you can use condor_qedit to change its memory requirements on the fly. The format is condor_qedit <jobid> RequestMemory <memory in MB>. For example, to request 5 GB of RAM:

condor_qedit 123456 RequestMemory "5*1024"

The quote marks prevent the shell from treating the asterisk as a filename wildcard (glob) and expanding it before condor_qedit ever sees it.
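The glob hazard is easy to demonstrate. In a directory containing a file that matches the pattern 5*1024, the unquoted form silently expands to that filename (the file name 5ABC1024 below is invented for the demonstration):

```shell
# Work in a fresh temp directory so the globbing behavior is predictable
cd "$(mktemp -d)"
touch 5ABC1024                 # a file that the pattern 5*1024 matches
echo 5*1024                    # unquoted: the shell expands the glob -> 5ABC1024
echo "5*1024"                  # quoted: passed through literally -> 5*1024
```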

Job sometimes works and sometimes fails

Sometimes this can be caused by problems with a particular node -- either a misconfiguration, or a temporary problem such as memory pressure from jobs with incorrect memory specifications. Check your job log file and see whether the IP address on the "Job executing on host:" line is the same for all the failed jobs. If so, email linghelp so the situation can be fixed. As a temporary workaround, you can avoid the problematic node by excluding it in your requirements, e.g.:

Requirements = ( Machine != "" )
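With the machine name filled in (the hostname node5.cluster.example here is purely hypothetical; use the name of the node you identified from the log), the line would look like:

```
Requirements = ( Machine != "node5.cluster.example" )
```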
Topic revision: r14 - 2016-05-03 - 22:12:47 - brodbd
