Running Condor jobs with large memory requirements

Normally Condor assigns one job to each CPU on a node, dividing up the memory equally. On most of our current systems this results in 2 GB of RAM per slot. Ideally, you should structure your jobs to stay within this amount of memory; this uses the cluster efficiently.

If your job grows too large, one of two things will happen.

  • If it exceeds 2 GB, Condor may evict it, causing it to return to the queue and stay in the idle ("I") state. This is more likely to happen when there are a lot of other jobs queued.
  • If the combination of jobs on a machine exceeds the amount of available RAM and swap, the kernel's out-of-memory (OOM) killer will kill processes until memory becomes available.
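If you're not sure whether a job fits in a 2 GB slot, one option is to run it once on a test machine and look at its peak memory use. The following is only a rough sketch, not a site-provided tool: the function names are made up for illustration, it assumes Linux (where ru_maxrss is reported in kilobytes), and peak resident memory is only an approximation of whatever Condor itself measures when deciding to evict a job.

```python
import resource
import subprocess

SLOT_MB = 2048  # per-slot memory on most current systems (2 GB)


def kb_to_mb(kilobytes):
    """Convert kilobytes to megabytes."""
    return kilobytes / 1024


def fits_in_slot(peak_mb, slot_mb=SLOT_MB):
    """Return True if a peak of peak_mb megabytes stays within one slot."""
    return peak_mb <= slot_mb


def measure_peak_mb(command):
    """Run command (a list of arguments) to completion and return the
    peak resident set size, in MB, of the largest child process that
    this Python process has waited for so far.

    Assumption: Linux, where ru_maxrss is in kilobytes.
    """
    subprocess.run(command, check=True)
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return kb_to_mb(peak_kb)
```

For example, fits_in_slot(measure_peak_mb(["./hugejob"])) would run a (hypothetical) local copy of the job and report whether its peak stayed under 2048 MB. Leave yourself some headroom, since memory use often varies with the input.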

Running jobs larger than 2 GB

If you have jobs that consume more than 2 GB of memory, you can tell Condor to claim an entire machine instead of one slot, so all of the system's memory is available to your job. To do this, add

+RequiresWholeMachine = True

to your submit file. (Note the plus sign, which is required. Also note that this attribute is a custom one for our site and may not be available on other Condor clusters.) You will also want to tell Condor not to check your job's memory use, so it won't be evicted when it grows larger than 2 GB. This is easily done by adding your own memory constraint to your job's submit file; for example:

Requirements = (Memory > 0)

Finally, you may want to specify a minimum amount of total memory for the machine. This can be done by adding a TotalMemory requirement. (Both TotalMemory and Memory are measured in megabytes. Memory is the memory available per slot, while TotalMemory is the total amount of memory for the whole machine.)

Here's an example submit script for an executable called hugejob, which requires at least 7 GB of memory to run:

universe = vanilla
executable = hugejob
getenv = true
input = hugejob.in
output = hugejob.out
error = hugejob.err
log = hugejob.log
+RequiresWholeMachine = True
Requirements = ( Memory > 0 && TotalMemory >= (7*1024) )

Note: Be careful about being too specific with TotalMemory constraints. For various reasons (memory consumed by the OS, etc.) a machine reports somewhat less memory than its nominal size, so a TotalMemory constraint will probably be stricter than you expect. For example, our 4 GB nodes actually report their total memory as 3950 MB, so a constraint of (TotalMemory >= (4*1024)), which asks for 4096 MB, will exclude them.
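The arithmetic behind that note can be checked with a few lines of Python. This is just an illustration: the function names are hypothetical, and the 5% margin is an arbitrary example value, not a site recommendation. The only real figure here is the 3950 MB reported by the 4 GB nodes.

```python
def excluded_by_constraint(reported_total_mb, required_gb):
    """Return True if (TotalMemory >= required_gb*1024) rejects a machine
    that reports reported_total_mb megabytes of total memory."""
    return reported_total_mb < required_gb * 1024


def padded_threshold_mb(nominal_gb, headroom_fraction=0.05):
    """Return a TotalMemory threshold in MB slightly below the nominal
    size, leaving room for memory the machine does not report.

    headroom_fraction is an illustrative assumption (5% by default).
    """
    return int(nominal_gb * 1024 * (1 - headroom_fraction))


# A 4 GB node reporting 3950 MB fails a strict 4*1024 = 4096 MB test...
print(excluded_by_constraint(3950, 4))   # True
# ...but passes a threshold padded by 5%.
print(padded_threshold_mb(4))            # 3891
```

With these numbers, a constraint of (TotalMemory >= 3891) would still match the 4 GB nodes while excluding anything noticeably smaller.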

Interaction with other jobs

Jobs with +RequiresWholeMachine set are scheduled according to these rules:

  1. RequiresWholeMachine jobs will only start on Slot 1. Once the job is running, other slots will be marked as having "Owner" status, to prevent single-slot jobs from running in them and consuming memory. If no machines that match the job's requirements have Slot 1 available, the job will remain idle in the queue until Slot 1 opens up on a machine.
  2. If a machine with no slots taken that matches the job's requirements is available, the job will start there.
  3. If no machines are completely free, but an otherwise occupied machine has Slot 1 open, the job will start there and immediately go into the "Suspended" state. It will remain suspended until all the single-slot jobs on the machine complete, and then it will continue. (This more or less causes the job to claim "dibs" on a slot that might otherwise go to a single-slot job.) If the job remains suspended for at least two hours without running, it will become eligible for preemption and may return to the queue to wait for a new slot assignment. You can force this to happen at any time with the condor_vacate command.
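The three rules above can be summed up as a small decision function. This is a simplified model for illustration only, not the actual Condor configuration: the names are invented, every machine in the list is assumed to already match the job's requirements, and the two-hour preemption timer is left out.

```python
from enum import Enum


class Outcome(Enum):
    RUNNING = "running"
    SUSPENDED = "suspended"
    IDLE = "idle"


def place_whole_machine_job(machines):
    """Decide what happens to a RequiresWholeMachine job.

    machines is a list of machines; each machine is a list of booleans,
    one per slot, where True means the slot is busy. Index 0 is Slot 1.
    Returns (Outcome, machine_index), with machine_index None when the
    job stays in the queue.
    """
    # Rule 2: a completely free machine is used first, and the job runs.
    for index, slots in enumerate(machines):
        if not any(slots):
            return Outcome.RUNNING, index
    # Rule 3: otherwise take an open Slot 1 on a busy machine and wait,
    # suspended, for the single-slot jobs there to finish.
    for index, slots in enumerate(machines):
        if not slots[0]:
            return Outcome.SUSPENDED, index
    # Rule 1: no Slot 1 is open anywhere, so the job stays idle.
    return Outcome.IDLE, None
```

For instance, with two dual-slot machines described as [[True, False], [False, True]], no machine is completely free, but the second has Slot 1 open, so the job would land there in the suspended state.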

I'm still tweaking these rules, so if you see any pathological behavior, or have an idea for a way to allocate slots more fairly, email linghelp@u and let me know.

-- brodbd - 09 Apr 2009

Topic revision: r7 - 2009-11-24 - 17:46:32 - brodbd
 
