RAL Memory Limits

From GridPP Wiki

Current situation

All jobs on the RAL batch system run in distinct cgroups, using the following cgroup subsystems: cpu, cpuacct, memory, freezer and blkio. On most worker nodes no memory limits are applied through cgroups; instead, memory limits are enforced by HTCondor. If a job's resident set size exceeds its requested memory, the job is killed. The resident set size is checked periodically rather than continuously, so a job may exceed its requested memory for some time, possibly up to around 20 minutes, before it is killed.
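The effect of periodic checking can be illustrated with a small model. This is only a sketch under stated assumptions, not HTCondor's actual implementation: the function name `first_kill_time`, the one-sample-per-time-step representation and the `check_interval` parameter are all hypothetical, chosen to show why a job can sit above its request between checks, or even escape detection if its usage drops again before the next check.

```python
def first_kill_time(rss_samples_mb, request_mb, check_interval):
    """Return the time step of the first periodic check at which the
    resident set size exceeds the request, or None if no check catches it.

    rss_samples_mb: RSS of the job at each time step (hypothetical data).
    request_mb:     memory requested by the job.
    check_interval: the RSS is only inspected every check_interval steps.
    """
    for t, rss in enumerate(rss_samples_mb):
        is_check_step = (t % check_interval == 0)
        if is_check_step and rss > request_mb:
            return t
    return None


# The job goes over its 2000 MB request at step 2, but with a check
# every 5 steps it is only caught at step 5.
steady_overuse = [100, 100, 3000, 3000, 3000, 3000]
caught_at = first_kill_time(steady_overuse, request_mb=2000, check_interval=5)

# A short spike between two checks is never seen at all.
short_spike = [100, 3000, 3000, 100, 100, 100]
missed = first_kill_time(short_spike, request_mb=2000, check_interval=5)
```

In this model the delay between exceeding the request and being killed is bounded by the check interval, which matches the behaviour described above only qualitatively.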

Current testing

On one tranche of worker nodes, corresponding to around 2000 cores, we have soft memory limits applied via cgroups. In this case the cgroup attribute memory.soft_limit_in_bytes for each job is set to the amount of memory requested by the job. Jobs are allowed to exceed this limit if there is free memory available on the system. Only when processes are contending for physical memory will the kernel move pages out to swap and push each job's physical memory usage back towards its assigned limit.

In addition, for the htcondor cgroup we have memory.limit_in_bytes set to the physical memory available on the worker node, and memory.memsw.limit_in_bytes set to the sum of the physical memory and 20% of the swap (by default our worker nodes have the same amount of swap as physical memory). This limits the total amount of memory and swap used by all jobs on each worker node.
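The arithmetic behind these settings can be sketched as follows. The function names and the `swap_percent` parameter are hypothetical, and the assumption that the job's memory request is expressed in MiB is ours rather than something stated on this page; only the cgroup attribute names and the "physical memory plus 20% of swap" rule come from the description above.

```python
MIB = 1024 * 1024


def job_soft_limit_bytes(request_mib):
    """Value for a job's memory.soft_limit_in_bytes, assuming the
    request is given in MiB (an assumption, see the text above)."""
    return request_mib * MIB


def htcondor_cgroup_limits(physical_bytes, swap_bytes, swap_percent=20):
    """Hard limits for the parent htcondor cgroup on one worker node.

    memory.limit_in_bytes       = physical memory
    memory.memsw.limit_in_bytes = physical memory + swap_percent% of swap
    Integer arithmetic is used so the result is an exact byte count.
    """
    return {
        "memory.limit_in_bytes": physical_bytes,
        "memory.memsw.limit_in_bytes":
            physical_bytes + swap_bytes * swap_percent // 100,
    }


# Hypothetical node with 64 GiB of RAM and the same amount of swap.
ram = 64 * 1024 * MIB
limits = htcondor_cgroup_limits(physical_bytes=ram, swap_bytes=ram)
```

With swap equal to physical memory, the combined memory and swap limit works out to 1.2 times the physical memory of the node.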

In theory, the condor_starter for each job registers with the cgroup memory controller to be notified when the per-cgroup OOM killer fires, so HTCondor should know when a job has been killed by the OOM killer. Unfortunately this does not currently work with the kernel that we're running at RAL, so jobs which have been killed by the OOM killer are reported only as "killed by signal 9".
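Without the OOM notification, all that a parent process sees is the exit status of the job, which records the signal number but not who sent it. The sketch below illustrates this with Python's standard subprocess module on a Unix-like system; it is not how condor_starter is implemented, and the helper names `describe_exit` and `run_and_kill` are hypothetical.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short message.
    On POSIX a negative return code -N means 'terminated by signal N'."""
    if returncode < 0:
        return f"killed by signal {-returncode}"
    return f"exited with status {returncode}"


def run_and_kill():
    """Start a sleeping child, send it SIGKILL and return its return code.
    From the parent's side this looks the same as a kill by the kernel
    OOM killer, which also delivers SIGKILL."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.send_signal(signal.SIGKILL)
    child.wait()
    return child.returncode


if __name__ == "__main__":
    print(describe_exit(run_and_kill()))
```

Because the exit status alone carries no more than the signal number, some separate channel such as the cgroup OOM notification is needed to tell an out-of-memory kill apart from any other SIGKILL.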