RAL Memory Limits

From GridPP Wiki

Current situation

All jobs on the RAL batch system run in their own cgroups, using the following cgroup subsystems: cpu, cpuacct, memory, freezer and blkio. On most worker nodes no memory limits are applied via cgroups; instead, memory limits are enforced by HTCondor. If a job's resident set size (RSS) exceeds its requested memory, the job is killed. The RSS is checked periodically rather than continuously, so a job may exceed its requested memory for some time, possibly up to around 20 minutes, before it is killed.
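The effect of checking the resident set size only periodically can be illustrated with a small simulation. This is a minimal sketch and not HTCondor code: the function name, the sample format, the step-function treatment of RSS and the choice of check interval are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple


def first_kill_time(
    rss_samples: Sequence[Tuple[int, int]],
    requested_mb: int,
    check_interval_s: int,
) -> Optional[int]:
    """Return the time in seconds of the first periodic check that sees
    RSS above the requested memory, or None if no check ever does.

    rss_samples holds (time_s, rss_mb) pairs sorted by time. RSS is
    assumed to keep its value until the next sample (a step function).
    """
    if check_interval_s <= 0:
        raise ValueError("check_interval_s must be positive")
    if not rss_samples:
        return None

    end = rss_samples[-1][0]
    t = check_interval_s  # hypothetical: first check one interval after start
    while t <= end:
        # RSS at time t is the most recent sample taken at or before t.
        current = None
        for sample_time, rss_mb in rss_samples:
            if sample_time > t:
                break
            current = rss_mb
        if current is not None and current > requested_mb:
            return t
        t += check_interval_s
    return None


# A job requesting 2000 MB that grows to 3000 MB after 100 seconds.
growing_job = [(0, 1000), (100, 3000), (2000, 3000)]
print(first_kill_time(growing_job, requested_mb=2000, check_interval_s=1200))
```

With a check interval of 1200 seconds (chosen here only to match the "around 20 minutes" figure), the job in this example runs over its request from 100 seconds until the check at 1200 seconds. A short spike that falls back below the request before the next check would go unnoticed in this model.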

Current testing

On one tranche of worker nodes, corresponding to around 2000 cores, soft memory limits are applied via cgroups. On these nodes the cgroup attribute memory.soft_limit_in_bytes for each job is set to the amount of memory requested by the job. Jobs are allowed to exceed this limit while free memory is available on the system. Only when processes contend for physical memory does the system force physical memory into swap and push each job's physical memory use towards its assigned limit.

In addition, for the htcondor cgroup, memory.limit_in_bytes is set to the physical memory available on the worker node, and memory.memsw.limit_in_bytes is set to the physical memory plus 20% of the swap (by default our worker nodes have the same amount of swap as physical memory). Together these limit the total amount of memory and swap used by all jobs on each worker node.
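The arithmetic behind these settings can be sketched as follows. This is an illustrative calculation only, not the configuration used at RAL: the function and class names are hypothetical, and the job request is assumed to be expressed in MiB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeLimits:
    """Values for the htcondor cgroup on one worker node, in bytes."""
    limit_in_bytes: int        # memory.limit_in_bytes
    memsw_limit_in_bytes: int  # memory.memsw.limit_in_bytes


def node_limits(physical_bytes: int, swap_bytes: int,
                swap_percent: int = 20) -> NodeLimits:
    """Physical memory as the hard limit, and physical memory plus a
    percentage of swap (20% in the text) as the memory+swap limit."""
    swap_share = swap_bytes * swap_percent // 100  # integer bytes, rounded down
    return NodeLimits(
        limit_in_bytes=physical_bytes,
        memsw_limit_in_bytes=physical_bytes + swap_share,
    )


def job_soft_limit_bytes(requested_mib: int) -> int:
    """Per-job memory.soft_limit_in_bytes, assuming the request is in MiB."""
    return requested_mib * 1024 * 1024


GIB = 1024 ** 3
# Hypothetical node with 64 GiB of RAM and the same amount of swap.
limits = node_limits(physical_bytes=64 * GIB, swap_bytes=64 * GIB)
print(limits.limit_in_bytes, limits.memsw_limit_in_bytes)
print(job_soft_limit_bytes(2000))
```

For a node whose swap equals its physical memory, the memory+swap limit works out to 1.2 times the physical memory, so all jobs together can use at most 20% of the node's swap.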