RAL Memory Limits

== Current situation ==
 
All jobs on the RAL batch system run in distinct cgroups, using the following cgroup subsystems: cpu, cpuacct, memory, freezer and blkio. No memory limits are applied by HTCondor itself; instead, soft memory limits are applied via cgroups. The cgroup attribute ''memory.soft_limit_in_bytes'' for each job is set to the amount of memory requested by the job. A job is allowed to exceed this limit while there is free memory available on the system; only when other processes contend for physical memory will the kernel force memory into swap and push the job's physical memory usage back towards the assigned limit.

In addition, for the htcondor cgroup we set ''memory.limit_in_bytes'' to the physical memory available on the worker node, and ''memory.memsw.limit_in_bytes'' to the sum of the physical memory and 20% of the swap (by default our worker nodes have the same amount of swap as physical memory). This caps the total amount of memory and swap used by all jobs on each worker node.

The condor_starter for each job registers with the cgroup memory controller to be notified when the per-cgroup OOM killer fires, so HTCondor knows when a job has been killed by the OOM killer.
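
HTCondor manages these cgroups itself, but as a rough sketch of the arithmetic described above, the following Python fragment writes the equivalent cgroup v1 attributes by hand. The cgroup paths, the job name and the memory figures are hypothetical, and it assumes the memory controller is mounted at /sys/fs/cgroup/memory with the parent htcondor cgroup already created.

<pre>
import os

# Hypothetical figures for illustration: a job requesting 2 GiB on a worker
# node with 64 GiB of RAM and 64 GiB of swap (swap equal to RAM, as at RAL).
GIB = 1024 ** 3
requested_mem = 2 * GIB
physical_mem = 64 * GIB
swap = 64 * GIB

base = "/sys/fs/cgroup/memory/htcondor"                # parent cgroup for all jobs
job_cgroup = os.path.join(base, "condor_example_job")  # hypothetical per-job cgroup


def write_attr(cgroup_path, attr, value):
    """Write a single cgroup v1 memory attribute."""
    with open(os.path.join(cgroup_path, attr), "w") as f:
        f.write(str(value))


# Per-job soft limit: the amount of memory the job requested.  The job may
# exceed this while the node has free memory; the kernel only pushes its
# usage back towards the limit under memory contention.
write_attr(job_cgroup, "memory.soft_limit_in_bytes", requested_mem)

# Parent htcondor cgroup hard limits: all of the physical memory, plus 20% of
# the swap for memory+swap, capping the total used by all jobs on the node.
write_attr(base, "memory.limit_in_bytes", physical_mem)
write_attr(base, "memory.memsw.limit_in_bytes", physical_mem + int(0.2 * swap))
</pre>

Putting the soft limit on the per-job cgroup and the hard limits only on the parent htcondor cgroup is what lets an individual job burst above its request while the node as a whole stays within its physical memory plus a small swap allowance.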
 
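The OOM notification mentioned above uses the standard cgroup v1 eventfd mechanism: a process opens the cgroup's ''memory.oom_control'' file, creates an eventfd, and registers the pair through ''cgroup.event_control''. The sketch below shows that registration in Python; it is an illustration of the mechanism rather than HTCondor's own code, and the cgroup path is hypothetical.

<pre>
import os
import struct

# Hypothetical per-job cgroup created under the htcondor parent cgroup.
cgroup = "/sys/fs/cgroup/memory/htcondor/condor_example_job"

# Open the cgroup's OOM control file and create an eventfd to receive events
# (os.eventfd needs Python 3.10+; older versions can call eventfd via ctypes).
oom_fd = os.open(os.path.join(cgroup, "memory.oom_control"), os.O_RDONLY)
event_fd = os.eventfd(0)

# Register the (eventfd, oom_control) pair: the kernel will signal event_fd
# whenever the per-cgroup OOM killer fires.
with open(os.path.join(cgroup, "cgroup.event_control"), "w") as f:
    f.write(f"{event_fd} {oom_fd}")

# Block until an OOM event arrives; the 8-byte counter says how many fired.
count = struct.unpack("Q", os.read(event_fd, 8))[0]
print(f"OOM killer fired {count} time(s) in {cgroup}")
</pre>
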
For more information about cgroups, see for example https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/ch01.html
 