Enable Cgroups in HTCondor

Introduction

The CGROUPS section of the HTCondor manual (http://research.cs.wisc.edu/htcondor/manual/v8.4/3_12Setting_Up.html#SECTION0041214000000000000000) is useful as background reading. In summary, there are two problems with the default way that HTCondor imposes resource limits on memory. First, limits are imposed on a per-process basis, not per job. Since jobs can have many processes, it is easy for a job to blow the top off its limit. Second, the memory limit only applies to the virtual memory size, not the physical memory size (the resident set size). The administrator would prefer to control physical memory.

The suggested solution is to use Linux CGROUPS to apply limits to the physical memory used by the set of processes that make up the job. The changes required on the nodes to enact this are shown in the sections below. Note in particular that this configuration imposes "soft limits": the job is allowed to go over its limit if there is free memory available on the system. Only when there is contention for physical memory from other processes will the system force the job's memory into swap. If the job then exceeds both the physical memory and the swap space, it will be killed by the Linux Out-of-Memory killer. Also, to understand what follows, you need to be aware of an important inconsistency: HTCondor measures virtual memory in kilobytes and physical memory in megabytes.
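To make the soft/hard distinction concrete, this is roughly how it appears at the cgroup (v1) level on an EL6 node: the memory controller has separate soft-limit and hard-limit files, and only the hard limit is enforced unconditionally. The group name and values below are purely illustrative; HTCondor manages the per-job cgroups itself, so you do not set these files by hand:

node067:~# # Illustrative only: a hypothetical cgroup named "example"
node067:~# echo $(( 2000 * 1024 * 1024 )) > /cgroup/memory/example/memory.soft_limit_in_bytes   # may be exceeded while memory is plentiful; reclaimed under contention
node067:~# echo $(( 4000 * 1024 * 1024 )) > /cgroup/memory/example/memory.limit_in_bytes        # hard cap on physical memory for the group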


Workernode configuration

Following are the steps to enable cgroups on an HTCondor WN. In this example we use node067 as a representative node.


1. Ensure the libcgroup package is installed; if not, install it with yum (see below).

node067:~# rpm -qa | grep cgroup
libcgroup-0.37-7.el6.x86_64
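
If the package is missing, it can be installed in the usual way, for example:

node067:~# yum install libcgroup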

2. Add a group htcondor to /etc/cgconfig.conf:

node067:~# cat /etc/cgconfig.conf
mount {
      cpu     = /cgroup/cpu;
      cpuset  = /cgroup/cpuset;
      cpuacct = /cgroup/cpuacct;
      devices = /cgroup/devices;
      memory  = /cgroup/memory;
      freezer = /cgroup/freezer;
      net_cls = /cgroup/net_cls;
      blkio   = /cgroup/blkio;
}
group htcondor {
      cpu {}
      cpuacct {}
      memory {}
      freezer {}
      blkio {}
}

3. Start the cgconfig daemon; a directory htcondor will be created under /cgroup/*/:

node067:~# service cgconfig start
node067:~# chkconfig cgconfig on
node067:~# ll -d  /cgroup/memory/htcondor/
drwxr-xr-x. 66 root root 0 Oct  9 11:58 /cgroup/memory/htcondor/
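
It is worth confirming that the group exists under every controller defined in cgconfig.conf, not just the memory one (the output shown is illustrative and will match whatever controllers are mounted on the node):

node067:~# ls -d /cgroup/*/htcondor
/cgroup/blkio/htcondor  /cgroup/cpu/htcondor  /cgroup/cpuacct/htcondor  /cgroup/freezer/htcondor  /cgroup/memory/htcondor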

4. In the HTCondor WN configuration, add the following lines for the STARTD daemon and then restart the startd daemon:

# Enable CGROUP control
BASE_CGROUP = htcondor
# hard: the job can't access more physical memory than allocated
# soft: the job can access more physical memory than allocated when there is free memory
CGROUP_MEMORY_LIMIT_POLICY = soft
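
One way to check that the new settings have been picked up, and then restart the daemons (restarting the whole condor service via the EL6 init script is heavier than strictly necessary; restarting just the startd would also do):

node067:~# condor_config_val BASE_CGROUP CGROUP_MEMORY_LIMIT_POLICY
htcondor
soft
node067:~# service condor restart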

Then, when jobs are running on this WN, a set of condor_tmp_condor_slot* directories will be created under /cgroup/*/htcondor/:

node067:~# ll -d  /cgroup/memory/htcondor/condor_tmp_condor_slot1_*
drwxr-xr-x. 2 root root 0 Oct  8 09:22 /cgroup/memory/htcondor/condor_tmp_condor_slot1_10@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 06:26 /cgroup/memory/htcondor/condor_tmp_condor_slot1_11@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 05:02 /cgroup/memory/htcondor/condor_tmp_condor_slot1_12@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 05:18 /cgroup/memory/htcondor/condor_tmp_condor_slot1_13@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 10:42 /cgroup/memory/htcondor/condor_tmp_condor_slot1_14@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  8 12:32 /cgroup/memory/htcondor/condor_tmp_condor_slot1_15@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 06:52 /cgroup/memory/htcondor/condor_tmp_condor_slot1_16@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 08:43 /cgroup/memory/htcondor/condor_tmp_condor_slot1_17@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 06:14 /cgroup/memory/htcondor/condor_tmp_condor_slot1_18@node067.beowulf.cluster

From these directories you can retrieve the accounting information the kernel records for each job's cgroup.
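
For example, the current and peak physical memory use of a particular job can be read straight out of its slot's memory cgroup (any of the directories listed above will do):

node067:~# cd /cgroup/memory/htcondor/condor_tmp_condor_slot1_10@node067.beowulf.cluster
node067:~# cat memory.usage_in_bytes        # current physical memory used by all processes in the job
node067:~# cat memory.max_usage_in_bytes    # peak physical memory recorded for the job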

More information can be found in the HTCondor manual: http://research.cs.wisc.edu/htcondor/manual/v8.4/3_12Setting_Up.html#SECTION0041214000000000000000


Glasgow Scheduling Modifications

To improve scheduling on the Glasgow cluster we statically assign a memory amount based on the type of job in the system. This allows fine-grained control over our memory overcommit and allows us to restrict the number of jobs we run on our memory-constrained systems. There are other ways to do this, but this approach lets us play with the parameters to see what works.

In the submit-condor-job script found in /usr/share/arc/ we alter the following section to look like this:

##############################################################
# Requested memory (mb)
##############################################################
set_req_mem
if [ ! -z "$joboption_memory" ] ; then
 # Statically assign 2000 MB per core; JobMemoryLimit is in KiB, the same
 # units as ResidentSetSize, which it is compared against below.
 memory_bytes=$(( 2000 * 1024 ))
 memory_req=2000
 # HTCondor needs to know the total memory for the job, not memory per core
 if [ ! -z "$joboption_count" ] && [ "$joboption_count" -gt 1 ] ; then
    memory_bytes=$(( $joboption_count * 2000 * 1024 ))
    memory_req=$(( $joboption_count * 2000 ))
 fi
 memory_bytes=$(( $memory_bytes + 4000 * 1024 ))  # +4GB extra as hard limit
 echo "request_memory=$memory_req" >> $LRMS_JOB_DESCRIPT
 echo "+JobMemoryLimit=$memory_bytes" >> $LRMS_JOB_DESCRIPT
 REMOVE="${REMOVE} || ResidentSetSize > JobMemoryLimit"
fi
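
As a worked example of what this appends to the generated job description (the figures follow directly from the script above): a single-core job gets request_memory=2000 and +JobMemoryLimit=6144000 (2000*1024 + 4000*1024 KiB), while a 4-core job gets:

request_memory=8000
+JobMemoryLimit=12288000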

RAL Modifications

In the submit-condor-job script found in /usr/share/arc/ we comment out the line:

 REMOVE="${REMOVE} || ResidentSetSize > JobMemoryLimit"

to ensure that jobs are not killed if their memory usage exceeds the requested memory. In order to put a hard memory limit on jobs we include the following in SYSTEM_PERIODIC_REMOVE:

 ResidentSetSize > 3000*RequestMemory
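
A sketch of how that clause might sit in the schedd configuration (only the ResidentSetSize > 3000*RequestMemory expression comes from the text above; any other site-specific removal clauses would be OR'd in alongside it). The factor is 3000 rather than 3 because ResidentSetSize is reported in KiB while RequestMemory is in MiB, so this removes jobs using roughly three times the memory they requested:

# Illustrative sketch only
SYSTEM_PERIODIC_REMOVE = ResidentSetSize > 3000*RequestMemory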