Enable Cgroups in HTCondor

From GridPP Wiki
Jump to: navigation, search

Following are the steps to enable cgroups on an HTCondor WN. In this example we use node067 as a representative node:

1. ensure libcgroup package is installed, if not, yum install it.

node067:~# rpm -qa | grep cgroup
libcgroup-0.37-7.el6.x86_64

2. add a group htcondor to /etc/cgconfig.conf

node067:~# cat /etc/cgconfig.conf
mount {
      cpu     = /cgroup/cpu;
      cpuset  = /cgroup/cpuset;
      cpuacct = /cgroup/cpuacct;
      devices = /cgroup/devices;
      memory  = /cgroup/memory;
      freezer = /cgroup/freezer;
      net_cls = /cgroup/net_cls;
      blkio   = /cgroup/blkio;
}
group htcondor {
      cpu {}
      cpuacct {}
      memory {}
      freezer {}
      blkio {}
}

If you want to set some memory restriction for the cgroup, add the following in memory{}, The following example ensures that the maximum physical memory can be used by the cgroup is 132122908K; and the maximum (physical memory + swap) can be used by the cgroup is 158966452K. This will protect the machine's swap space from getting overloaded.

memory {
          memory.use_hierarchy="1";
          memory.limit_in_bytes=132122908K;
          memory.memsw.limit_in_bytes=158966452K;
       }


3. start the cgconfig daemon, a directory htcondor will be created under /cgroup/*/

node067:~#  service cgconfig start;
node067:~#  chkconfig cgconfig on
node067:~# ll -d  /cgroup/memory/htcondor/
drwxr-xr-x. 66 root root 0 Oct  9 11:58 /cgroup/memory/htcondor/

4. in the condor WN configuration , add the following lines for STARTD daemon and then restart the startd daemon:

# Enable CGROUP control
BASE_CGROUP = htcondor
# hard: job can't access more physical memory than allocated
# soft: job can access more physical memory than allocated when there are free memory
CGROUP_MEMORY_LIMIT_POLICY = soft

Then when there are jobs running on this WN, there will be a list of condor_tmp_condor_slot* directories created under /cgroup/*/htcondor/:

node067:~# ll -d  /cgroup/memory/htcondor/condor_tmp_condor_slot1_*
drwxr-xr-x. 2 root root 0 Oct  8 09:22 /cgroup/memory/htcondor/condor_tmp_condor_slot1_10@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 06:26 /cgroup/memory/htcondor/condor_tmp_condor_slot1_11@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 05:02 /cgroup/memory/htcondor/condor_tmp_condor_slot1_12@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 05:18 /cgroup/memory/htcondor/condor_tmp_condor_slot1_13@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 10:42 /cgroup/memory/htcondor/condor_tmp_condor_slot1_14@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  8 12:32 /cgroup/memory/htcondor/condor_tmp_condor_slot1_15@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 06:52 /cgroup/memory/htcondor/condor_tmp_condor_slot1_16@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 08:43 /cgroup/memory/htcondor/condor_tmp_condor_slot1_17@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 06:14 /cgroup/memory/htcondor/condor_tmp_condor_slot1_18@node067.beowulf.cluster

From these directories you can retrieve the recorded information.


More information can be found in the condor manual: