Enable Cgroups in HTCondor

Introduction

The CGROUPS section of the HTCondor manual is useful as background reading. In summary: there are two problems with the default way that HTCondor imposes resource limits on memory. First, limits are imposed on a per-process basis, not per job. Since jobs can have many processes, it is easy for a job to blow through its limit. Second, the memory limit only applies to the virtual memory size, not the physical memory size or the resident set size. The administrator would much prefer to control physical memory. The suggested solution is to use Linux CGROUPS to apply limits to the physical memory used by the set of processes that make up the job. The changes required on the nodes to enact this are shown in the sections below.

Note in particular that this config imposes "soft limits". With this in place, the job is allowed to go over the limit if there is free memory available on the system. Only when there is contention for physical memory from other processes will the system force the job's memory into swap. If the job exceeds both its physical memory and swap space allotment, it will be killed by the Linux Out-of-Memory killer. Also, to understand what follows, you need to be aware of an inconsistency in units: HTCondor expresses the requested memory (RequestMemory) in MB, but the measured ResidentSetSize in KB.
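As a quick worked example of the unit difference (the figures are illustrative only), a request of 2000 MB has to be multiplied by 1024 before it can be compared against ResidentSetSize:

 2000 MB * 1024 = 2048000 KB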

And there is also a related topic to cover before we can go on. HTCondor has its own mechanism for killing jobs that grow too big, called PERIODIC_REMOVE. This is an expression in the job's ClassAd that is evaluated every so often, and it works independently of the CGROUPS mechanism above. When a job arrives in the (e.g.) ARC CE, the interface code to HTCondor populates PERIODIC_REMOVE with "ResidentSetSize > JobMemoryLimit". The JobMemoryLimit is set to the memory (i.e. request_memory) option in the JDL times the number of cores, and the ResidentSetSize is measured periodically by HTCondor. If ResidentSetSize grows too big, PERIODIC_REMOVE evaluates to true and the job is killed. The important thing to know is that this arrangement kills ATLAS jobs unnecessarily, so it needs to be suppressed. Various schemes for doing this are discussed below.
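The expression ends up in the job ClassAd as the PeriodicRemove attribute, so the relevant attributes can be inspected for a queued or running job with condor_q (the job id 1234.0 below is just a placeholder):

 # condor_q -long 1234.0 | egrep 'PeriodicRemove|JobMemoryLimit|ResidentSetSize|RequestMemory'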

Workernode configuration

Following are the steps to enable cgroups on an HTCondor WN. In this example we use node067 as a representative node.


1. Ensure the libcgroup package is installed; if not, install it with yum as shown below.

node067:~# rpm -qa | grep cgroup
libcgroup-0.37-7.el6.x86_64
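
If the package is missing, install it with yum and repeat the check:

node067:~# yum install -y libcgroup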

2. Add a group htcondor to /etc/cgconfig.conf:

node067:~# cat /etc/cgconfig.conf
mount {
      cpu     = /cgroup/cpu;
      cpuset  = /cgroup/cpuset;
      cpuacct = /cgroup/cpuacct;
      devices = /cgroup/devices;
      memory  = /cgroup/memory;
      freezer = /cgroup/freezer;
      net_cls = /cgroup/net_cls;
      blkio   = /cgroup/blkio;
}
group htcondor {
      cpu {}
      cpuacct {}
      memory {}
      freezer {}
      blkio {}
}

3. Start the cgconfig daemon; a directory htcondor will be created under /cgroup/*/.

node067:~#  service cgconfig start;
node067:~#  chkconfig cgconfig on
node067:~# ll -d  /cgroup/memory/htcondor/
drwxr-xr-x. 66 root root 0 Oct  9 11:58 /cgroup/memory/htcondor/
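
As an additional check, the lssubsys tool (also part of the libcgroup package) lists which controllers are mounted and where; the output should match the mount section of /etc/cgconfig.conf above:

node067:~# lssubsys -am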

4. In the condor WN configuration, add the following lines for the STARTD daemon and then restart it (an example restart command is shown after the config):

# Enable CGROUP control
BASE_CGROUP = htcondor
# hard: job can't access more physical memory than allocated
# soft: job can access more physical memory than allocated when there is free memory
CGROUP_MEMORY_LIMIT_POLICY = soft
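
The exact restart method is site-specific; on an EL6 worker node, for example, the init script can be used:

node067:~# service condor restart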

Then, when there are jobs running on this WN, a set of condor_tmp_condor_slot* directories will be created under /cgroup/*/htcondor/:

node067:~# ll -d  /cgroup/memory/htcondor/condor_tmp_condor_slot1_*
drwxr-xr-x. 2 root root 0 Oct  8 09:22 /cgroup/memory/htcondor/condor_tmp_condor_slot1_10@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 06:26 /cgroup/memory/htcondor/condor_tmp_condor_slot1_11@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 05:02 /cgroup/memory/htcondor/condor_tmp_condor_slot1_12@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 05:18 /cgroup/memory/htcondor/condor_tmp_condor_slot1_13@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 10:42 /cgroup/memory/htcondor/condor_tmp_condor_slot1_14@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  8 12:32 /cgroup/memory/htcondor/condor_tmp_condor_slot1_15@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 06:52 /cgroup/memory/htcondor/condor_tmp_condor_slot1_16@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 08:43 /cgroup/memory/htcondor/condor_tmp_condor_slot1_17@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 06:14 /cgroup/memory/htcondor/condor_tmp_condor_slot1_18@node067.beowulf.cluster

From these directories you can retrieve the recorded accounting information, for example:
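
The standard cgroup memory controller files in a slot directory (name taken from the listing above) give the current and peak physical memory use of that job:

node067:~# cd /cgroup/memory/htcondor/condor_tmp_condor_slot1_10@node067.beowulf.cluster
node067:~# cat memory.usage_in_bytes       # current physical memory use
node067:~# cat memory.max_usage_in_bytes   # peak physical memory use
node067:~# cat memory.soft_limit_in_bytes  # soft limit applied to this slot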

More information can be found in the HTCondor manual.


Glasgow Scheduling Modifications

To improve scheduling on the Glasgow cluster we statically assign a memory amount based on the type of job in the system. This allows fine-grained control over our memory overcommit and lets us restrict the number of jobs we run on our memory-constrained systems. There are other ways to do this, but this approach lets us play with the parameters to see what works.

In the submit-condor-job script found in /usr/share/arc/ we alter the following section to look like the one below:

############################################################## 
# Requested memory (mb)
##############################################################
set_req_mem
if [ ! -z "$joboption_memory" ] ; then
 memory_bytes=2000*1024
 memory_req=2000
 # HTCondor needs to know the total memory for the job, not memory per core
 if [ ! -z $joboption_count ] && [ $joboption_count -gt 1 ] ; then
    memory_bytes=$(( $joboption_count * 2000 * 1024 ))
    memory_req=$(( $joboption_count * 2000 ))
 fi
 memory_bytes=$(( $memory_bytes + 4000 * 1024  ))  # +4GB extra as hard limit
 echo "request_memory=$memory_req" >> $LRMS_JOB_DESCRIPT
 echo "+JobMemoryLimit=$memory_bytes" >> $LRMS_JOB_DESCRIPT
 REMOVE="${REMOVE} || ResidentSetSize > JobMemoryLimit"
fi
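
As a worked example of what this produces, an 8-core job submitted through this scheme ends up with:

 request_memory = 8 * 2000 = 16000 (MB)
 JobMemoryLimit = 8 * 2000 * 1024 + 4000 * 1024 = 20480000 (KB, roughly 20 GB)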

RAL Modifications

This is the scheme used at RAL to suppress the activity of PERIODIC_REMOVE, so that it only fires in extreme conditions. It gets rid of the per-job PERIODIC_REMOVE memory constraint and instead uses SYSTEM_PERIODIC_REMOVE (which applies to all jobs) to implement a wide margin.

In the submit-condor-job script found in /usr/share/arc/ we comment out the line:

 REMOVE="${REMOVE} || ResidentSetSize > JobMemoryLimit"

to ensure that jobs are not killed if the memory usage exceeds the requested memory. In order to put a hard memory limit on jobs we include the following in SYSTEM_PERIODIC_REMOVE:

 ResidentSetSize > 3000*RequestMemory
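
Since ResidentSetSize is in KB and RequestMemory in MB, this allows roughly three times the requested memory before the job is removed. A minimal sketch of how the expression might appear in the schedd configuration (any other terms used at RAL are site-specific and not shown here):

 SYSTEM_PERIODIC_REMOVE = ResidentSetSize > 3000*RequestMemory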

Liverpool Modifications

This is the scheme used at Liverpool to suppress the activity of PERIODIC_REMOVE, so that it only fires in extreme conditions. To avoid patching the interface, we use a script to boost the memory limit. We put in this cron job to execute the script every five minutes:

# crontab -l
*/5 * * * * /root/scripts/boostJobMemoryLimit.pl 3.0 >> /root/scripts/boostJobMemoryLimit.log

The script, shown below, multiplies the JobMemoryLimit of each job in the queue by the given boost factor, and records the jobs it has already boosted so that each job is only boosted once.

#!/usr/bin/perl
use strict;
use warnings;

# Require exactly one argument: the boost factor
if ($#ARGV != 0) {
  die("Give a boost factor, e.g. 2.0\n");
}

if (! -f "/root/scripts/boosted.log") {
  system("touch /root/scripts/boosted.log");
}

my %boosted;
# boosted.log lists the jobs that have already been boosted, one id per line
open(BOOSTED,"/root/scripts/boosted.log") or die("Cannot open boosted file\n");
while(<BOOSTED>) {
  my $j = $_;
  chomp($j);
  $boosted{$j} = 1;
}
close(BOOSTED);

my $boostFactor = shift;
my @jobs;
# Collect the ids of queued jobs that have not been boosted yet
open(JOBS,"condor_q |") or die("Cannot get job list\n");
while(<JOBS>) {
  my $line = $_;
  if ($line =~ /^(\d+\.\d+)/) {
    my $j = $1;
    if (! defined($boosted{$j})) {
      push (@jobs,$j);
    }
  }
}
close(JOBS);
open(BOOSTED,">>/root/scripts/boosted.log") or die("Cannot append to the boosted.log file\n");
foreach my $j (@jobs) {
  open(ATTRIBS,"condor_q -long $j|") or die("Could not get details for job $j\n");
  while (<ATTRIBS>) {
    my $attrib = $_;
    if ($attrib =~ /(.*) = (.*)/) {
      my $aName = $1;
      my $aVal = $2;
      if ($aName eq 'JobMemoryLimit') {
        my $newJobMemoryLimit = int($aVal * $boostFactor);
        print("Boosting JobMemoryLimit for job  $j from $aVal to $newJobMemoryLimit\n");
        open(QEDIT,"condor_qedit $j JobMemoryLimit $newJobMemoryLimit|") or die("Could not boost $j\n");
        my $response = <QEDIT>;
        if ($response !~ /Set attribute.*JobMemoryLimit/) {
          die ("Failed when boosting  $j, message was $response\n");
        }
        close(QEDIT);
        print BOOSTED "$j\n";
      }
    }
  }
  close(ATTRIBS);
}
close(BOOSTED);
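
The script can be run once by hand to check that it behaves as expected before relying on the cron job; jobs that have been boosted are recorded in /root/scripts/boosted.log:

 # /root/scripts/boostJobMemoryLimit.pl 3.0
 # tail /root/scripts/boosted.log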