Enable Cgroups in HTCondor

Introduction

The CGROUPS section of the HTCondor manual is useful as background reading. In summary, there are two problems with the default way that HTCondor imposes memory limits. First, limits are imposed per process, not per job. Since jobs can have many processes, it is easy for a job to use more memory than the admin would wish. Second, the memory limit only applies to the virtual memory size, not the physical memory size or the resident set size. The whole point of virtual memory is that a job can use far more of it than would be possible with the physical RAM alone; the downside is that heavy use of virtual memory slows the system down immensely because of swapping to disk. So the administrator would much prefer to control physical memory. The suggested solution is to use Linux CGROUPS to apply limits to the physical memory used by the set of processes that make up the job. The changes required on the nodes to enact this are shown in the sections below.

Note in particular that this config imposes "soft limits". With this in place, the job is allowed to go over the limit if there is free memory available on the system. Only when there is contention between other jobs for physical memory will the system force physical memory into swap. If the job exceeds both its physical memory and swap space allotment, it will be killed by the Linux Out-of-Memory killer. Also, to understand what follows, you need to be aware of an inconsistency: HTCondor measures virtual memory in KB, and physical memory in MB.

Note that there is also a separate but related topic concerning Nordugrid ARC integration with HTCondor; if you use that setup, this might apply to you. HTCondor has its own mechanism for killing jobs that grow too big, called PERIODIC_REMOVE. This is an expression in the job's ClassAd that is evaluated periodically, and it works independently of the CGROUPS mechanism above. When a job arrives in the ARC CE, the interface code to HTCondor populates PERIODIC_REMOVE with "ResidentSetSize > JobMemoryLimit". The JobMemoryLimit is determined precisely by the memory (i.e. request_memory) option in the JDL times the number of cores, and the ResidentSetSize is measured periodically by HTCondor. If ResidentSetSize grows too big, PERIODIC_REMOVE becomes true and the job is killed. The important thing to know is that this arrangement kills ATLAS jobs unnecessarily because the limits are too tight, so it is necessary to suppress it. Various schemes to do this at Glasgow, RAL, Liverpool and GRIF/IRFU are discussed at the end of this document.
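As an illustration (the numbers are hypothetical and follow the arithmetic of the ARC interface scripts shown later in this document): an 8-core job requesting 2000 MB per core would reach HTCondor with something like

 request_memory=16000                                        # total MB for the whole job
 +JobMemoryLimit=16384000                                    # 8 * 2000 MB expressed in KB
 periodic_remove = ... || ResidentSetSize > JobMemoryLimit   # the clause the sites below suppress or loosen

so once the job's resident set exceeds the limit, HTCondor removes it regardless of whether the node is actually under memory pressure.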

Workernode configuration

CentOS7

In CentOS7, cgroups are set up automatically and all the systemd services are enabled. So all that remains is to enable them in HTCondor by setting the following in the WN configuration:

# Enable CGROUP control
BASE_CGROUP = /system.slice/condor.service
CGROUP_MEMORY_LIMIT_POLICY = soft

With these settings condor (at least in version 8.6.6) sets:

soft limit = the requested memory

memsw limit = the machine's total memory, i.e. RAM + swap

The job's resources will appear under

/sys/fs/cgroup/{cpu,cpuacct,memory,...}/system.slice/condor.service/condor_var_lib_condor_execute_slot1_*@WN.FQDN.HERE/

unless you are using the docker universe, in which case you have to look for the docker entries instead. You can find some discussion about this here.

As described here, the RSS and swap values appear in the memory.stat file. There is also a system command line tool (described here) which reports the sum of RSS+swap+cache; cached files are the first to go when the kernel reclaims memory.

systemd-cgtop
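For example, a minimal sketch of inspecting one slot by hand (the slot directory name and FQDN here are illustrative; use a directory that actually exists under the path shown above):

 # Illustrative only: pick a real slot directory from /sys/fs/cgroup/memory/system.slice/condor.service/
 cd /sys/fs/cgroup/memory/system.slice/condor.service/condor_var_lib_condor_execute_slot1_1@wn.example.org
 cat memory.soft_limit_in_bytes                  # soft limit, i.e. the job's requested memory
 cat memory.memsw.limit_in_bytes                 # RAM+swap limit (file only exists if swap accounting is enabled)
 grep -E '^total_(rss|swap|cache) ' memory.stat  # per-job RSS, swap and page-cache usage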

More info on the interaction between cgroups and condor can be found here.

SL6

Following are the steps to enable cgroups on an SL6 HTCondor WN. In this example we use node067 as a representative node.

1. Ensure the libcgroup package is installed; if not, install it with yum.
node067:~# rpm -qa | grep cgroup
libcgroup-0.37-7.el6.x86_64
2. Add a group htcondor to /etc/cgconfig.conf:
node067:~# cat /etc/cgconfig.conf
mount {
      cpu     = /cgroup/cpu;
      cpuset  = /cgroup/cpuset;
      cpuacct = /cgroup/cpuacct;
      devices = /cgroup/devices;
      memory  = /cgroup/memory;
      freezer = /cgroup/freezer;
      net_cls = /cgroup/net_cls;
      blkio   = /cgroup/blkio;
}
group htcondor {
      cpu {}
      cpuacct {}
      memory {}
      freezer {}
      blkio {}
}
3. Start the cgconfig daemon and enable it at boot; a directory htcondor will be created under /cgroup/*/:
node067:~#  service cgconfig start;
node067:~#  chkconfig cgconfig on
node067:~# ll -d  /cgroup/memory/htcondor/
drwxr-xr-x. 66 root root 0 Oct  9 11:58 /cgroup/memory/htcondor/
4. In the condor WN configuration, add the following lines for the STARTD daemon and then restart it:
# Enable CGROUP control
BASE_CGROUP = htcondor
# hard: job can't access more physical memory than allocated
# soft: job can access more physical memory than allocated when there is free memory
CGROUP_MEMORY_LIMIT_POLICY = soft

Then, when jobs are running on this WN, a set of condor_tmp_condor_slot* directories will be created under /cgroup/*/htcondor/:

node067:~# ll -d  /cgroup/memory/htcondor/condor_tmp_condor_slot1_*
drwxr-xr-x. 2 root root 0 Oct  8 09:22 /cgroup/memory/htcondor/condor_tmp_condor_slot1_10@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 06:26 /cgroup/memory/htcondor/condor_tmp_condor_slot1_11@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 05:02 /cgroup/memory/htcondor/condor_tmp_condor_slot1_12@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 05:18 /cgroup/memory/htcondor/condor_tmp_condor_slot1_13@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 10:42 /cgroup/memory/htcondor/condor_tmp_condor_slot1_14@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  8 12:32 /cgroup/memory/htcondor/condor_tmp_condor_slot1_15@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 06:52 /cgroup/memory/htcondor/condor_tmp_condor_slot1_16@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 08:43 /cgroup/memory/htcondor/condor_tmp_condor_slot1_17@node067.beowulf.cluster
drwxr-xr-x. 2 root root 0 Oct  9 06:14 /cgroup/memory/htcondor/condor_tmp_condor_slot1_18@node067.beowulf.cluster

From these directories you can retrieve the recorded information.
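As a minimal sketch (the slot directory name is taken from the listing above; use one that is actually present on your node), the limit and usage can be read straight from the cgroup files:

 # Illustrative only: inspect the cgroup of one running slot
 cd /cgroup/memory/htcondor/condor_tmp_condor_slot1_10@node067.beowulf.cluster
 cat memory.soft_limit_in_bytes                  # soft limit applied by condor
 grep -E '^total_(rss|swap|cache) ' memory.stat  # physical memory, swap and cache used by the job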

More information can be found in the condor manual.

Glasgow Scheduling Modifications

To improve scheduling on the Glasgow cluster we statically assign a memory amount based on the type of job in the system. This allows fine-grained control over our memory overcommit and allows us to restrict the number of jobs we run on our memory-constrained systems. There are other ways to do this, but this approach allows us to play with the parameters to see what works.

In the submit-condor-job script found in /usr/share/arc/ we alter the following section to look like this:

############################################################## 
# Requested memory (mb)
##############################################################
set_req_mem
if [ ! -z "$joboption_memory" ] ; then
 # NB: despite the name, these values are in KB (JobMemoryLimit is compared against ResidentSetSize, which HTCondor reports in KB)
 memory_bytes=2000*1024
 memory_req=2000
 # HTCondor needs to know the total memory for the job, not memory per core
 if [ ! -z $joboption_count ] && [ $joboption_count -gt 1 ] ; then
    memory_bytes=$(( $joboption_count * 2000 * 1024 ))
    memory_req=$(( $joboption_count * 2000 ))
 fi
 memory_bytes=$(( $memory_bytes + 4000 * 1024  ))  # +4GB extra as hard limit
 echo "request_memory=$memory_req" >> $LRMS_JOB_DESCRIPT
 echo "+JobMemoryLimit=$memory_bytes" >> $LRMS_JOB_DESCRIPT
 REMOVE="${REMOVE} || ResidentSetSize > JobMemoryLimit"
fi
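For a hypothetical 8-core job, this fragment appends lines like the following to the HTCondor job description (the values follow directly from the arithmetic above):

 request_memory=16000        # 8 cores * 2000 MB
 +JobMemoryLimit=20480000    # (8 * 2000 + 4000) MB expressed in KB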

RAL Modifications

This is the scheme used at RAL to suppress the activity of PERIODIC_REMOVE, so that it only fires in extreme conditions. It gets rid of the PERIODIC_REMOVE memory constraint and uses SYSTEM_PERIODIC_REMOVE (which applies to all jobs) instead to implement a wide margin.

In the submit-condor-job script found in /usr/share/arc/ we comment out the line:

 REMOVE="${REMOVE} || ResidentSetSize > JobMemoryLimit"

to ensure that jobs are not killed if the memory usage exceeds the requested memory. In order to put a hard memory limit on jobs we include the following in SYSTEM_PERIODIC_REMOVE:

 ResidentSetSize > 3000*RequestMemory
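Since ResidentSetSize is reported in KB while RequestMemory is in MB, the factor of 3000 allows roughly three times the requested memory before a job is removed. A minimal sketch of how such a clause might look in the schedd configuration, assuming no other clauses are already OR'd into SYSTEM_PERIODIC_REMOVE at your site:

 # Sketch only: merge this with any SYSTEM_PERIODIC_REMOVE expression your site already defines
 SYSTEM_PERIODIC_REMOVE = (ResidentSetSize > 3000*RequestMemory)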

Liverpool Modifications

This is the scheme we use at Liverpool to suppress the activity of PERIODIC_REMOVE, so that it only fires in extreme conditions. To avoid patching the interface, we use a script to boost the memory limit. We put in this cron job to execute the script every five minutes:

# crontab -l
*/5 * * * * /root/scripts/boostJobMemoryLimit.pl 3.0 >> /root/scripts/boostJobMemoryLimit.log

The script, shown below, multiplies the JobMemoryLimit of each queued job by the given boost factor, recording the jobs it has already boosted so that each is only boosted once.

#!/usr/bin/perl
# Boost the JobMemoryLimit of queued jobs by a given factor, once per job
use strict;
use warnings;

if ($#ARGV != 0) {
  die("Give a boost factor , e.g. 2.0\n");
}

if (! -f "/root/scripts/boosted.log") {
  system("touch /root/scripts/boosted.log");
}

# Jobs already boosted are recorded in boosted.log; read them in so each job is only boosted once
my %boosted;
open(BOOSTED,"/root/scripts/boosted.log") or die("Cannot open boosted file\n");
while(<BOOSTED>) {
  my $j = $_;
  chomp($j);
  $boosted{$j} = 1;
}
close(BOOSTED);

my $boostFactor = shift;
my @jobs;
# Collect the IDs (cluster.proc) of queued jobs that have not been boosted yet
open(JOBS,"condor_q |") or die("Cannot get job list\n");
while(<JOBS>) {
  my $line = $_;
  if ($line =~ /^(\d+\.\d+)/) {
    my $j = $1;
    if (! defined($boosted{$j})) {
      push (@jobs,$j);
    }
  }
}
close(JOBS);
open(BOOSTED,">>/root/scripts/boosted.log") or die("Cannot append to the boosted.log file\n");
# For each new job, find its JobMemoryLimit, multiply it by the boost factor and write it back with condor_qedit
foreach my $j (@jobs) {
  open(ATTRIBS,"condor_q -long $j|") or die("Could not get details for job $j\n");
  while (<ATTRIBS>) {
    my $attrib = $_;
    if ($attrib =~ /(.*) = (.*)/) {
      my $aName = $1;
      my $aVal = $2;
      if ($aName eq 'JobMemoryLimit') {
        my $newJobMemoryLimit = int($aVal * $boostFactor);
        print("Boosting JobMemoryLimit for job  $j from $aVal to $newJobMemoryLimit\n");
        open(QEDIT,"condor_qedit $j JobMemoryLimit $newJobMemoryLimit|") or die("Could not boost $j\n");
        my $response = <QEDIT>;
        if ($response !~ /Set attribute.*JobMemoryLimit/) {
          die ("Failed when boosting  $j, message was $response\n");
        }
        close(QEDIT);
        print BOOSTED "$j\n";
      }
    }
  }
  close(ATTRIBS);
}
close(BOOSTED);


GRIF/IRFU Modifications

In addition to RAL's modification, we recently made sure that ATLAS jobs aren't automatically killed or held when they reach their memory requirements while cgroups are malfunctioning. The reason for this is that ATLAS multicore jobs are submitted with exactly 2GB per core (hence usually 16GB required), but sometimes the condor cgroups aren't set up properly and the hard memory limit is then set, in the best cases, to 16GB, preventing the jobs from swapping and putting them on hold. This then causes lost-heartbeat errors for ATLAS.

Our ARC modification now looks like this:

 ##############################################################
 # Requested memory (mb)
 ##############################################################
 set_req_mem
 if [ ! -z "$joboption_memory" ] ; then
   memory_kbytes=$(( $joboption_memory * 1024 ))
   memory_req=$(( $joboption_memory ))
   mult_factor=1
   div_factor=1         
   # HTCondor needs to know the total memory for the job, not memory per core
   if [ ! -z $joboption_count ] && [ $joboption_count -gt 0 ] ; then    
     # update 2017-03-07 : make sure jobs request VMEM : request 1.5x more mem if mem requirement is lower than 2GB (arbitrary) per core
     # note : bash only handles ints, so make sure x1.5 = x 15 / 10
     if [ $memory_req -le 2048 ] ; then
        mult_factor=15
        div_factor=10
     fi
     memory_kbytes=$(( memory_kbytes * joboption_count * mult_factor / div_factor ))
     memory_req=$(( memory_req * joboption_count * mult_factor / div_factor ))
   fi
   echo "request_memory=$memory_req" >> $LRMS_JOB_DESCRIPT
   echo "+JobMemoryLimit = $memory_kbytes" >> $LRMS_JOB_DESCRIPT
   # incompatible with cgroups : jobs are killed instead of starting to swap.
   # see : https://www.gridpp.ac.uk/wiki/Enable_Cgroups_in_HTCondor
   #REMOVE="${REMOVE} || ResidentSetSize > JobMemoryLimit"
 fi
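For example, with this fragment an 8-core job requesting 2000 MB per core falls under the 1.5x rule, so the emitted lines would be (the values follow directly from the arithmetic above):

 request_memory=24000          # 8 * 2000 MB * 1.5
 +JobMemoryLimit = 24576000    # the same amount expressed in KB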

When cgroups are not working, this can be seen in /var/log/condor/StarterLog.slot*, where one can see lines such as these:

 03/06/17 04:28:26 (pid:1158843) Limiting (soft) memory usage to 16777216000 bytes
 03/06/17 04:28:26 (pid:1158843) Limiting (hard) memory usage to 9135382528 bytes
 03/06/17 04:28:26 (pid:1158843) Unable to commit memory soft limit for htcondor/condor_home_condor_slot1_4@wn358.stripped.domain : 50016 Invalid argument
 03/06/17 04:28:26 (pid:1158843) Limiting memsw usage to 9135386624 bytes
 03/06/17 04:28:26 (pid:1158843) Unable to commit memsw limit for htcondor/condor_home_condor_slot1_4@wn358.stripped.domain : 50016 Invalid argument
 03/06/17 04:28:26 (pid:1158843) Unable to commit CPU shares for htcondor/condor_home_condor_slot1_4@wn358.stripped.domain: 50016 Invalid argument
 03/06/17 05:14:44 (pid:1158843) Hold all jobs
 03/06/17 05:14:44 (pid:1158843) Job was held due to OOM event: Job has gone over memory limit of 16000 megabytes.

One solution to this, apparently, is to stop condor, restart the cgconfig service, and then start condor again (note that this will kill running jobs...).
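A minimal sketch of that recovery sequence, assuming SL6-style service management as in the SL6 section above (again, this kills the jobs currently running on the node):

 # Sketch only: drain the node or accept job loss before doing this on a production WN
 service condor stop
 service cgconfig restart
 service condor start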

In order to detect this (in addition to seeing many jobs held), we added a nagios test looking for this regex:

 'Unable to commit [a-zA-Z0-9]* (limit|share) for htcondor'

Be warned that these errors do not vanish quickly from the logs, so if you want to filter out old errors you might consider setting the condor logging time format to something like this and filtering on epoch time, as we did:

 DEBUG_TIME_FORMAT = %s %Y/%m/%d %H:%M:%S
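For instance, a minimal sketch of such a filter, assuming the DEBUG_TIME_FORMAT above so that each log line starts with an epoch timestamp (the log file name is illustrative):

 # Sketch only: report matching errors from the last hour of one starter log
 now=$(date +%s)
 awk -v now="$now" '$1 > now - 3600' /var/log/condor/StarterLog.slot1_1 \
   | grep -E 'Unable to commit [a-zA-Z0-9]* (limit|share) for htcondor'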