Example Build of an ARC/Condor Cluster


Introduction

A multicore job is one that needs more than one processor on a node. Until recently, multicore jobs were little used on the grid infrastructure. That has changed now that ATLAS and other large users have asked sites to enable multicore on their clusters.

Unfortunately, it is not just a simple task of setting some parameter on the head node and sitting back while jobs arrive. Different grid systems have varying levels of support for multicore, ranging from non-existent to virtually full support.

This report discusses the multicore configuration at Liverpool. We decided to build a cluster using one of the most capable batch systems currently available, called HTCondor (or CONDOR for short). We also decided to front the system with an ARC CE.

I thank Andrew Lahiff at RAL for the initial configuration and for many suggestions and much help. Links to some of Andrew's material are in the “See Also” section.

Important Documents

You'll need a copy of the ARC System Admin Manual.

http://www.nordugrid.org/documents/arc-ce-sysadm-guide.pdf

And a copy of the Condor System Admin Manual (this one is for 8.2.2).

http://research.cs.wisc.edu/htcondor/manual/v8.2/condor-V8_2_2-Manual.pdf

Infrastructure/Fabric

The multicore cluster consists of an SL6 headnode to run the ARC CE and the Condor batch system. The headnode has a dedicated set of 57 workernodes of various types, providing a total of around 652 single threads of execution, which I shall call unislots, or slots for short.

Head Node

The headnode is a virtual system running on KVM.

Head node hardware
Host Name              OS     CPUs  RAM   Disk Space
hepgrid2.ph.liv.ac.uk  SL6.4  5     4 GB  35 GB

Worker nodes

The physical workernodes are described below.

Worker node hardware
Node names      CPU type  OS     RAM    Disk Space  CPUs per node  Slots used per CPU  Slots used per node  Total nodes  Total CPUs  Total slots  HEPSPEC per slot  Total HEPSPEC
r21-n01 to n04  E5620     SL6.4  24 GB  1.5 TB      2              5                   10                   4            8           40           12.05             482
r21-n05 to n20  X5650     SL6.4  50 GB  2 TB        2              8                   16                   16           32          256          12.29             3147
r22-n01 to n20  E5620     SL6.4  24 GB  1.5 TB      2              5                   10                   20           40          200          12.05             2410
r23-n01 to n10  E5620     SL6.4  24 GB  1.5 TB      2              5                   10                   10           20          100          12.05             1205
r26-n05 to n11  L5420     SL6.4  16 GB  1.7 TB      2              4                   8                    7            14          56           8.86              502
TOTALS                                                                                                      57           114         652                            7745
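The node, CPU and slot totals in the table can be cross-checked with a few lines of Python. This is only a sketch: the rows are copied from the table above, and the names WORKER_GROUPS and cluster_totals are made up for the illustration.

```python
# Rows copied from the worker node table:
# (CPU type, slots used per node, number of nodes).
WORKER_GROUPS = [
    ("E5620", 10, 4),    # r21-n01 to n04
    ("X5650", 16, 16),   # r21-n05 to n20
    ("E5620", 10, 20),   # r22-n01 to n20
    ("E5620", 10, 10),   # r23-n01 to n10
    ("L5420", 8, 7),     # r26-n05 to n11
]
CPUS_PER_NODE = 2  # every node type in the table has two CPUs


def cluster_totals(groups):
    """Return (total nodes, total CPUs, total slots) for the given rows."""
    nodes = sum(count for _, _, count in groups)
    cpus = nodes * CPUS_PER_NODE
    slots = sum(per_node * count for _, per_node, count in groups)
    return nodes, cpus, slots


print(cluster_totals(WORKER_GROUPS))  # (57, 114, 652)
```

The result agrees with the TOTALS row for nodes, CPUs and slots.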


Software Builds and Configuration

There are a few particulars of the Liverpool site that I want to get out of the way to start with. For the initial installation of an operating system on our head nodes and worker nodes, we use tools developed at Liverpool (BuildTools) based on Kickstart, NFS, TFTP and DHCP. The source (synctool.pl and linktool.pl) can be obtained from sjones@hep.ph.liv.ac.uk. Alternatively, similar functionality is said to exist in the Cobbler suite, which is released as Open Source, and some sites have based their initial install on that. Once the OS is on, the first reboot starts Puppet to give a personality to the node. Puppet is becoming something of a de facto standard in its own right, so I'll use some Puppet terminology within this document where some explanation of a particular feature is needed.

Special Software Control Measures

The software for the installation is all contained in various yum repositories. Here at Liverpool, we maintain two mirrored copies of the yum material. One of them, the online repository, is mirrored daily from the Internet. It is not used for any installation. The other copy, termed the local repository, is used to take a snapshot when necessary of the online repository. Installations are done from the local repository. Thus we maintain precise control of the software we use. There is no need to make any further reference to this set-up.

We'll start with the headnode and "work down" so to speak.

Yum repos

This table shows the origin of the software releases via yum repositories.

Yum Repositories
Product Where Yum repo Source Keys
ARC Head node http://download.nordugrid.org/repos/13.11/centos/el6/x86_64/base, http://download.nordugrid.org/repos/13.11/centos/el6/x86_64/updates http://download.nordugrid.org/repos/13.11/centos/el6/source http://download.nordugrid.org/RPM-GPG-KEY-nordugrid
VomsSnooper Head node http://www.sysadmin.hep.ac.uk/rpms/fabric-management/RPMS.vomstools/ null null
Condor Head and Worker http://research.cs.wisc.edu/htcondor/yum/stable/rhel6 null null
WLCG Head and Worker http://linuxsoft.cern.ch/wlcg/sl6/x86_64 null null
Trust anchors Head and Worker http://repository.egi.eu/sw/production/cas/1/current/ null null
Puppet Head and Worker http://yum.puppetlabs.com/el/6/products/x86_64 null null
epel Head and Worker http://download.fedoraproject.org/pub/epel/6/x86_64/ null null
emi Head and Worker http://emisoft.web.cern.ch/emisoft/dist/EMI/3/sl6//x86_64/base, http://emisoft.web.cern.ch/emisoft/dist/EMI/3/sl6//x86_64/third-party, http://emisoft.web.cern.ch/emisoft/dist/EMI/3/sl6//x86_64/updates null null
CernVM-packages Worker http://map2.ph.liv.ac.uk//yum/cvmfs/EL/6.4/x86_64/ null http://cvmrepo.web.cern.ch/cvmrepo/yum/RPM-GPG-KEY-CernVM
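Each row of the table ends up as a stanza in a file under /etc/yum.repos.d. The sketch below shows how such a stanza might be rendered. The function name, the repo ids and the repo names are all invented for the example, and treating a "null" key as gpgcheck=0 is my assumption rather than a statement of what we do at Liverpool.

```python
def repo_stanza(repo_id, name, baseurl, gpgkey=None):
    """Render one yum .repo stanza as text."""
    lines = [
        "[%s]" % repo_id,
        "name=%s" % name,
        "baseurl=%s" % baseurl,
        "enabled=1",
    ]
    if gpgkey:
        # A key URL is listed in the table, so switch signature checking on.
        lines += ["gpgcheck=1", "gpgkey=%s" % gpgkey]
    else:
        # "null" in the Keys column; mapping that to gpgcheck=0 is an assumption.
        lines.append("gpgcheck=0")
    return "\n".join(lines) + "\n"


# Hypothetical repo ids and names; the URLs are taken from the table.
condor = repo_stanza(
    "htcondor-stable",
    "HTCondor stable (EL6)",
    "http://research.cs.wisc.edu/htcondor/yum/stable/rhel6",
)
arc = repo_stanza(
    "nordugrid-base",
    "NorduGrid ARC 13.11 base",
    "http://download.nordugrid.org/repos/13.11/centos/el6/x86_64/base",
    gpgkey="http://download.nordugrid.org/RPM-GPG-KEY-nordugrid",
)
print(condor)
print(arc)
```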


Head Node

Head Standard build

The basis for the initial build follows the standard model for any grid server node at Liverpool. I won't explain that in detail – each site is likely to have its own standard, which is general to all the components used to build any grid node (such as a CE, ARGUS, BDII, TORQUE etc.) but prior to any middleware. Such a baseline build might include networking, iptables, nagios scripts, ganglia, ssh etc.

Head Extra Directories

I had to make these specific directories myself:

/etc/arc/runtime/ENV
/etc/condor/ral
/etc/lcmaps/
/root/glitecfg/services
/root/scripts
/var/spool/arc/debugging
/var/spool/arc/grid
/var/spool/arc/jobstatus
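
If you would rather script this than type it, something like the following would do. It is a minimal sketch: the function name and the root parameter are my own invention (root exists so the function can be tried out in a scratch directory first), and it does not set any ownership or permissions.

```python
import os

# Directory list copied from above.
HEAD_DIRS = [
    "/etc/arc/runtime/ENV",
    "/etc/condor/ral",
    "/etc/lcmaps/",
    "/root/glitecfg/services",
    "/root/scripts",
    "/var/spool/arc/debugging",
    "/var/spool/arc/grid",
    "/var/spool/arc/jobstatus",
]


def make_head_dirs(root="/"):
    """Create any missing directories under root; return the ones created."""
    created = []
    for directory in HEAD_DIRS:
        path = os.path.join(root, directory.lstrip("/"))
        if not os.path.isdir(path):
            os.makedirs(path)
            created.append(path)
    return created
```

Calling make_head_dirs("/tmp/headtest") builds the tree under a scratch directory; calling it a second time creates nothing and returns an empty list.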

Head Additional Packages

These packages were needed to add the middleware required, i.e. ARC, Condor and ancillary material.

Additional Packages
Package Description
nordugrid-arc-compute-element The ARC CE Middleware
condor HT Condor, the main batch server package, 8.2.2
apel-client Accounting, ARC/Condor bypasses the APEL server and goes direct.


ca_policy_igtf-classic Certificates
lcas-plugins-basic Security
lcas-plugins-voms Security
lcas Security
lcmaps Security
lcmaps-plugins-basic Security
lcmaps-plugins-c-pep Security
lcmaps-plugins-verify-proxy Security
lcmaps-plugins-voms Security


globus-ftp-control Extra packages for Globus
globus-gsi-callback Extra packages for Globus


VomsSnooper VOMS Helper, used to set up the LSC (list of Certificates) files
glite-yaim-core Yaim; only used to make the accounts.
yum-plugin-priorities.noarch Helpers for Yum
yum-plugin-protectbase.noarch Helpers for Yum
yum-utils Helpers for Yum


Head Files

The following set of files were additionally installed. Some of them are empty. Some of them can be used as they are. Others have to be edited to fit your site. Any that is a script must have executable permissions (e.g. 755).


  • File: /root/scripts/set_defrag_parameters.sh
  • Notes: This script senses changes to the running and queueing job load, and sets parameters related to defragmentation. This allows the cluster to support a load consisting of both multicore and singlecore jobs.
  • Customise: Yes. You'll need to edit it to suit your site. BTW: I'm experimenting with a swanky new version that involves a rate controller. I'll report on that in due course.
  • Content:
#!/bin/bash
#
# Change condor_defrag daemon parameters depending on what's queued

function setDefrag () {

   # Get the address of the defrag daemon
   defrag_address=$(condor_status -any -autoformat MyAddress -constraint 'MyType =?= "Defrag"')

   # Log
   echo `date` " Setting DEFRAG_MAX_CONCURRENT_DRAINING=$3, DEFRAG_DRAINING_MACHINES_PER_HOUR=$4, DEFRAG_MAX_WHOLE_MACHINES=$5 (queued multicore=$1, running multicore=$2)"

   # Set configuration
   /usr/bin/condor_config_val -address "$defrag_address" -rset "DEFRAG_MAX_CONCURRENT_DRAINING = $3" >& /dev/null
   /usr/bin/condor_config_val -address "$defrag_address" -rset "DEFRAG_DRAINING_MACHINES_PER_HOUR = $4" >& /dev/null
   /usr/bin/condor_config_val -address "$defrag_address" -rset "DEFRAG_MAX_WHOLE_MACHINES = $5" >& /dev/null
   /usr/sbin/condor_reconfig -daemon defrag >& /dev/null
}

function cancel_draining_nodes () {
  # Get draining nodes
  for dn in `condor_status | grep Drained | sed -e "s/.*@//" -e "s/\..*//" `; do
    slot1=0
    condor_status -long $dn| while read line; do
  
      # Toggle if slot1@ (not slot1_...). slot1@ lists the empty (i.e. drained) total
      if [[ $line =~ ^Name.*slot1@.*$ ]]; then
        slot1=1
      fi
      if [[ $line =~ ^Name.*slot1_.*$ ]]; then
        slot1=0
      fi
    
      if [ $slot1 == 1 ]; then
        if [[ $line =~ ^Cpus\ \=\ (.*)$ ]]; then
  
          # We must capture empty/drained total
          cpus="${BASH_REMATCH[1]}"
          if [ $cpus -ge 8 ]; then
            # We have enough already. Pointless waiting longer.
            echo Cancel drain of $dn, as we have $cpus free already
            condor_drain -cancel $dn
          fi
        fi
      fi
    done
  done
}

queued_mc_jobs=$(condor_q -global -constraint 'RequestCpus == 8 && JobStatus == 1' -autoformat ClusterId | wc -l)

queued_sc_jobs=$(condor_q -global -constraint 'RequestCpus == 1 && JobStatus == 1' -autoformat ClusterId | wc -l)

running_mc_jobs=$(condor_q -global -constraint 'RequestCpus == 8 && JobStatus == 2' -autoformat ClusterId | wc -l)

running_sc_jobs=$(condor_q -global -constraint 'RequestCpus == 1 && JobStatus == 2' -autoformat ClusterId | wc -l)

queued_mc_slots=`expr $queued_mc_jobs \* 8`

queued_sc_slots=$queued_sc_jobs

# Ratio control
P_SETPOINT=0.5    # When the ratio between multicore and singlecore is more than this, take action

#CONSTANTS
C_MxWM=1000  # At max, pay no heed to how many whole systems
C_MxDH=3    # At max, kick off N per hour to drain
C_MxCD=2     # At max, never more than Xth of cluster should defrag at once (for goodness sake)

C_MnWM=6    # At min, don't bother if n already whole
C_MnDH=1    # At min, only start 1 per hour max
C_MnCD=1    # At min, don't bother if n already going

C_ZWM=0    # At zero, don't bother if 0 already whole
C_ZDH=0    # At zero, only start 0 per hour max
C_ZCD=0    # At zero, don't bother if 0 already going


if [ $queued_sc_slots -le 3 ]; then
  # Very few sc jobs. Max defrag.
  setDefrag $queued_mc_jobs $running_mc_jobs $C_MxCD $C_MxDH $C_MxWM
else
  if [ $queued_mc_slots -le 1 ]; then
    # More than a couple of sc jobs, and almost no mc jobs.
    # No defragging starts; cancel current defragging
    setDefrag $queued_mc_jobs $running_mc_jobs $C_ZCD $C_ZDH $C_ZWM
    cancel_draining_nodes
  else
    # More than a couple of sc jobs, and mc jobs 
    RATIO=`echo "$queued_mc_slots / $queued_sc_slots" | bc -l`
    RESULT=$(echo "${RATIO} > ${P_SETPOINT}" | bc -l )
    
    if [ $RESULT -eq 1 ]; then
      # Surplus of MC over SC, lots of defrag. 
      setDefrag $queued_mc_jobs $running_mc_jobs $C_MxCD $C_MxDH $C_MxWM    
    else
      # Not more MC than SC, little defrag
      setDefrag $queued_mc_jobs $running_mc_jobs $C_MnCD $C_MnDH $C_MnWM    
    fi
  fi
fi

# Raise priority of MC jobs
/root/scripts/condor_q_cores.pl > /tmp/c

# Put all the MC records in one file, with I jobs only
grep ^MC /tmp/c | grep ' I ' > /tmp/mc.c

# Go over those queued multicore jobs and up their prio
for j in `cat /tmp/mc.c | sed -e "s/\S*\s//" -e "s/ .*//"`; do condor_prio -p 6 $j; done
rm /tmp/c /tmp/mc.c


exit
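
The decision the script makes can be boiled down to one small function. The Python below is only an illustration of that logic, not a replacement for the script: the parameter values are copied from the constants above, the function and tuple names are made up, and the cancelling of drains and the priority boost are left out.

```python
# Parameter triples, in the order the script passes them to setDefrag:
# (DEFRAG_MAX_CONCURRENT_DRAINING,
#  DEFRAG_DRAINING_MACHINES_PER_HOUR,
#  DEFRAG_MAX_WHOLE_MACHINES)
MAX_DEFRAG = (2, 3, 1000)   # C_MxCD, C_MxDH, C_MxWM
MIN_DEFRAG = (1, 1, 6)      # C_MnCD, C_MnDH, C_MnWM
NO_DEFRAG = (0, 0, 0)       # C_ZCD,  C_ZDH,  C_ZWM


def choose_defrag(queued_mc_jobs, queued_sc_jobs,
                  cores_per_mc_job=8, setpoint=0.5):
    """Pick the defrag parameters for the current queue contents."""
    queued_mc_slots = queued_mc_jobs * cores_per_mc_job
    queued_sc_slots = queued_sc_jobs

    if queued_sc_slots <= 3:
        # Very few singlecore jobs queued: defrag as hard as allowed.
        return MAX_DEFRAG
    if queued_mc_slots <= 1:
        # Singlecore work waiting and no multicore demand: stop defragging.
        # (The shell script also cancels nodes that are already draining.)
        return NO_DEFRAG

    ratio = float(queued_mc_slots) / queued_sc_slots
    return MAX_DEFRAG if ratio > setpoint else MIN_DEFRAG


print(choose_defrag(1, 10))    # ratio 0.8, above the setpoint
print(choose_defrag(1, 100))   # ratio 0.08, below the setpoint
```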


  • File: /etc/arc.conf
  • Notes: The main configuration file of the ARC CE. It adds support for scaling factors, APEL reporting, ARGUS Mapping, BDII publishing (power and scaling), multiple VO support, and default limits.
  • Customise: Yes. You'll need to edit it to suit your site.
  • Content:
[common]
x509_user_key="/etc/grid-security/hostkey.pem"
x509_user_cert="/etc/grid-security/hostcert.pem"
x509_cert_dir="/etc/grid-security/certificates"
gridmap="/etc/grid-security/grid-mapfile"
lrms="condor" 

[grid-manager]
debug="1"
enable_emies_interface="yes"
arex_mount_point="https://hepgrid2.ph.liv.ac.uk:443/arex"
user="root"
controldir="/var/spool/arc/jobstatus"
sessiondir="/var/spool/arc/grid"
runtimedir="/etc/arc/runtime"
logfile="/var/log/arc/grid-manager.log"
pidfile="/var/run/grid-manager.pid"
joblog="/var/log/arc/gm-jobs.log"
shared_filesystem="no" 
authplugin="PREPARING timeout=60,onfailure=pass,onsuccess=pass /usr/local/bin/default_rte_plugin.py %S %C %I ENV/GLITE"
authplugin="FINISHING timeout=60,onfailure=pass,onsuccess=pass /usr/local/bin/scaling_factors_plugin.py %S %C %I"
# This copies the files containing useful output from completed jobs into a directory /var/spool/arc/debugging 
#authplugin="FINISHED timeout=60,onfailure=pass,onsuccess=pass /usr/local/bin/debugging_rte_plugin.py %S %C %I"

mail="root@hep.ph.liv.ac.uk"
jobreport="APEL:http://mq.cro-ngi.hr:6162"
jobreport_options="urbatch:1000,archiving:/var/run/arc/urs,topic:/queue/global.accounting.cpu.central,gocdb_name:UKI-NORTHGRID-LIV-HEP,use_ssl:true,Network:PROD,benchmark_type:Si2k,benchmark_value:2500.00"
jobreport_credentials="/etc/grid-security/hostkey.pem /etc/grid-security/hostcert.pem /etc/grid-security/certificates"
jobreport_publisher="jura_dummy"
# Disable (1 month !)
jobreport_period=2500000

[gridftpd]

user="root"
debug="1"
logfile="/var/log/arc/gridftpd.log"
pidfile="/var/run/gridftpd.pid"
port="2811"
allowunknown="yes"
globus_tcp_port_range="20000,24999"
globus_udp_port_range="20000,24999"

#
# Notes:
#
# The first two args are implicitly given to arc-lcmaps, and are
#    argv[1] - the subject/DN
#    argv[2] - the proxy file
#
# The remaining attributes are explicit, after the "lcmaps" field in the examples below.
#    argv[3] - lcmaps_library
#    argv[4] - lcmaps_dir
#    argv[5] - lcmaps_db_file
#    argv[6 etc.] - policynames
#
# lcmaps_dir and/or lcmaps_db_file may be '*', in which case they are
# fully truncated (placeholders).
#
# Some logic is applied. If the lcmaps_library is not specified with a
# full path, it is given the path of the lcmaps_dir. We have to assume that
# the lcmaps_dir is a poor name for that field, as discussed in the following
# examples.
#
# Examples:
#   In this example, used at RAL, the liblcmaps.so is given no
#   path, so it is assumed to exist in /usr/lib64 (note the poorly
#   named field - the lcmaps_dir is populated by a library path.)
#
# Fieldnames:      lcmaps_lib   lcmaps_dir lcmaps_db_file            policy
#unixmap="* lcmaps liblcmaps.so /usr/lib64 /usr/etc/lcmaps/lcmaps.db arc"
#
#   In the next example, used at Liverpool, lcmaps_lib is fully qualified. Thus
#   the lcmaps_dir is not used (although it does set the LCMAPS_DIR env var).
#   In this case, the lcmaps_dir really does contain the lcmaps dir location.
#
# Fieldnames:      lcmaps_lib              lcmaps_dir  lcmaps_db_file policy
unixmap="* lcmaps  /usr/lib64/liblcmaps.so /etc/lcmaps lcmaps.db      arc"
unixmap="arcfailnonexistentaccount:arcfailnonexistentaccount all"


[gridftpd/jobs]
path="/jobs"
plugin="jobplugin.so"
allownew="yes" 

[infosys]
user="root"
overwrite_config="yes"
port="2135"
debug="1"
registrationlog="/var/log/arc/inforegistration.log"
providerlog="/var/log/arc/infoprovider.log"
provider_loglevel="2"
infosys_glue12="enable"
infosys_glue2_ldap="enable"

[infosys/glue12]
resource_location="Liverpool, UK"
resource_longitude="-2.964"
resource_latitude="53.4035"
glue_site_web="http://www.gridpp.ac.uk/northgrid/liverpool"
glue_site_unique_id="UKI-NORTHGRID-LIV-HEP"
cpu_scaling_reference_si00="2970"
processor_other_description="Cores=5.72,Benchmark=11.88-HEP-SPEC06"
provide_glue_site_info="false"

[infosys/admindomain]
name="UKI-NORTHGRID-LIV-HEP"

# infosys view of the computing cluster (service)
[cluster]
name="hepgrid2.ph.liv.ac.uk"
localse="hepgrid11.ph.liv.ac.uk"
cluster_alias="hepgrid2 (UKI-NORTHGRID-LIV-HEP)"
comment="UKI-NORTHGRID-LIV-HEP Main Grid Cluster"
homogeneity="True"
nodecpu="Intel(R) Xeon(R) CPU L5420 @ 2.50GHz"
architecture="x86_64"
nodeaccess="inbound"
nodeaccess="outbound"
#opsys="SL64"
opsys="ScientificSL : 6.4 : Carbon"
nodememory="3000"

authorizedvo="alice"
authorizedvo="atlas"
authorizedvo="biomed"
authorizedvo="calice"
authorizedvo="camont"
authorizedvo="cdf"
authorizedvo="cernatschool.org"
authorizedvo="cms"
authorizedvo="dteam"
authorizedvo="dzero"
authorizedvo="epic.vo.gridpp.ac.uk"
authorizedvo="esr"
authorizedvo="fusion"
authorizedvo="geant4"
authorizedvo="gridpp"
authorizedvo="hone"
authorizedvo="hyperk.org"
authorizedvo="ilc"
authorizedvo="lhcb"
authorizedvo="lsst"
authorizedvo="magic"
authorizedvo="mice"
authorizedvo="na62.vo.gridpp.ac.uk"
authorizedvo="neiss.org.uk"
authorizedvo="ops"
authorizedvo="pheno"
authorizedvo="planck"
authorizedvo="snoplus.snolab.ca"
authorizedvo="t2k.org"
authorizedvo="vo.northgrid.ac.uk"
authorizedvo="vo.sixt.cern.ch"
authorizedvo="zeus"

benchmark="SPECINT2000 2970"
benchmark="SPECFP2000 2970"
totalcpus=652

[queue/grid]
name="grid"
homogeneity="True"
comment="Default queue"
nodecpu="adotf"
architecture="adotf"
defaultmemory=3000
maxrunning=1400
totalcpus=652
maxuserrun=1400
maxqueuable=2800
#maxcputime=2880
#maxwalltime=2880
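
Some of the numbers published in arc.conf appear to follow from the worker node table. The sketch below reproduces them. The factor of 250 SI2K per HEP-SPEC06 is the usual conversion convention, which I am assuming here rather than quoting from the configuration, and the variable names are made up.

```python
# Totals taken from the TOTALS row of the worker node table.
TOTAL_SLOTS = 652
TOTAL_CPUS = 114
TOTAL_HEPSPEC = 7745

# Assumed conversion convention: 1 HEP-SPEC06 = 250 SI2K.
SI2K_PER_HS06 = 250

# Average slots ("cores") per physical CPU, as in Cores=5.72.
cores_per_cpu = round(float(TOTAL_SLOTS) / TOTAL_CPUS, 2)

# Average HEP-SPEC06 per slot, as in Benchmark=11.88-HEP-SPEC06.
hs06_per_slot = round(float(TOTAL_HEPSPEC) / TOTAL_SLOTS, 2)

# The same power in SI2K, as in cpu_scaling_reference_si00="2970".
reference_si00 = int(round(hs06_per_slot * SI2K_PER_HS06))

print(cores_per_cpu, hs06_per_slot, reference_si00)  # 5.72 11.88 2970
```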


  • File: /etc/arc/runtime/ENV/GLITE
  • Notes: The GLITE runtime environment.
  • Content:
#!/bin/sh

#export LD_LIBRARY_PATH=/opt/xrootd/lib
export GLOBUS_LOCATION=/usr

if [ "x$1" = "x0" ]; then
  # Set environment variable containing queue name
  env_idx=0
  env_var="joboption_env_$env_idx"
  while [ -n "${!env_var}" ]; do
     env_idx=$((env_idx+1))
     env_var="joboption_env_$env_idx"
  done 
  eval joboption_env_$env_idx="NORDUGRID_ARC_QUEUE=$joboption_queue"
	
  export RUNTIME_ENABLE_MULTICORE_SCRATCH=1

fi

if [ "x$1" = "x1" ]; then
  # Set grid environment
  if [ -e /etc/profile.d/env.sh ]; then
     source /etc/profile.d/env.sh
  fi 
  if [ -e /etc/profile.d/zz-env.sh ]; then
     source /etc/profile.d/zz-env.sh
  fi
  export LD_LIBRARY_PATH=/opt/xrootd/lib

  # Set basic environment variables
  export GLOBUS_LOCATION=/usr
  HOME=`pwd`
  export HOME
  USER=`whoami`
  export USER
  HOSTNAME=`hostname -f`
  export HOSTNAME
fi

export VO_ALICE_SW_DIR=/opt/exp_soft_sl5/alice
export VO_ATLAS_SW_DIR=/cvmfs/atlas.cern.ch/repo/sw
export VO_BIOMED_SW_DIR=/opt/exp_soft_sl5/biomed
export VO_CALICE_SW_DIR=/opt/exp_soft_sl5/calice
export VO_CAMONT_SW_DIR=/opt/exp_soft_sl5/camont
export VO_CDF_SW_DIR=/opt/exp_soft_sl5/cdf
export VO_CERNATSCHOOL_ORG_SW_DIR=/opt/exp_soft_sl5/cernatschool
export VO_CMS_SW_DIR=/opt/exp_soft_sl5/cms
export VO_DTEAM_SW_DIR=/opt/exp_soft_sl5/dteam
export VO_DZERO_SW_DIR=/opt/exp_soft_sl5/dzero
export VO_EPIC_VO_GRIDPP_AC_UK_SW_DIR=/opt/exp_soft_sl5/epic
export VO_ESR_SW_DIR=/opt/exp_soft_sl5/esr
export VO_FUSION_SW_DIR=/opt/exp_soft_sl5/fusion
export VO_GEANT4_SW_DIR=/opt/exp_soft_sl5/geant4
export VO_GRIDPP_SW_DIR=/opt/exp_soft_sl5/gridpp
export VO_HONE_SW_DIR=/cvmfs/hone.egi.eu
export VO_HYPERK_ORG_SW_DIR=/cvmfs/hyperk.egi.eu
export VO_ILC_SW_DIR=/cvmfs/ilc.desy.de
export VO_LHCB_SW_DIR=/cvmfs/lhcb.cern.ch
export VO_LSST_SW_DIR=/opt/exp_soft_sl5/lsst
export VO_MAGIC_SW_DIR=/opt/exp_soft_sl5/magic
export VO_MICE_SW_DIR=/cvmfs/mice.egi.eu
export VO_NA62_VO_GRIDPP_AC_UK_SW_DIR=/cvmfs/na62.egi.eu
export VO_NEISS_ORG_UK_SW_DIR=/opt/exp_soft_sl5/neiss
export VO_OPS_SW_DIR=/opt/exp_soft_sl5/ops
export VO_PHENO_SW_DIR=/opt/exp_soft_sl5/pheno
export VO_PLANCK_SW_DIR=/opt/exp_soft_sl5/planck
export VO_SNOPLUS_SNOLAB_CA_SW_DIR=/cvmfs/snoplus.egi.eu
export VO_T2K_ORG_SW_DIR=/cvmfs/t2k.egi.eu
export VO_VO_NORTHGRID_AC_UK_SW_DIR=/opt/exp_soft_sl5/northgrid
export VO_VO_SIXT_CERN_CH_SW_DIR=/opt/exp_soft_sl5/sixt
export VO_ZEUS_SW_DIR=/opt/exp_soft_sl5/zeus

export RUCIO_HOME=/cvmfs/atlas.cern.ch/repo/sw/ddm/rucio-clients/0.1.12
export RUCIO_AUTH_TYPE=x509_proxy 

export LCG_GFAL_INFOSYS="lcg-bdii.gridpp.ac.uk:2170,topbdii.grid.hep.ph.ic.ac.uk:2170"
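
The stage 0 part of the script walks joboption_env_0, joboption_env_1 and so on until it finds the first one that is not set, then puts the queue name there. The same idea in Python, purely as an illustration (a dict stands in for the shell variables, and the function name is made up):

```python
def append_job_env(joboptions, assignment):
    """Store assignment in the first unused joboption_env_N; return N."""
    idx = 0
    # An unset or empty entry ends the search, like [ -n "..." ] in the shell.
    while joboptions.get("joboption_env_%d" % idx):
        idx += 1
    joboptions["joboption_env_%d" % idx] = assignment
    return idx


# Example: two variables already set by the job, queue name is "grid".
options = {
    "joboption_env_0": "FOO=1",
    "joboption_env_1": "BAR=2",
}
slot = append_job_env(options, "NORDUGRID_ARC_QUEUE=grid")
print(slot, options["joboption_env_2"])
```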


  • File: /etc/condor/config.d/14accounting-groups-map.config
  • Notes: Implements accounting groups, so that fairshares can be used that refer to whole groups of users, instead of individual ones.
  • Customise: Yes. You'll need to edit it to suit your site.
  • Content:
# Liverpool Tier-2 HTCondor configuration: accounting groups 

# Primary group
# Assign individual test submitters into the HIGHPRIO group, 
# else just assign job into primary group of its VO
LivAcctGroup = ifThenElse(regexp("sgmatl34",Owner),         "group_HIGHPRIO", \
               ifThenElse(regexp("sgmops11",Owner),         "group_HIGHPRIO", \
               strcat("group_",toUpper(x509UserProxyVOName))))

# Subgroup
# For the subgroup, just assign job to the group of the owner (i.e. owner name less all those digits at the end).
# Also show whether multi or single core.
LivAcctSubGroup = strcat(regexps("([A-Za-z0-9]+[A-Za-z])\d+", Owner, "\1"),ifThenElse(RequestCpus > 1,"_mcore","_score"))

# Now build up the whole accounting group
AccountingGroup = strcat(LivAcctGroup, ".", LivAcctSubGroup, ".", Owner)

# Add these ClassAd specifications to the submission expressions
SUBMIT_EXPRS = $(SUBMIT_EXPRS) LivAcctGroup, LivAcctSubGroup, AccountingGroup 
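
To see what these expressions produce, here is the same logic written out in Python. It is a sketch for illustration only: the function name is made up, the example owner names are invented pool accounts, and returning the owner unchanged when the pattern does not match is my assumption (the ClassAd function may behave differently in that case).

```python
import re

# Test submitters that go into the high-priority group (from the config).
HIGHPRIO_OWNERS = ("sgmatl34", "sgmops11")


def accounting_group(owner, vo_name, request_cpus):
    """Build group.subgroup.owner in the way the config above does."""
    if any(re.search(pattern, owner) for pattern in HIGHPRIO_OWNERS):
        group = "group_HIGHPRIO"
    else:
        group = "group_" + vo_name.upper()

    # Owner name less the digits at the end, e.g. atlas012 -> atlas.
    match = re.search(r"([A-Za-z0-9]+[A-Za-z])\d+", owner)
    base = match.group(1) if match else owner  # fallback is an assumption

    subgroup = base + ("_mcore" if request_cpus > 1 else "_score")
    return ".".join([group, subgroup, owner])


print(accounting_group("atlas012", "atlas", 8))
print(accounting_group("sgmatl34", "atlas", 1))
```

With those inputs the first call gives group_ATLAS.atlas_mcore.atlas012 and the second gives group_HIGHPRIO.sgmatl_score.sgmatl34.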


  • File: /etc/condor/config.d/11fairshares.config
  • Notes: Implements fair share settings, relying on groups of users.
  • Customise: Yes. You'll need to edit it to suit your site.
  • Content:
# Liverpool Tier-2 HTCondor configuration: fairshares

# use this to stop jobs from starting.
# CONCURRENCY_LIMIT_DEFAULT = 0

# Half-life of user priorities
PRIORITY_HALFLIFE = 259200

# Handle surplus
GROUP_ACCEPT_SURPLUS = True
GROUP_AUTOREGROUP = True

# Weight slots using CPUs
#NEGOTIATOR_USE_SLOT_WEIGHTS = True

# See: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3271
NEGOTIATOR_ALLOW_QUOTA_OVERSUBSCRIPTION = False

# Calculate the surplus allocated to each group correctly
NEGOTIATOR_USE_WEIGHTED_DEMAND = True

# Group names
GROUP_NAMES = \
	group_HIGHPRIO,  \
	group_ALICE,  \
	group_ATLAS,  \
	group_BIOMED,  \
	group_CALICE,  \
	group_CAMONT,  \
	group_CDF,  \
        group_LSST,  \
	group_CERNATSCHOOL_ORG,  \
	group_CMS,  \
	group_DTEAM,  \
	group_DZERO,  \
	group_EPIC_VO_GRIDPP_AC_UK,  \
	group_ESR,  \
	group_FUSION,  \
	group_GEANT4,  \
	group_GRIDPP,  \
	group_HONE,  \
	group_HYPERK_ORG,  \
	group_ILC,  \
	group_LHCB,  \
	group_MAGIC,  \
	group_MICE,  \
	group_NA62_VO_GRIDPP_AC_UK,  \
	group_NEISS_ORG_UK,  \
	group_OPS,  \
	group_PHENO,  \
	group_PLANCK,  \
	group_SNOPLUS_SNOLAB_CA,  \
	group_T2K_ORG,  \
	group_VO_NORTHGRID_AC_UK,  \
	group_VO_SIXT_CERN_CH,  \
	group_ZEUS,  \

# Fairshares
GROUP_QUOTA_DYNAMIC_group_HIGHPRIO  = 0.05

GROUP_QUOTA_DYNAMIC_group_ALICE =  0.05
GROUP_QUOTA_DYNAMIC_group_ATLAS =  0.65
GROUP_QUOTA_DYNAMIC_group_BIOMED =  0.01
GROUP_QUOTA_DYNAMIC_group_CALICE =  0.01
GROUP_QUOTA_DYNAMIC_group_CAMONT =  0.01
GROUP_QUOTA_DYNAMIC_group_CDF =  0.01
GROUP_QUOTA_DYNAMIC_group_LSST =  0.01
GROUP_QUOTA_DYNAMIC_group_CERNATSCHOOL_ORG =  0.01
GROUP_QUOTA_DYNAMIC_group_CMS =  0.01
GROUP_QUOTA_DYNAMIC_group_DTEAM =  0.01
GROUP_QUOTA_DYNAMIC_group_DZERO =  0.01
GROUP_QUOTA_DYNAMIC_group_EPIC_VO_GRIDPP_AC_UK =  0.01
GROUP_QUOTA_DYNAMIC_group_ESR =  0.01
GROUP_QUOTA_DYNAMIC_group_FUSION =  0.01
GROUP_QUOTA_DYNAMIC_group_GEANT4 =  0.01
GROUP_QUOTA_DYNAMIC_group_GRIDPP =  0.01
GROUP_QUOTA_DYNAMIC_group_HONE =  0.01
GROUP_QUOTA_DYNAMIC_group_HYPERK_ORG =  0.01
GROUP_QUOTA_DYNAMIC_group_ILC =  0.01
GROUP_QUOTA_DYNAMIC_group_LHCB =  0.20
GROUP_QUOTA_DYNAMIC_group_MAGIC =  0.01
GROUP_QUOTA_DYNAMIC_group_MICE =  0.01
GROUP_QUOTA_DYNAMIC_group_NA62_VO_GRIDPP_AC_UK =  0.01
GROUP_QUOTA_DYNAMIC_group_NEISS_ORG_UK =  0.01
GROUP_QUOTA_DYNAMIC_group_OPS =  0.01
GROUP_QUOTA_DYNAMIC_group_PHENO =  0.01
GROUP_QUOTA_DYNAMIC_group_PLANCK =  0.01
GROUP_QUOTA_DYNAMIC_group_SNOPLUS_SNOLAB_CA =  0.01
GROUP_QUOTA_DYNAMIC_group_T2K_ORG =  0.01
GROUP_QUOTA_DYNAMIC_group_VO_NORTHGRID_AC_UK =  0.01
GROUP_QUOTA_DYNAMIC_group_VO_SIXT_CERN_CH =  0.01
GROUP_QUOTA_DYNAMIC_group_ZEUS =  0.01
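
It is worth adding the quotas up. The sketch below does that and shows what each share would be if the quotas were scaled down in proportion to sum to one. Whether and how the negotiator rescales dynamic quotas that add up to more than 1.0 should be checked in the Condor manual; the proportional scaling here is my assumption, and the names are made up.

```python
# Quotas that differ from the default, copied from the config above.
NAMED_QUOTAS = {
    "group_HIGHPRIO": 0.05,
    "group_ALICE": 0.05,
    "group_ATLAS": 0.65,
    "group_LHCB": 0.20,
}
# The remaining 29 groups all have a quota of 0.01.
OTHER_GROUP_COUNT = 29
OTHER_QUOTA = 0.01

TOTAL_QUOTA = sum(NAMED_QUOTAS.values()) + OTHER_GROUP_COUNT * OTHER_QUOTA


def normalised_share(quota):
    """Share of the pool if quotas are scaled in proportion (assumption)."""
    return quota / TOTAL_QUOTA


print(round(TOTAL_QUOTA, 2))
for name in sorted(NAMED_QUOTAS):
    print(name, round(normalised_share(NAMED_QUOTAS[name]), 3))
```

The quotas as listed add up to 1.24, so under proportional scaling the ATLAS share would come out at a little over half of the pool rather than 65%.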


  • File: /etc/condor/pool_password
  • Notes: Will have its own section (TBD)
  • Customise: Yes.
  • Content:


 Password Authentication
 The password method provides mutual authentication through the use of a shared
 secret. This is often a good choice when strong security is desired, but an existing
 Kerberos or X.509 infrastructure is not in place. Password authentication
 is available on both Unix and Windows. It currently can only be used for
 daemon-to-daemon authentication. The shared secret in this context is referred to as
 the pool password. Before a daemon can use password authentication, the pool
 password must be stored on the daemon's local machine. On Unix, the password will
 be placed in a file defined by the configuration variable SEC_PASSWORD_FILE. This file
 will be accessible only by the UID that HTCondor is started as. On Windows, the same
 secure password store that is used for user passwords will be used for the pool
 password (see section 7.2.3). Under Unix, the password file can be generated by
 using the following command to write directly to the password file:
 condor_store_cred -f /path/to/password/file
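
Since the password file should only be accessible to the UID that Condor starts as, a quick check of its permission bits does no harm. This is a minimal sketch with a made-up function name; it looks only at the group and other bits and does not check who owns the file.

```python
import os
import stat


def password_file_is_private(path):
    """True if neither group nor others have any access to the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
```

On the head node it would be called as password_file_is_private("/etc/condor/pool_password").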
 
  • File: /etc/condor/condor_config.local
  • Notes: The main client CONDOR configuration custom file.
  • Customise: Yes. You'll need to edit it to suit your site.
  • Content:
##  What machine is your central manager?

CONDOR_HOST = $(FULL_HOSTNAME)

## Pool's short description

COLLECTOR_NAME = Condor at $(FULL_HOSTNAME)

##  When is this machine willing to start a job? 

START = FALSE

##  When to suspend a job?

SUSPEND = FALSE

##  When to nicely stop a job?
# When a job is running and the PREEMPT expression evaluates to True, the
# condor_startd will evict the job. The PREEMPT expression should reflect the
# requirements under which the machine owner will not permit a job to continue to run.
# For example, a policy to evict a currently running job when a key is hit or when
# it is the 9:00am work arrival time, would be expressed in the PREEMPT expression
# and enforced by the condor_startd.

PREEMPT = FALSE

# If there is a job from a higher priority user sitting idle, the
# condor_negotiator daemon may evict a currently running job submitted
# from a lower priority user if PREEMPTION_REQUIREMENTS is True.

PREEMPTION_REQUIREMENTS = FALSE

# No job has pref over any other

#RANK = FALSE

##  When to instantaneously kill a preempting job
##  (e.g. if a job is in the pre-empting stage for too long)

KILL = FALSE

##  This macro determines what daemons the condor_master will start and keep its watchful eyes on.
##  The list is a comma or space separated list of subsystem names

DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD

#######################################
# Andrew Lahiff's scaling

MachineRalScaling = "$$([ifThenElse(isUndefined(RalScaling), 1.00, RalScaling)])"
MachineRalNodeLabel = "$$([ifThenElse(isUndefined(RalNodeLabel), 1.00, RalNodeLabel)])"
SUBMIT_EXPRS = $(SUBMIT_EXPRS) MachineRalScaling MachineRalNodeLabel
 
#######################################
# Andrew Lahiff's security

ALLOW_WRITE = 

UID_DOMAIN = ph.liv.ac.uk

CENTRAL_MANAGER1 = hepgrid2.ph.liv.ac.uk
COLLECTOR_HOST = $(CENTRAL_MANAGER1)

# Central managers
CMS = condor_pool@$(UID_DOMAIN)/hepgrid2.ph.liv.ac.uk

# CEs
CES = condor_pool@$(UID_DOMAIN)/hepgrid2.ph.liv.ac.uk

# Worker nodes
WNS = condor_pool@$(UID_DOMAIN)/192.168.*

# Users
USERS = *@$(UID_DOMAIN)
USERS = *

# Required for HA
HOSTALLOW_NEGOTIATOR = $(COLLECTOR_HOST)
HOSTALLOW_ADMINISTRATOR = $(COLLECTOR_HOST)
HOSTALLOW_NEGOTIATOR_SCHEDD = $(COLLECTOR_HOST)

# Authorization
HOSTALLOW_WRITE =
ALLOW_READ = */*.ph.liv.ac.uk
NEGOTIATOR.ALLOW_WRITE = $(CES), $(CMS)
COLLECTOR.ALLOW_ADVERTISE_MASTER = $(CES), $(CMS), $(WNS)
COLLECTOR.ALLOW_ADVERTISE_SCHEDD = $(CES)
COLLECTOR.ALLOW_ADVERTISE_STARTD = $(WNS)
SCHEDD.ALLOW_WRITE = $(USERS)
SHADOW.ALLOW_WRITE = $(WNS), $(CES)
ALLOW_DAEMON = condor_pool@$(UID_DOMAIN)/*.ph.liv.ac.uk, $(FULL_HOSTNAME)
ALLOW_ADMINISTRATOR = root@$(UID_DOMAIN)/$(IP_ADDRESS), condor_pool@$(UID_DOMAIN)/$(IP_ADDRESS), $(CMS)
ALLOW_CONFIG = root@$(FULL_HOSTNAME)

# Don't allow nobody to run jobs
SCHEDD.DENY_WRITE = nobody@$(UID_DOMAIN)

# Authentication
SEC_PASSWORD_FILE = /etc/condor/pool_password
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_READ_AUTHENTICATION = OPTIONAL
SEC_CLIENT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = PASSWORD,FS
SCHEDD.SEC_WRITE_AUTHENTICATION_METHODS = FS,PASSWORD
SCHEDD.SEC_DAEMON_AUTHENTICATION_METHODS = FS,PASSWORD
SEC_CLIENT_AUTHENTICATION_METHODS = FS,PASSWORD,CLAIMTOBE
SEC_READ_AUTHENTICATION_METHODS = FS,PASSWORD,CLAIMTOBE

# Integrity
SEC_DEFAULT_INTEGRITY  = REQUIRED
SEC_DAEMON_INTEGRITY = REQUIRED
SEC_NEGOTIATOR_INTEGRITY = REQUIRED

# Multicore
DAEMON_LIST = $(DAEMON_LIST) DEFRAG

DEFRAG_SCHEDULE = graceful

DEFRAG_INTERVAL = 90  
DEFRAG_MAX_CONCURRENT_DRAINING = 1 
DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0
DEFRAG_MAX_WHOLE_MACHINES = 4

## Allow some defrag configuration to be settable
DEFRAG.SETTABLE_ATTRS_ADMINISTRATOR = DEFRAG_MAX_CONCURRENT_DRAINING,DEFRAG_DRAINING_MACHINES_PER_HOUR,DEFRAG_MAX_WHOLE_MACHINES
ENABLE_RUNTIME_CONFIG = TRUE

# The defrag depends on the number of spares already present, biased towards systems with many cpus
DEFRAG_RANK = Cpus * pow(TotalCpus,(1.0 / 2.0))

# Definition of a "whole" machine:
DEFRAG_WHOLE_MACHINE_EXPR =  Cpus >= 8 && StartJobs =?= True && RalNodeOnline =?= True

# Cancel once we have 8
DEFRAG_CANCEL_REQUIREMENTS = Cpus >= 8 

# Decide which slots can be drained
DEFRAG_REQUIREMENTS = PartitionableSlot && StartJobs =?= True && RalNodeOnline =?= True

## Logs
MAX_DEFRAG_LOG = 104857600
MAX_NUM_DEFRAG_LOG = 10

#DEFRAG_DEBUG = D_FULLDEBUG

#NEGOTIATOR_DEBUG        = D_FULLDEBUG

# Port limits
HIGHPORT = 65000
LOWPORT = 20000

# History
HISTORY = $(SPOOL)/history
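
To get a feel for DEFRAG_RANK and the "whole machine" test, here they are as plain Python. This is just an illustration with made-up function names; I am reading Cpus as the free CPUs on the partitionable slot and TotalCpus as the size of the machine, and the flags stand in for the StartJobs and RalNodeOnline attributes.

```python
import math


def defrag_rank(cpus, total_cpus):
    """DEFRAG_RANK = Cpus * pow(TotalCpus, 1.0 / 2.0)."""
    return cpus * math.pow(total_cpus, 1.0 / 2.0)


def is_whole_machine(cpus, start_jobs=True, node_online=True):
    """DEFRAG_WHOLE_MACHINE_EXPR: at least 8 free CPUs on a usable node."""
    return cpus >= 8 and start_jobs and node_online


# With 4 CPUs already free, a 16-slot node outranks a 10-slot node,
# so it would be picked for draining first.
print(defrag_rank(4, 16))
print(defrag_rank(4, 10))
print(is_whole_machine(8), is_whole_machine(7))
```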

 

  • File: /etc/ld.so.conf.d/condor.conf
  • Notes: CONDOR needed this to access its libraries. I had to run “ldconfig” to make it take hold.
  • Customise: Maybe not.
  • Content:
/usr/lib64/condor/

  • File: /usr/local/bin/scaling_factors_plugin.py
  • Notes: This implements another part of the scaling factor logic.
  • Customise: It should be generic.
  • Content:
#!/usr/bin/python
# Copyright 2014 Science and Technology Facilities Council
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#  http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Usage: scaling_factors_plugin.py <status> <control dir> <jobid>

Authplugin for FINISHING STATE

Example:

  authplugin="FINISHING timeout=60,onfailure=pass,onsuccess=pass /usr/local/bin/scaling_factors_plugin.py %S %C %I"

"""

# The usage text above has to be the first statement in the file so that
# __doc__ is set; main() prints it when the argument count is wrong.
import re
from os.path import isfile
import shutil
import datetime
import time
import os

def ExitError(msg,code):
    """Print error message and exit"""
    from sys import exit
    print(msg)
    exit(code)

def GetScalingFactor(control_dir, jobid):

    errors_file = '%s/job.%s.errors' %(control_dir,jobid)

    if not isfile(errors_file):
       ExitError("No such errors file: %s"%errors_file,1)

    f = open(errors_file)
    errors = f.read()
    f.close()

    scaling = -1

    m = re.search(r'MATCH_EXP_MachineRalScaling = "([\dE\+\-\.]+)"', errors)
    if m:
       scaling = float(m.group(1))

    return scaling


def SetScaledTimes(control_dir, jobid):

    scaling_factor = GetScalingFactor(control_dir, jobid)

    diag_file = '%s/job.%s.diag' %(control_dir,jobid)


    if not isfile(diag_file):
       ExitError("No such diag file: %s"%diag_file,1)

    f = open(diag_file)
    lines = f.readlines()
    f.close()

    newlines = []

    types = ['WallTime=', 'UserTime=', 'KernelTime=']

    for line in lines:
       for type in types:
          if type in line and scaling_factor > 0:
             m = re.search('=(\d+)s', line)
             if m:
                scaled_time = int(float(m.group(1))*scaling_factor)
                line = type + str(scaled_time) + 's\n'

       newlines.append(line)

    fw = open(diag_file, "w")
    fw.writelines(newlines)
    fw.close()
    # Save a copy. Use this for the DAPDUMP analyser.
    #tstamp = datetime.datetime.fromtimestamp(time.time()).strftime('%Y%m%d%H%M%S')
    #dest = '/var/log/arc/diagfiles/' + tstamp + '_' + os.path.basename(diag_file)
    #shutil.copy2(diag_file, dest)

    return 0


def main():
    """Main"""

    import sys

    # Parse arguments

    if len(sys.argv) == 4:
        (exe, status, control_dir, jobid) = sys.argv
    else:
        ExitError("Wrong number of arguments\n"+__doc__,1)

    if status == "FINISHING":
        SetScaledTimes(control_dir, jobid)
        sys.exit(0)

    sys.exit(1)

if __name__ == "__main__":
    main()
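
As an illustration of the scaling step performed by SetScaledTimes above, the following sketch applies the same rule to a list of strings instead of rewriting a .diag file. The helper name scale_diag_lines and the sample lines are made up for this example; the factor 1.229 is the RalScaling value given for the X5650 nodes later on this page.

```python
import re

# Hypothetical helper for illustration only. It mirrors the loop in
# SetScaledTimes but works on a list of strings rather than a file.
TIME_KEYS = ('WallTime=', 'UserTime=', 'KernelTime=')

def scale_diag_lines(lines, scaling_factor):
    """Return a copy of lines with the time fields multiplied by scaling_factor."""
    scaled = []
    for line in lines:
        for key in TIME_KEYS:
            # A factor of -1 means "no scaling found", so lines pass through untouched.
            if key in line and scaling_factor > 0:
                m = re.search(r'=(\d+)s', line)
                if m:
                    seconds = int(float(m.group(1)) * scaling_factor)
                    line = key + str(seconds) + 's\n'
        scaled.append(line)
    return scaled

# Made-up sample content, not taken from a real job.
sample = ['WallTime=100s\n', 'UserTime=80s\n', 'KernelTime=4s\n', 'exitcode=0\n']
result = scale_diag_lines(sample, 1.229)
print(''.join(result))
```

Note that the scaled times are truncated to whole seconds, as in the plugin itself.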

  • File: /usr/local/bin/debugging_rte_plugin.py
  • Notes: Useful for capturing debug output.
  • Customise: It should be generic.
  • Content:

#!/usr/bin/python

# This copies the files containing useful output from completed jobs into a directory 

import shutil

"""Usage: debugging_rte_plugin.py <status> <control dir> <jobid>

Authplugin for FINISHED STATE

Example:

  authplugin="FINISHED timeout=60,onfailure=pass,onsuccess=pass /usr/local/bin/debugging_rte_plugin.py %S %C %I"

"""

def ExitError(msg,code):
    """Print error message and exit"""
    from sys import exit
    print(msg)
    exit(code)

def ArcDebuggingL(control_dir, jobid):

    from os.path import isfile
   
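    try:
        m = open("/var/spool/arc/debugging/msgs", 'a')
    except IOError as err:
        # Without the message log there is nothing useful to do, so stop here
        # rather than fail later with an undefined file handle.
        ExitError("Cannot open message log: %s" % err.strerror, 1)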


    local_file = '%s/job.%s.local' %(control_dir,jobid)
    grami_file = '%s/job.%s.grami' %(control_dir,jobid)

    if not isfile(local_file):
       ExitError("No such description file: %s"%local_file,1)

    if not isfile(grami_file):
       ExitError("No such description file: %s"%grami_file,1)

    lf = open(local_file)
    local = lf.read()
    lf.close()

    if 'Organic Units' in local or 'stephen jones' in local:
        shutil.copy2(grami_file, '/var/spool/arc/debugging')

        f = open(grami_file)
        grami = f.readlines()
        f.close()
    
        for line in grami:
            m.write(line)
            if 'joboption_directory' in line:
               comment = line[line.find("'")+1:line.find("'",line.find("'")+1)]+'.comment'
               shutil.copy2(comment, '/var/spool/arc/debugging')
            if 'joboption_stdout' in line:
               mystdout = line[line.find("'")+1:line.find("'",line.find("'")+1)]
               m.write("Try Copy mystdout - " + mystdout + "\n")
               if isfile(mystdout):
                 m.write("Copy mystdout - " + mystdout + "\n")
                 shutil.copy2(mystdout, '/var/spool/arc/debugging')
               else:
                 m.write("mystdout gone - " + mystdout + "\n")
            if 'joboption_stderr' in line:
               mystderr = line[line.find("'")+1:line.find("'",line.find("'")+1)]
               m.write("Try Copy mystderr - " + mystderr + "\n")
               if isfile(mystderr):
                 m.write("Copy mystderr - " + mystderr + "\n")
                 shutil.copy2(mystderr, '/var/spool/arc/debugging')
               else:
                 m.write("mystderr gone - " + mystderr + "\n")

    m.close()
    return 0

def main():
    """Main"""

    import sys

    # Parse arguments

    if len(sys.argv) == 4:
        (exe, status, control_dir, jobid) = sys.argv
    else:
        ExitError("Wrong number of arguments\n",1)

    if status == "FINISHED":
       ArcDebuggingL(control_dir, jobid)
       sys.exit(0)

    sys.exit(1)

if __name__ == "__main__":
    main()


  • File: /usr/local/bin/default_rte_plugin.py
  • Notes: Sets up the default run time environment.
  • Customise: It should be generic.
  • Content:
#!/usr/bin/python
# Copyright 2014 Science and Technology Facilities Council
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#  http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Usage: default_rte_plugin.py <status> <control dir> <jobid> <runtime environment>

Authplugin for PREPARING STATE

Example:

  authplugin="PREPARING timeout=60,onfailure=pass,onsuccess=pass /usr/local/bin/default_rte_plugin.py %S %C %I <rte>"

"""

def ExitError(msg,code):
    """Print error message and exit"""
    from sys import exit
    print(msg)
    exit(code)

def SetDefaultRTE(control_dir, jobid, default_rte):

    from os.path import isfile

    desc_file = '%s/job.%s.description' %(control_dir,jobid)

    if not isfile(desc_file):
       ExitError("No such description file: %s"%desc_file,1)

    f = open(desc_file)
    desc = f.read()
    f.close()

    if default_rte not in desc:
       with open(desc_file, "a") as myfile:
          myfile.write("( runtimeenvironment = \"" + default_rte + "\" )")

    return 0

def main():
    """Main"""

    import sys

    # Parse arguments

    if len(sys.argv) == 5:
        (exe, status, control_dir, jobid, default_rte) = sys.argv
    else:
        ExitError("Wrong number of arguments\n"+__doc__,1)

    if status == "PREPARING":
        SetDefaultRTE(control_dir, jobid, default_rte)
        sys.exit(0)

    sys.exit(1)

if __name__ == "__main__":
    main()


  • File: /etc/lcmaps/lcmaps.db
  • Notes: Connects the authentication layer to an ARGUS server
  • Customise: Yes. It must be changed to suit your site.
  • Content:
path = /usr/lib64/lcmaps

verify_proxy = "lcmaps_verify_proxy.mod"
                    "-certdir /etc/grid-security/certificates"
                    "--discard_private_key_absence"
                    "--allow-limited-proxy"

pepc = "lcmaps_c_pep.mod"
            "--pep-daemon-endpoint-url https://hepgrid9.ph.liv.ac.uk:8154/authz"
            "--resourceid http://authz-interop.org/xacml/resource/resource-type/arc"
            "--actionid http://glite.org/xacml/action/execute"
            "--capath /etc/grid-security/certificates/"
            "--certificate /etc/grid-security/hostcert.pem"
            "--key /etc/grid-security/hostkey.pem"

# Policies:
arc:
verify_proxy -> pepc


  • File: /etc/profile.d/env.sh
  • Notes: Sets up environment variables for specific VO jobs.
  • Customise: Yes. It must be changed to suit your site.
  • Content:
if [ "X${GLITE_ENV_SET+X}" = "X" ]; then
. /usr/libexec/grid-env-funcs.sh
if [ "x${GLITE_UI_ARCH:-$1}" = "x32BIT" ]; then arch_dir=lib; else arch_dir=lib64; fi
gridpath_prepend     "PATH" "/bin"
gridpath_prepend     "MANPATH" "/opt/glite/share/man"
gridenv_set         "VO_ZEUS_SW_DIR" "/opt/exp_soft_sl5/zeus"
gridenv_set         "VO_ZEUS_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_VO_SIXT_CERN_CH_SW_DIR" "/opt/exp_soft_sl5/sixt"
gridenv_set         "VO_VO_SIXT_CERN_CH_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_VO_NORTHGRID_AC_UK_SW_DIR" "/opt/exp_soft_sl5/northgrid"
gridenv_set         "VO_VO_NORTHGRID_AC_UK_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_T2K_ORG_SW_DIR" "/cvmfs/t2k.gridpp.ac.uk"
gridenv_set         "VO_T2K_ORG_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_SNOPLUS_SNOLAB_CA_SW_DIR" "/cvmfs/snoplus.gridpp.ac.uk"
gridenv_set         "VO_SNOPLUS_SNOLAB_CA_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_PLANCK_SW_DIR" "/opt/exp_soft_sl5/planck"
gridenv_set         "VO_PLANCK_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_PHENO_SW_DIR" "/opt/exp_soft_sl5/pheno"
gridenv_set         "VO_PHENO_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_OPS_SW_DIR" "/opt/exp_soft_sl5/ops"
gridenv_set         "VO_OPS_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_NEISS_ORG_UK_SW_DIR" "/opt/exp_soft_sl5/neiss"
gridenv_set         "VO_NEISS_ORG_UK_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_NA62_VO_GRIDPP_AC_UK_SW_DIR" "/cvmfs/na62.gridpp.ac.uk"
gridenv_set         "VO_NA62_VO_GRIDPP_AC_UK_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_MICE_SW_DIR" "/cvmfs/mice.gridpp.ac.uk"
gridenv_set         "VO_MICE_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_MAGIC_SW_DIR" "/opt/exp_soft_sl5/magic"
gridenv_set         "VO_MAGIC_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_LHCB_SW_DIR" "/cvmfs/lhcb.cern.ch"
gridenv_set         "VO_LHCB_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_LSST_SW_DIR" "/opt/exp_soft_sl5/lsst"
gridenv_set         "VO_LSST_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_ILC_SW_DIR" "/cvmfs/ilc.desy.de"
gridenv_set         "VO_ILC_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_HONE_SW_DIR" "/cvmfs/hone.gridpp.ac.uk"
gridenv_set         "VO_HONE_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_GRIDPP_SW_DIR" "/opt/exp_soft_sl5/gridpp"
gridenv_set         "VO_GRIDPP_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_GEANT4_SW_DIR" "/opt/exp_soft_sl5/geant4"
gridenv_set         "VO_GEANT4_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_FUSION_SW_DIR" "/opt/exp_soft_sl5/fusion"
gridenv_set         "VO_FUSION_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_ESR_SW_DIR" "/opt/exp_soft_sl5/esr"
gridenv_set         "VO_ESR_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_EPIC_VO_GRIDPP_AC_UK_SW_DIR" "/opt/exp_soft_sl5/epic"
gridenv_set         "VO_EPIC_VO_GRIDPP_AC_UK_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_DZERO_SW_DIR" "/opt/exp_soft_sl5/dzero"
gridenv_set         "VO_DZERO_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_DTEAM_SW_DIR" "/opt/exp_soft_sl5/dteam"
gridenv_set         "VO_DTEAM_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_CMS_SW_DIR" "/opt/exp_soft_sl5/cms"
gridenv_set         "VO_CMS_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_CERNATSCHOOL_ORG_SW_DIR" "/cvmfs/cernatschool.gridpp.ac.uk"
gridenv_set         "VO_CERNATSCHOOL_ORG_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_CDF_SW_DIR" "/opt/exp_soft_sl5/cdf"
gridenv_set         "VO_CDF_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_CAMONT_SW_DIR" "/opt/exp_soft_sl5/camont"
gridenv_set         "VO_CAMONT_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_CALICE_SW_DIR" "/opt/exp_soft_sl5/calice"
gridenv_set         "VO_CALICE_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_BIOMED_SW_DIR" "/opt/exp_soft_sl5/biomed"
gridenv_set         "VO_BIOMED_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_ATLAS_SW_DIR" "/cvmfs/atlas.cern.ch/repo/sw"
gridenv_set         "VO_ATLAS_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "VO_ALICE_SW_DIR" "/opt/exp_soft_sl5/alice"
gridenv_set         "VO_ALICE_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "SITE_NAME" "UKI-NORTHGRID-LIV-HEP"
gridenv_set         "SITE_GIIS_URL" "hepgrid4.ph.liv.ac.uk"
gridenv_set         "RFIO_PORT_RANGE" ""20000,25000""
gridenv_set         "MYPROXY_SERVER" "lcgrbp01.gridpp.rl.ac.uk"
gridenv_set         "LCG_LOCATION" "/usr"
gridenv_set         "LCG_GFAL_INFOSYS" "lcg-bdii.gridpp.ac.uk:2170,topbdii.grid.hep.ph.ic.ac.uk:2170"
gridenv_set         "GT_PROXY_MODE" "old"
gridenv_set         "GRID_ENV_LOCATION" "/usr/libexec"
gridenv_set         "GRIDMAPDIR" "/etc/grid-security/gridmapdir"
gridenv_set         "GLITE_LOCATION_VAR" "/var"
gridenv_set         "GLITE_LOCATION" "/usr"
gridenv_set         "GLITE_ENV_SET" "TRUE"
gridenv_set         "GLEXEC_LOCATION" "/usr"
gridenv_set         "DPNS_HOST" "hepgrid11.ph.liv.ac.uk"
gridenv_set         "DPM_HOST" "hepgrid11.ph.liv.ac.uk"
. /usr/libexec/clean-grid-env-funcs.sh
fi
  • File: /etc/grid-security/grid-mapfile
  • Notes: Useful for directly mapping a user for testing. Superseded by ARGUS now, so optional.
  • Customise: Yes. It must be changed to suit your site.
  • Content:
"/C=UK/O=eScience/OU=Liverpool/L=CSD/CN=stephen jones" dteam184
  • File: /root/glitecfg/site-info.def
  • Notes: Just a copy of the site standard SID file. Used to make the accounts.
  • Content: as per site standard
  • File: /opt/glite/yaim/etc/users.conf
  • Notes: Just a copy of the site standard users.conf file. Used to make the accounts.
  • Content: as per site standard
  • File: /opt/glite/yaim/etc/groups.conf
  • Notes: Just a copy of the site standard groups.conf file. Used to make the accounts.
  • Content: as per site standard
  • File: /root/glitecfg/vo.d
  • Notes: Just a copy of the site standard vo.d dir. Used to make the accounts.
  • Content: as per site standard
  • File: /etc/arc/runtime/ENV/PROXY
  • Notes: Same as the head node version; see above. Stops error messages of one kind or another
  • Content: empty
  • File: /etc/init.d/nordugrid-arc-egiis
  • Notes: Stops error messages of one kind or another
  • Content: empty

Head Cron jobs

I had to add these cron jobs, illustrated with puppet stanzas.

  • Cron: jura
  • Purpose: Run the jura APEL reporter once a day (at 06:16)
  • Puppet stanza:
 cron { "jura":
   # DEBUG DEBUG DEBUG DEBUG
   #ensure => absent,
   command => "/usr/libexec/arc/jura /var/spool/arc/jobstatus &>> /var/log/arc/jura.log",
   user => root, hour => 6, minute => 16
 }
  • Cron: defrag
  • Purpose: Sets the defrag parameters dynamically (runs every five minutes)
  • Puppet stanza:
 cron { "set_defrag_parameters.sh":
   command => "/root/scripts/set_defrag_parameters.sh >> /var/log/set_defrag_parameters.log",
   require => File["/root/scripts/set_defrag_parameters.sh"],
   user => root,
   minute   => "*/5",
   hour     => "*",
   monthday => "*",
   month    => "*",
   weekday  => "*",
 }

Head Special notes

  • After installing the APEL package, I had to make one change by hand: on line 136 of the /usr/libexec/arc/ssmsend file, I added the parameter use_ssl = _use_ssl.
  • To set the GlueCEPolicyMaxCPUTime and GlueCEPolicyMaxWallClockTime values published by the BDII, change the lines that set them in /usr/share/arc/glue-generator.pl. For example:
GlueCEPolicyMaxCPUTime: 4320
GlueCEPolicyMaxWallClockTime: 4320

Notes on HEPSPEC Publishing Parameters

The basic process for publishing the HEPSPEC is similar to that used for TORQUE, and is described here: Publishing_tutorial. An alternative (but equivalent) explanation is here: http://northgrid-tech.blogspot.co.uk/2010/04/scaling-capacity-publishing-and.html

However, the Publishing_tutorial describes a situation where Yaim is used to convert and transfer the information. In this case, the same data has to be transposed into the arc.conf configuration file so that the ARC BDII can access and publish the values.

The following table shows how to map the Yaim values referenced in the tutorial to the relevant configuration settings in the ARC system.


Mapping of Yaim variables to ARC configuration settings
Description | Yaim variable | ARC Conf Section | Example ARC Variable | Notes
Total physical cpus in cluster | CE_PHYSCPU=114 | N/A | N/A | No equivalent in ARC
Total cores/logical-cpus/unislots/threads... in cluster | CE_LOGCPU=652 | [cluster] and [queue/grid] | totalcpus=652 | Only 1 queue; same in both sections
Accounting Scaling | CE_CAPABILITY="CPUScalingReferenceSI00=2500 ... | [grid-manager] | jobreport_options="... benchmark_value:2500.00" | Provides the reference for accounting
Power of 1 logical cpu, in HEPSPEC * 250 (bogoSI00) | CE_SI00 | [infosys/glue12] | cpu_scaling_reference_si00="2970" | See Yaim Manual; equivalent to benchmark * 250
Cores: the average unislots in a physical cpu | CE_OTHERDESCR=Cores=n.n, ... | [infosys/glue12] | processor_other_description="Cores=5.72 ..." | Yaim var was shared with Benchmark (below)
Benchmark: The scaled power of a single core/logical-cpu/unislot/thread ... | CE_OTHERDESCR=....,Benchmark=11.88-HEP-SPEC06 | [infosys/glue12] | processor_other_description="...,Benchmark=11.88-HEP-SPEC06" | Yaim var was shared with Cores (above)
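
The figures in the table are related by simple arithmetic, which the following sketch reproduces from the three base values (114 physical cpus, 652 logical cpus, 11.88 HEP-SPEC06 per logical cpu). The variable names are my own, and the exact layout of the processor_other_description string is an assumption based on the Cores=...,Benchmark=... pattern, since the table elides part of it.

```python
# Base values taken from the table above.
physical_cpus = 114        # CE_PHYSCPU
logical_cpus = 652         # CE_LOGCPU / totalcpus
benchmark_hs06 = 11.88     # HEP-SPEC06 per logical cpu (unislot)

# Cores: average number of unislots per physical cpu.
cores_per_cpu = round(logical_cpus / float(physical_cpus), 2)

# bogoSI00: the benchmark multiplied by 250.
si00 = int(round(benchmark_hs06 * 250))

# Total power the site should publish, in HEP-SPEC06.
total_power = benchmark_hs06 * logical_cpus

# Assumed string layout; check it against your own arc.conf.
print('processor_other_description="Cores=%.2f,Benchmark=%.2f-HEP-SPEC06"'
      % (cores_per_cpu, benchmark_hs06))
print('cpu_scaling_reference_si00="%d"' % si00)
print('Total published power: %.2f HEP-SPEC06' % total_power)
```

The total computed here (benchmark times logical cpus) is the quantity to compare against when testing the published values.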



Once the system is operating, the following script can be used to test the published power of your site.

#!/usr/bin/perl

my @glasgow = qw ( svr010.gla.scotgrid.ac.uk  svr011.gla.scotgrid.ac.uk  svr014.gla.scotgrid.ac.uk  svr026.gla.scotgrid.ac.uk);
# Earlier list of CEs, kept for reference:
# my @liverpoolCE = qw (hepgrid5.ph.liv.ac.uk hepgrid6.ph.liv.ac.uk hepgrid10.ph.liv.ac.uk hepgrid97.ph.liv.ac.uk );
my @liverpoolCE = qw (hepgrid2.ph.liv.ac.uk );

my $power = 0;
for my $server (@liverpoolCE  ) {
  my $p = getPower($server);
  $power = $power + $p;
}

print("Total power is $power\n");

sub getPower {

  my $bdii = "hepgrid2.ph.liv.ac.uk:2135";

  my $server = shift;

  open(CMD,"ldapsearch -LLL -x -h $bdii -b o=grid 'GlueSubClusterUniqueID=$server' |") or die("No get $server stuff");
  my $buf = '';
  my @lines;
  while (<CMD>) {
    chomp();
    if (/^ /) {
      s/^ //; $buf .= $_;
    }
    else {
      push(@lines,$buf); $buf = $_;
    }
  } 
  close(CMD);
  push(@lines,$buf);
  
  my $avgHepspec = -1;
  my $slots = -1;
  foreach my $l (@lines) {
    if ($l =~ /^GlueHostProcessorOtherDescription: Cores=([0-9\.]+),Benchmark=([0-9\.]+)-HEP-SPEC06/) {
      $avgHepspec = $2;
      print("avgHepspec -- $avgHepspec, $l\n");
    }
    if ($l =~ /^GlueSubClusterLogicalCPUs: ([0-9]+)/) {
      $slots = $1;
      print("slots      -- $slots\n");
    }
  }
  
  die("Reqd val not found $avgHepspec $slots \n") if (($avgHepspec == -1) or ($slots == -1));

  my $power =  $avgHepspec * $slots;
  print("power avgHepspec slots, $power, $avgHepspec, $slots\n");
  return $power;
}
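
The two steps the Perl script performs on the ldapsearch output (joining folded LDIF lines, then multiplying the benchmark by the number of logical cpus) can also be sketched in Python. This is only an illustration: the function names are my own, it works on text already captured from ldapsearch rather than running the command, and the sample record is made up.

```python
import re

def unfold_ldif(text):
    """Join LDIF continuation lines (those starting with one space) onto the previous line."""
    lines = []
    for raw in text.splitlines():
        if raw.startswith(' ') and lines:
            lines[-1] += raw[1:]
        else:
            lines.append(raw)
    return lines

def published_power(ldif_text):
    """Return (benchmark, slots, power) for one subcluster record."""
    benchmark = None
    slots = None
    for line in unfold_ldif(ldif_text):
        m = re.match(r'GlueHostProcessorOtherDescription: '
                     r'Cores=[0-9.]+,Benchmark=([0-9.]+)-HEP-SPEC06', line)
        if m:
            benchmark = float(m.group(1))
        m = re.match(r'GlueSubClusterLogicalCPUs: ([0-9]+)', line)
        if m:
            slots = int(m.group(1))
    if benchmark is None or slots is None:
        raise ValueError('required values not found')
    return benchmark, slots, benchmark * slots

# Made-up sample record; the description attribute is folded the way
# ldapsearch folds long lines.
sample = (
    "dn: GlueSubClusterUniqueID=hepgrid2.ph.liv.ac.uk,o=grid\n"
    "GlueHostProcessorOtherDescription: Cores=5.72,Benchmark=11.88-HEP-\n"
    " SPEC06\n"
    "GlueSubClusterLogicalCPUs: 652\n"
)
benchmark, slots, power = published_power(sample)
print("power %.2f = %.2f x %d" % (power, benchmark, slots))
```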

Install the LSC Files

I used VomsSnooper to do this as follows.

# cd /opt/GridDevel/vomssnooper/usecases/getLSCRecords  
# sed -i -e "s/ vomsdir/ \/etc\/grid-security\/vomsdir/g" getLSCRecords.sh
# ./getLSCRecords.sh

Yaim to make head user accounts, /etc/vomses file and glexec.conf etc.

I used Yaim to do this as follows.

# yaim  -r -s /root/glitecfg/site-info.def -n ABC -f config_users
# yaim  -r -s /root/glitecfg/site-info.def -n ABC -f config_vomses
# /opt/glite/yaim/bin/yaim -c -s /root/glitecfg/site-info.def -n GLEXEC_wn 

For this to work, the site-info.def file must already be present, and a users.conf file and a groups.conf file must exist in the /opt/glite/yaim/etc/ directory. These are usually part of any grid system CE install, but advice on how to prepare them is given in this Yaim guide (which I hope will be maintained for a little while longer).

https://twiki.cern.ch/twiki/bin/view/LCG/YaimGuide400

Head Services

I had to set some services running.

a-rex - the ARC CE service
condor - the CONDOR batch system service
nordugrid-arc-ldap-infosys - part of the bdii
nordugrid-arc-slapd - part of the bdii
nordugrid-arc-bdii - part of the bdii
gridftpd - the gridftp service


And that was it. That's all I did to get the server working, as far as I can recall.

Worker Node

Worker Standard build

As for the headnode, the basis for the initial worker node build follows the standard model for any workernode at Liverpool, prior to the installation of any middleware. Such a baseline build might include networking, cvmfs, iptables, nagios scripts, ganglia, ssh etc.

Aside: After an installation mistake, it was discovered that an ordinary TORQUE workernode could be used as the basis of the build, and it would then be possible to use the same worker node on both ARC/CONDOR and CREAM/TORQUE systems, but not simultaneously. This idea was not pursued, however.

Worker Extra Directories

I needed to make these directories:

/root/glitecfg
/opt/exp_soft_sl5/
/etc/arc/runtime/ENV
/etc/condor/config.d
/etc/grid-security/gridmapdir

On the Liverpool cluster, we have VO software areas under:

/opt/exp_soft_sl5

On our system, this is actually a mount point to a central location. CVMFS takes over this role now, but it might be necessary to set up a shared mount system such as this and point the VO software directories to it, as shown in the head node file /etc/profile.d/env.sh (see above).

Worker Additional Packages

We had to install the main CONDOR package:

condor

We also had to install various bits of extra middleware:

emi-wn     # for glite-brokerinfo (at least)
lcg-util
lcg-util-libs
lcg-util-python
lfc-devel
lfc
lfc-perl
lfc-python
uberftp
voms-clients3
voms
gfal2-plugin-lfc
HEP_OSlibs_SL6

These libraries were also needed:

libXft-devel
libxml2-devel
libXpm-devel

We also installed some things, mostly for various VOs, I think:

bzip2-devel
compat-gcc-34-c++
compat-gcc-34-g77
gcc-c++
gcc-gfortran
git
gmp-devel
imake
ipmitool
libgfortran
liblockfile-devel
ncurses-devel
python

Worker Files

  • File: /root/scripts/set_node_parameters.pl
  • Notes: This script senses the type of the system and sets it up according to how many slots it has etc. You'll also have to make arrangements to run this script once when you set up the machine. On the Liverpool system, this is done with the following puppet stanza.
exec { "set_node_parameters.pl": command =>  "/root/scripts/set_node_parameters.pl > /etc/condor/config.d/00-node_parameters; \
/bin/touch /root/scripts/done-set_node_parameters.pl", require => [ File["/root/scripts/set_node_parameters.pl"], 
File["/etc/condor/config.d"] ], onlyif => "/usr/bin/test ! -f /root/scripts/done-set_node_parameters.pl", timeout => "86400" }
  • Customise: Yes. You'll need to edit it to suit your site.
  • Content:
#!/usr/bin/perl

use strict;

open(CPUINFO,"/proc/cpuinfo") or die("Can't open /proc/cpuinfo, $!");
while(<CPUINFO>) {
  if (/model name/) {
    s/.*CPU\s*//;s/\s.*//;
    if (/E5620/){ 
      print ("RalNodeLabel = E5620\n");
      print ("RalScaling =  1.205\n");
      print ("NUM_SLOTS = 1\n");
      print ("SLOT_TYPE_1               = cpus=10,mem=auto,disk=auto\n");
      print ("NUM_SLOTS_TYPE_1          = 1\n");
      print ("SLOT_TYPE_1_PARTITIONABLE = TRUE\n");
      close(CPUINFO); exit(0);
    }
    elsif (/L5420/){ 
      print ("RalNodeLabel = L5420\n");
      print ("RalScaling =  0.896\n");
      print ("NUM_SLOTS = 1\n");
      print ("SLOT_TYPE_1               = cpus=8,mem=auto,disk=auto\n");
      print ("NUM_SLOTS_TYPE_1          = 1\n");
      print ("SLOT_TYPE_1_PARTITIONABLE = TRUE\n");
      close(CPUINFO); exit(0);
    }
    elsif (/X5650/){ 
      print ("RalNodeLabel = X5650\n");
      print ("RalScaling =  1.229\n");
      print ("NUM_SLOTS = 1\n");
      print ("SLOT_TYPE_1               = cpus=16,mem=auto,disk=auto\n");
      print ("NUM_SLOTS_TYPE_1          = 1\n");
      print ("SLOT_TYPE_1_PARTITIONABLE = TRUE\n");
      close(CPUINFO); exit(0);
    }
    elsif (/E5-2630/){ 
      print ("RalNodeLabel = E5-2630\n");
      print ("RalScaling =  1.386\n");
      print ("NUM_SLOTS = 1\n");
      print ("SLOT_TYPE_1               = cpus=18,mem=auto,disk=auto\n");
      print ("NUM_SLOTS_TYPE_1          = 1\n");
      print ("SLOT_TYPE_1_PARTITIONABLE = TRUE\n");
      close(CPUINFO); exit(0);
    }
    else {
      print ("RalNodeLabel = BASELINE\n");
      print ("RalScaling =  1.0\n"); 
      print ("NUM_SLOTS = 1\n");
      print ("SLOT_TYPE_1               = cpus=8,mem=auto,disk=auto\n");
      print ("NUM_SLOTS_TYPE_1          = 1\n");
      print ("SLOT_TYPE_1_PARTITIONABLE = TRUE\n");
      close(CPUINFO); exit(0);
    }
  }
}
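
The Perl script above repeats the same six print statements for each CPU type. As a sketch of the same logic in table-driven form, the Python below keeps the per-model values in one list. The function name and the sample model strings are my own; the labels, scaling factors and cpu counts are copied from the script, and they are Liverpool values that you would replace with your own.

```python
import re

# (label, RalScaling, cpus) copied from the Perl script above.
NODE_TYPES = [
    ('E5620', 1.205, 10),
    ('L5420', 0.896, 8),
    ('X5650', 1.229, 16),
    ('E5-2630', 1.386, 18),
]
BASELINE = ('BASELINE', 1.0, 8)

def node_parameters(model_name):
    """Return the HTCondor config lines for a /proc/cpuinfo 'model name' value."""
    # Same reduction as the Perl: drop everything up to "CPU", keep the first word.
    token = re.sub(r'\s.*', '', re.sub(r'.*CPU\s*', '', model_name.strip()))
    for label, scaling, cpus in NODE_TYPES:
        if label in token:
            break
    else:
        # No known model matched, so fall back to the baseline values.
        label, scaling, cpus = BASELINE
    return [
        'RalNodeLabel = %s' % label,
        'RalScaling = %s' % scaling,
        'NUM_SLOTS = 1',
        'SLOT_TYPE_1 = cpus=%d,mem=auto,disk=auto' % cpus,
        'NUM_SLOTS_TYPE_1 = 1',
        'SLOT_TYPE_1_PARTITIONABLE = TRUE',
    ]

# Illustrative model string, in the usual /proc/cpuinfo style.
for config_line in node_parameters('Intel(R) Xeon(R) CPU E5620 @ 2.40GHz'):
    print(config_line)
```

Adding a new node type then means adding one entry to NODE_TYPES rather than another elsif block.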


  • File: /etc/condor/condor_config.local
  • Notes: The main client condor configuration custom file.
  • Customise: Yes. You'll need to edit it to suit your site.
  • Content:
##  What machine is your central manager?

CONDOR_HOST = hepgrid2.ph.liv.ac.uk

## Pool's short description

COLLECTOR_NAME = Condor at $(FULL_HOSTNAME)

## Put the output in a huge dir

EXECUTE = /data/condor_pool/

##  Make it switchable when this machine is willing to start a job 

ENABLE_PERSISTENT_CONFIG = TRUE
PERSISTENT_CONFIG_DIR = /etc/condor/ral
STARTD_ATTRS = $(STARTD_ATTRS) StartJobs, RalNodeOnline
STARTD.SETTABLE_ATTRS_ADMINISTRATOR = StartJobs 
StartJobs = False
RalNodeOnline = False

START = ((StartJobs =?= True) && (RalNodeOnline =?= True))

##  When to suspend a job?

SUSPEND = FALSE

##  When to nicely stop a job?
# When a job is running and the PREEMPT expression evaluates to True, the 
# condor_startd will evict the job. The PREEMPT expression should reflect the 
# requirements under which the machine owner will not permit a job to continue to run. 
# For example, a policy to evict a currently running job when a key is hit or when 
# it is the 9:00am work arrival time, would be expressed in the PREEMPT expression 
# and enforced by the condor_startd. 

PREEMPT = FALSE

# If there is a job from a higher priority user sitting idle, the 
# condor_negotiator daemon may evict a currently running job submitted 
# from a lower priority user if PREEMPTION_REQUIREMENTS is True.

PREEMPTION_REQUIREMENTS = FALSE

# No job has pref over any other

#RANK = FALSE

##  When to instantaneously kill a preempting job
##  (e.g. if a job is in the pre-empting stage for too long)

KILL = FALSE

##  This macro determines what daemons the condor_master will start and keep its watchful eyes on.
##  The list is a comma or space separated list of subsystem names

DAEMON_LIST = MASTER, STARTD

ALLOW_WRITE = *

#######################################
# scaling 
#

STARTD_ATTRS = $(STARTD_ATTRS) RalScaling RalNodeLabel

#######################################
# Andrew Lahiff's tip for over committing memory

#MEMORY = 1.35 * quantize( $(DETECTED_MEMORY), 1000 )
MEMORY = 2.2 * quantize( $(DETECTED_MEMORY), 1000 )

#######################################
# Andrew Lahiff's security

ALLOW_WRITE = 

UID_DOMAIN = ph.liv.ac.uk

CENTRAL_MANAGER1 = hepgrid2.ph.liv.ac.uk
COLLECTOR_HOST = $(CENTRAL_MANAGER1)

# Central managers
CMS = condor_pool@$(UID_DOMAIN)/hepgrid2.ph.liv.ac.uk

# CEs
CES = condor_pool@$(UID_DOMAIN)/hepgrid2.ph.liv.ac.uk

# Worker nodes
WNS = condor_pool@$(UID_DOMAIN)/192.168.*

# Users
USERS = *@$(UID_DOMAIN)
USERS = *

# Required for HA
HOSTALLOW_NEGOTIATOR = $(COLLECTOR_HOST)
HOSTALLOW_ADMINISTRATOR = $(COLLECTOR_HOST)
HOSTALLOW_NEGOTIATOR_SCHEDD = $(COLLECTOR_HOST)

# Authorization
HOSTALLOW_WRITE =
ALLOW_READ = */*.ph.liv.ac.uk
NEGOTIATOR.ALLOW_WRITE = $(CES), $(CMS)
COLLECTOR.ALLOW_ADVERTISE_MASTER = $(CES), $(CMS), $(WNS)
COLLECTOR.ALLOW_ADVERTISE_SCHEDD = $(CES)
COLLECTOR.ALLOW_ADVERTISE_STARTD = $(WNS)
SCHEDD.ALLOW_WRITE = $(USERS)
SHADOW.ALLOW_WRITE = $(WNS), $(CES)
ALLOW_DAEMON = condor_pool@$(UID_DOMAIN)/*.ph.liv.ac.uk, $(FULL_HOSTNAME)
ALLOW_ADMINISTRATOR = root@$(UID_DOMAIN)/$(IP_ADDRESS), condor_pool@$(UID_DOMAIN)/$(IP_ADDRESS), $(CMS)
ALLOW_CONFIG = root@$(FULL_HOSTNAME)

# Temp debug
#ALLOW_WRITE = $(FULL_HOSTNAME), $(IP_ADDRESS), $(CONDOR_HOST)


# Don't allow nobody to run jobs
SCHEDD.DENY_WRITE = nobody@$(UID_DOMAIN)

# Authentication
SEC_PASSWORD_FILE = /etc/condor/pool_password
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_READ_AUTHENTICATION = OPTIONAL
SEC_CLIENT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = PASSWORD,FS
SCHEDD.SEC_WRITE_AUTHENTICATION_METHODS = FS,PASSWORD
SCHEDD.SEC_DAEMON_AUTHENTICATION_METHODS = FS,PASSWORD
SEC_CLIENT_AUTHENTICATION_METHODS = FS,PASSWORD,CLAIMTOBE
SEC_READ_AUTHENTICATION_METHODS = FS,PASSWORD,CLAIMTOBE

# Integrity
SEC_DEFAULT_INTEGRITY  = REQUIRED
SEC_DAEMON_INTEGRITY = REQUIRED
SEC_NEGOTIATOR_INTEGRITY = REQUIRED

# Separation
USE_PID_NAMESPACES = False

# Smooth updates
MASTER_NEW_BINARY_RESTART = PEACEFUL

# Give jobs 3 days
MAXJOBRETIREMENTTIME = 3600 * 24 * 3

# Port limits
HIGHPORT = 65000
LOWPORT = 20000

# Startd Crons
STARTD_CRON_JOBLIST=TESTNODE
STARTD_CRON_TESTNODE_EXECUTABLE=/usr/libexec/condor/scripts/testnodeWrapper.sh
STARTD_CRON_TESTNODE_PERIOD=300s

# Make sure values get over
STARTD_CRON_AUTOPUBLISH = If_Changed

# One job per claim
CLAIM_WORKLIFE = 0
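
The MEMORY line in the configuration above over-commits memory by rounding the detected memory up to a multiple of 1000 and multiplying the result by 2.2. The sketch below reproduces that arithmetic. It assumes that quantize() rounds up to the next integral multiple of its second argument, which is how I understand the ClassAd function to behave for plain numbers, and the detected value of 23987 MB is a hypothetical example.

```python
import math

def quantize(value, step):
    """Round value up to the next integral multiple of step."""
    return int(math.ceil(value / float(step))) * step

def advertised_memory(detected_mb, factor=2.2):
    """Memory the startd would advertise under MEMORY = factor * quantize(detected, 1000)."""
    return factor * quantize(detected_mb, 1000)

detected = 23987   # hypothetical DETECTED_MEMORY value, in MB
print("quantized: %d MB" % quantize(detected, 1000))
print("advertised: %.0f MB" % advertised_memory(detected))
```

With the commented-out factor of 1.35 the same node would advertise correspondingly less, so the factor is the knob to adjust if jobs start to exhaust real memory.
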
  • File: /etc/profile.d/liv-lcg-env.sh
  • Notes: Some environment script needed by the system.
  • Customise: Yes. You'll need to edit it to suit your site.
  • Content:
export ATLAS_RECOVERDIR=/data/atlas
EDG_WL_SCRATCH=$TMPDIR

ID=`id -u`

if [ $ID -gt 19999 ]; then
  ulimit -v 10000000
fi


  • File: /etc/profile.d/liv-lcg-env.csh
  • Notes: Some other environment script needed by the system.
  • Customise: Yes. You'll need to edit it to suit your site.
  • Content:
setenv ATLAS_RECOVERDIR /data/atlas
if ( "$?TMPDIR" == "1" ) then
setenv EDG_WL_SCRATCH $TMPDIR
else
setenv EDG_WL_SCRATCH ""
endif



  • File: /etc/arc/runtime/ENV/PROXY
  • Notes: Same as the head node version; see above. Stops error messages of one kind or another
  • Content: empty
  • File: /usr/etc/globus-user-env.sh
  • Notes: Jobs just need it to be there.
  • Content: empty
  • File: /etc/arc/runtime/ENV/GLITE
  • Notes: Same as the head node version; see above. The GLITE runtime environment.
  • Content:
#!/bin/sh

#export LD_LIBRARY_PATH=/opt/xrootd/lib
export GLOBUS_LOCATION=/usr

if [ "x$1" = "x0" ]; then
  # Set environment variable containing queue name
  env_idx=0
  env_var="joboption_env_$env_idx"
  while [ -n "${!env_var}" ]; do
     env_idx=$((env_idx+1))
     env_var="joboption_env_$env_idx"
  done 
  eval joboption_env_$env_idx="NORDUGRID_ARC_QUEUE=$joboption_queue"
	
  export RUNTIME_ENABLE_MULTICORE_SCRATCH=1

fi

if [ "x$1" = "x1" ]; then
  # Set grid environment
  if [ -e /etc/profile.d/env.sh ]; then
     source /etc/profile.d/env.sh
  fi 
  if [ -e /etc/profile.d/zz-env.sh ]; then
     source /etc/profile.d/zz-env.sh
  fi
  export LD_LIBRARY_PATH=/opt/xrootd/lib

  # Set basic environment variables
  export GLOBUS_LOCATION=/usr
  HOME=`pwd`
  export HOME
  USER=`whoami`
  export USER
  HOSTNAME=`hostname -f`
  export HOSTNAME
fi

export VO_ALICE_SW_DIR=/opt/exp_soft_sl5/alice
export VO_ATLAS_SW_DIR=/cvmfs/atlas.cern.ch/repo/sw
export VO_BIOMED_SW_DIR=/opt/exp_soft_sl5/biomed
export VO_CALICE_SW_DIR=/opt/exp_soft_sl5/calice
export VO_CAMONT_SW_DIR=/opt/exp_soft_sl5/camont
export VO_CDF_SW_DIR=/opt/exp_soft_sl5/cdf
export VO_CERNATSCHOOL_ORG_SW_DIR=/opt/exp_soft_sl5/cernatschool
export VO_CMS_SW_DIR=/opt/exp_soft_sl5/cms
export VO_DTEAM_SW_DIR=/opt/exp_soft_sl5/dteam
export VO_DZERO_SW_DIR=/opt/exp_soft_sl5/dzero
export VO_EPIC_VO_GRIDPP_AC_UK_SW_DIR=/opt/exp_soft_sl5/epic
export VO_ESR_SW_DIR=/opt/exp_soft_sl5/esr
export VO_FUSION_SW_DIR=/opt/exp_soft_sl5/fusion
export VO_GEANT4_SW_DIR=/opt/exp_soft_sl5/geant4
export VO_GRIDPP_SW_DIR=/opt/exp_soft_sl5/gridpp
export VO_HONE_SW_DIR=/cvmfs/hone.egi.eu
export VO_HYPERK_ORG_SW_DIR=/cvmfs/hyperk.egi.eu
export VO_ILC_SW_DIR=/cvmfs/ilc.desy.de
export VO_LHCB_SW_DIR=/cvmfs/lhcb.cern.ch
export VO_LSST_SW_DIR=/opt/exp_soft_sl5/lsst
export VO_MAGIC_SW_DIR=/opt/exp_soft_sl5/magic
export VO_MICE_SW_DIR=/cvmfs/mice.egi.eu
export VO_NA62_VO_GRIDPP_AC_UK_SW_DIR=/cvmfs/na62.egi.eu
export VO_NEISS_ORG_UK_SW_DIR=/opt/exp_soft_sl5/neiss
export VO_OPS_SW_DIR=/opt/exp_soft_sl5/ops
export VO_PHENO_SW_DIR=/opt/exp_soft_sl5/pheno
export VO_PLANCK_SW_DIR=/opt/exp_soft_sl5/planck
export VO_SNOPLUS_SNOLAB_CA_SW_DIR=/cvmfs/snoplus.egi.eu
export VO_T2K_ORG_SW_DIR=/cvmfs/t2k.egi.eu
export VO_VO_NORTHGRID_AC_UK_SW_DIR=/opt/exp_soft_sl5/northgrid
export VO_VO_SIXT_CERN_CH_SW_DIR=/opt/exp_soft_sl5/sixt
export VO_ZEUS_SW_DIR=/opt/exp_soft_sl5/zeus

export RUCIO_HOME=/cvmfs/atlas.cern.ch/repo/sw/ddm/rucio-clients/0.1.12
export RUCIO_AUTH_TYPE=x509_proxy 

export LCG_GFAL_INFOSYS="lcg-bdii.gridpp.ac.uk:2170,topbdii.grid.hep.ph.ic.ac.uk:2170"


  • File: /etc/condor/pool_password
  • Notes: Will have its own section (TBD)
  • Customise: Yes.
  • Content: The content is the same as the one on the head node (see above).


 Password Authentication
 The password method provides mutual authentication through the use of a shared
 secret. This is often a good choice when strong security is desired, but an existing
 Kerberos or X.509 infrastructure is not in place. Password authentication
 is available on both Unix and Windows. It currently can only be used for
 daemon-to-daemon authentication. The shared secret in this context is referred to as
 the pool password. Before a daemon can use password authentication, the pool
 password must be stored on the daemon’s local machine. On Unix, the password will
 be placed in a file defined by the configuration variable SEC_PASSWORD_FILE. This file
 will be accessible only by the UID that HTCondor is started as. On Windows, the same
 secure password store that is used for user passwords will be used for the pool
 password (see section 7.2.3). Under Unix, the password file can be generated by
 using the following command to write directly to the password file:
 condor_store_cred -f /path/to/password/file
 
  • File: /root/glitecfg/site-info.def
  • Notes: Just a copy of the site standard SID file. Used to make the accounts.
  • Content: as per site standard
  • File: /opt/glite/yaim/etc/users.conf
  • Notes: Just a copy of the site standard users.conf file. Used to make the accounts.
  • Content: as per site standard
  • File: /opt/glite/yaim/etc/groups.conf
  • Notes: Just a copy of the site standard groups.conf file. Used to make the accounts.
  • Content: as per site standard
  • File: /root/glitecfg/vo.d
  • Notes: Just a copy of the site standard vo.d dir. Used to make the accounts.
  • Content: as per site standard
  • File: /etc/lcas/lcas-glexec.db
  • Notes: Stops yaim from complaining about missing file
  • Content: empty

Worker Cron jobs

We run a cronjob to keep cvmfs clean:

0 5 */3 * * /root/bin/cvmfs_fsck.sh >> /var/log/cvmfs_fsck.log 2>&1

Worker Special notes

None to speak of (yet).

Worker user accounts

As with the head node, I used Yaim to do this as follows.

# yaim  -r -s /root/glitecfg/site-info.def -n ABC -f config_users

For this to work, a users.conf file and a groups.conf file must already exist in the /opt/glite/yaim/etc/ directory. This is usually a part of any grid system CE install, but advice on how to prepare these is given in this Yaim guide (that I hope will be maintained for a little while longer).

https://twiki.cern.ch/twiki/bin/view/LCG/YaimGuide400

Worker Services

You have to set this service running:

condor

Workernode Health Check Script

This is a script that makes some checks on the worker node and "turns it off" if it fails any of them. To implement this, I use a CONDOR feature: startd_cron jobs.

This config in the /etc/condor_config.local file on my worker nodes defines some new configuration variables.

ENABLE_PERSISTENT_CONFIG = TRUE
PERSISTENT_CONFIG_DIR = /etc/condor/ral
STARTD_ATTRS = $(STARTD_ATTRS) StartJobs, RalNodeOnline
STARTD.SETTABLE_ATTRS_ADMINISTRATOR = StartJobs
StartJobs = False
RalNodeOnline = False

The prefix "Ral" is used here because some of this material is inherited from Andrew Lahiff at RAL. It's just to de-conflict names.

Anyway, the first section says to keep a persistent record of configuration settings; it adds new configuration settings called "StartJobs" and “RalNodeOnline”; and it sets them initially to False. The START configuration setting is made dependent upon them both being True. Note: the START setting is very important because the node won't start jobs unless it is True.

Next, this config, also in the /etc/condor_config.local file, tells the system (startd) to run a cron script every five minutes.

STARTD_CRON_JOBLIST=TESTNODE
STARTD_CRON_TESTNODE_EXECUTABLE=/usr/libexec/condor/scripts/testnodeWrapper.sh
STARTD_CRON_TESTNODE_PERIOD=300s

# Make sure values get over
STARTD_CRON_AUTOPUBLISH = If_Changed

The testnodeWrapper.sh script looks like this:

#!/bin/bash

MESSAGE=OK

/usr/libexec/condor/scripts/testnode.sh > /dev/null 2>&1
STATUS=$?

if [ $STATUS != 0 ]; then
  # Look up the name of the variable in testnode.sh whose value equals the exit code
  MESSAGE=`grep "^[A-Z0-9_][A-Z0-9_]*=$STATUS\$" /usr/libexec/condor/scripts/testnode.sh | head -n 1 | sed -e "s/=.*//"`
  if [[ -z "$MESSAGE" ]]; then
    MESSAGE=ERROR
  fi
fi

if [[ $MESSAGE =~ ^OK$ ]]; then
  echo "RalNodeOnline = True"
else
  echo "RalNodeOnline = False"
fi
echo "RalNodeOnlineMessage = $MESSAGE"

echo `date`, message $MESSAGE >> /tmp/testnode.status
exit 0

This just wraps an existing script which I reuse from our TORQUE/MAUI cluster. The existing script just returns a non-zero code if any error happens. To add a bit of extra info, the wrapper also looks up the meaning of the code. The important thing to notice is that it echoes out a line to set the RalNodeOnline setting to True or False. This is then used in the setting of START. Note: on TORQUE/MAUI, the script ran as “root”; here it runs as “condor”. It uses sudo for some of the sections (e.g. those which check disks) because condor could not otherwise get smartctl settings etc.

When a node fails the test, START goes to False and the node won't run more jobs.
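
The decision made by the wrapper can be sketched in Python as follows. This is only an illustration of the logic, not the script we run; the function name and the sample error codes (CVMFS_BROKEN, DISK_FULL) are made up for the example.

```python
def node_online_classad(status, code_names):
    """Return the ClassAd lines that the startd cron job would print.

    status     -- exit code of the health check script (0 means healthy)
    code_names -- hypothetical mapping of exit codes to error names
    """
    if status == 0:
        message = "OK"
    else:
        # Unknown codes fall back to a generic message, as in the wrapper
        message = code_names.get(status, "ERROR")
    online = "True" if message == "OK" else "False"
    return [
        "RalNodeOnline = %s" % online,
        "RalNodeOnlineMessage = %s" % message,
    ]


if __name__ == "__main__":
    names = {3: "CVMFS_BROKEN", 7: "DISK_FULL"}  # made-up example codes
    for line in node_online_classad(7, names):
        print(line)
```

Whatever the outcome, the real wrapper exits with 0, so a failed health check is reported through the published attributes rather than through the exit code.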

Note that we use two settings to control START. As well as RalNodeOnline, we have the StartJobs setting. We can control this independently, so we can turn a node offline whether or not it has an error. This is useful for stopping the node to (say) rebuild it. It's done on the server, like this.

condor_config_val -verbose -name r21-n01 -startd -set "StartJobs = false"
condor_reconfig r21-n01
condor_reconfig -daemon startd r21-n01

GOCDB Entries and Registration

Add new service entries for the head node in GOCDB for the following service types.

  • gLite-APEL
  • gLExec
  • ARC-CE

It is safe to monitor all these services, once they are marked in production.

Also contact representatives of the big experiments and tell them about the new CE. Ask Atlas to add the new CE in its analysis, production and multicore job queues.

Notes on Scaling, and Publishing/Accounting

Republishing Accounting Records

Republishing records from ARC is only possible for APEL if the archiving option was set up in arc.conf (see above for the settings). If this was set for the period covered, you can use the script below (called merge-and-create-publish.sh, and written by Jernej Porenta) for collecting the relevant archived records and putting them in the republishing directory. After doing this, you can run jura publishing in the normal manner, or wait for the cron job to kick off. You must set the following attributes in the script before running it.

  • archiving directory
  • required data gap
  • output directory for a new file

#!/bin/bash

# Script to create republish data for JURA from archive dir

# JURA archive dir, where all the old accounting records from ARC are saved (archiving setting from jobreport_options in arc.conf)
ARCHIVEDIR="/var/run/arc/urs/"

# Time frame of republish data
FROM="28-Feb-2015"
TO="02-Apr-2015"

# Output directory for new files, which should go into JURA outgoing dir (usually: /var/spool/arc/ssm/<APEL server>/outgoing/00000000)
OUTPUT="/var/spool/arc/ssm/mq.cro-ngi.hr/outgoing/00000000/"

#####

TMPFILE="file.$$"

if [ ! -d "$OUTPUT" ] || [ ! -d "$ARCHIVEDIR" ]; then
        echo "Output or Archive dir is missing"
        exit 1
fi


# find all accounting records from archive dir with modification time in the specified timeframe and paste the records into temporary file
find $ARCHIVEDIR -type f -name 'usagerecordCAR.*' -newermt "$FROM -1 sec" -and -not -newermt "$TO -1 sec" -printf "%C@ %p\n" | sort | awk '{ print $2 }' | xargs -L1 -- grep -h UsageRecord >> $TMPFILE

# fix issues with missing CpuDuration
perl -p -i -e 's|WallDuration><ServiceLevel|WallDuration><CpuDuration urf:usageType="all">PT0S</CpuDuration><ServiceLevel|' $TMPFILE

# split the temporary file into smaller files with only 999 accounting records each
split -a 4 -l 999 -d $TMPFILE $OUTPUT/

# rename the files into format that JURA publisher will understand
for F in `find $OUTPUT -type f`; do
        FILE=`basename $F`
        NEWFILE=`date -d "$FROM + $FILE second" +%Y%m%d%H%M%S`
        mv -v $OUTPUT/$FILE $OUTPUT/$NEWFILE
done

# prepend XML tags for accounting files
find $OUTPUT -type f -print0 | xargs -0 -L1 -- sed -i '1s/^/<?xml version="1.0"?>\n<UsageRecords xmlns="http:\/\/eu-emi.eu\/namespaces\/2012\/11\/computerecord">\n/'

# attach XML tags for accounting files
for file in `find $OUTPUT -type f`; do
        echo "</UsageRecords>" >> $file
done

rm -f $TMPFILE

echo "Publish files are in $OUTPUT directory"
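
The way the script splits the records and names the output files can be sketched in Python as follows. This is an illustration only and it does not touch any files; the function name chunk_names is hypothetical. It mirrors the split into files of at most 999 records and the renaming of each file to the start date plus the file index in seconds.

```python
from datetime import datetime, timedelta


def chunk_names(start, n_records, chunk_size=999):
    """Return (file name, record count) pairs for n_records usage records.

    Each chunk is named after `start` plus the chunk index in seconds,
    formatted as YYYYMMDDHHMMSS, which mirrors the rename loop in the script.
    """
    names = []
    index = 0
    remaining = n_records
    while remaining > 0:
        count = min(chunk_size, remaining)
        stamp = (start + timedelta(seconds=index)).strftime("%Y%m%d%H%M%S")
        names.append((stamp, count))
        remaining -= count
        index += 1
    return names


if __name__ == "__main__":
    # 2500 records starting at the FROM date used in the script
    for name, count in chunk_names(datetime(2015, 2, 28), 2500):
        print(name, count)
```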

Setting up Publishing

This section is based on a publishing tutorial written some time ago: Publishing_tutorial.

The salient points in this document explain (A) how to apply scaling factors to individual nodes in a mixed cluster and (B) how total power of a site is transmitted. I'll first lay out how it was done in CREAM/TORQUE/MAUI and then explain the changes required to make it work with ARC/CONDOR.

Historical Set-up with CREAM/TORQUE/MAUI

Application of Scaling Factors

At Liverpool, we introduced an abstract node-type, called BASELINE, with a reference value of 10 HEPSPEC. This is transmitted to the information system on a per CE basis, and can be seen as follows.

$ ldapsearch -LLL -x -h hepgrid4:2170 -b o=grid GlueCEUniqueID=hepgrid5.ph.liv.ac.uk:8443/cream-pbs-long GlueCECapability | perl -p0e 's/\n //g'

GlueCECapability: CPUScalingReferenceSI00=2500

All CEs share the same value. Note: The value of 2500 corresponds to 10 HEPSPEC expressed in “bogoSpecInt2k” (which is equal to 1/250th of a HEPSPEC).

All real nodes receive a TORQUE scaling factor that describes how powerful their slots are relative to the abstract reference. For example, a machine with slightly less powerful slots than BASELINE might have a factor of 0.896. TORQUE then automatically normalises cpu durations with the scaling factor. Thus the accounting system merely needs to know the CPUScalingReferenceSI00 value to be able to compute work done.
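
As a worked example of the arithmetic, the following Python sketch converts raw CPU seconds into HEPSPEC hours using a node's scaling factor and the reference value. The function name is hypothetical, and the sketch only illustrates the calculation described above; it is not code taken from TORQUE or from the accounting system.

```python
def hepspec_hours(raw_cpu_seconds, scaling_factor, reference_si00=2500):
    """Convert raw CPU seconds on one node type into HEPSPEC hours."""
    # The batch system scales the raw duration to BASELINE-equivalent seconds
    scaled_seconds = raw_cpu_seconds * scaling_factor
    # 250 bogoSpecInt2k correspond to 1 HEPSPEC
    reference_hepspec = reference_si00 / 250.0
    return scaled_seconds / 3600.0 * reference_hepspec


if __name__ == "__main__":
    # One hour of CPU on a node with a scaling factor of 0.896
    print(hepspec_hours(3600, 0.896))
```

So one hour of CPU time on a node with a factor of 0.896 counts as 0.896 hours on the BASELINE node, or about 8.96 HEPSPEC hours.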

Transmit Total Power of a Site

The total power of a site is conveyed to the information system by sending out values for Total Logical Cpus (or unislots) and Benchmark (average power of a single slot) and multiplying them together. It is done on a per CE basis, and the calculation at Liverpool (which then had 4 CREAM CEs) looks like this:

$ ldapsearch -LLL -x -h hepgrid4:2170 -b o=grid GlueSubClusterUniqueID=hepgrid5.ph.liv.ac.uk GlueSubClusterLogicalCPUs GlueHostProcessorOtherDescription | perl -p0e 's/\n //g'

GlueSubClusterLogicalCPUs: 1
GlueHostProcessorOtherDescription: Cores=6.23,Benchmark=12.53-HEP-SPEC06
$ ldapsearch -LLL -x -h hepgrid4:2170 -b o=grid GlueSubClusterUniqueID=hepgrid6.ph.liv.ac.uk GlueSubClusterLogicalCPUs GlueHostProcessorOtherDescription | perl -p0e 's/\n //g'
GlueSubClusterLogicalCPUs: 1
GlueHostProcessorOtherDescription: Cores=6.23,Benchmark=12.53-HEP-SPEC06
$ ldapsearch -LLL -x -h hepgrid4:2170 -b o=grid GlueSubClusterUniqueID=hepgrid10.ph.liv.ac.uk GlueSubClusterLogicalCPUs GlueHostProcessorOtherDescription | perl -p0e 's/\n //g'
GlueSubClusterLogicalCPUs: 1
GlueHostProcessorOtherDescription: Cores=6.23,Benchmark=12.53-HEP-SPEC06
$ ldapsearch -LLL -x -h hepgrid4:2170 -b o=grid GlueSubClusterUniqueID=hepgrid97.ph.liv.ac.uk GlueSubClusterLogicalCPUs GlueHostProcessorOtherDescription | perl -p0e 's/\n //g'
GlueSubClusterLogicalCPUs: 1381
GlueHostProcessorOtherDescription: Cores=6.23,Benchmark=12.53-HEP-SPEC06
$ bc -l
(1 + 1 + 1 + 1381) * 12.53

Giving 17341.52 HEPSPEC

Note: All 1384 slots are/were available to each CE to submit to, but the bulk is allocated to hepgrid97 for the purposes of power publishing only.
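
The same calculation can be written as a short Python sketch. The function name is hypothetical; the logical CPU counts and the benchmark are the published values shown in the ldapsearch output above.

```python
def site_power(logical_cpus_per_ce, benchmark):
    """Sum the published logical CPUs over all CEs and multiply by the benchmark."""
    return sum(logical_cpus_per_ce.values()) * benchmark


if __name__ == "__main__":
    published = {"hepgrid5": 1, "hepgrid6": 1, "hepgrid10": 1, "hepgrid97": 1381}
    print(round(site_power(published, 12.53), 2))
```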

The Setup with ARC/CONDOR

Application of Scaling Factors

There's an ARC “authplugin” script, called scaling_factors_plugin.py, that gets run when a job finishes. It normalises the accounting. It gets a MachineRalScaling value (which has been buried in an “errors” file; see “RalScaling” below), then parses the diag file, multiplying the run-times by the factor.

Also in ARC is a “jobreport_options” parameter that contains (e.g.) “benchmark_value:2500.00”. I assume this is the equivalent of the “GlueCECapability: CPUScalingReferenceSI00=2500” in the “Application of Scaling Factors” section above, i.e. it is in bogoSpecInt2k (250 * HEPSPEC). I assume that it represents the power of the reference node type, i.e. the power to which all the other nodes relate by way of their individual scaling factors.

The next thing considered is this RalScaling / MachineRalScaling mechanism. This is set in one of the config files on the WNs:

RalScaling = 2.14
STARTD_ATTRS = $(STARTD_ATTRS) RalScaling

It tells the node how powerful it is by setting a new variable with some arbitrary name. This goes on the ARC CE:

MachineRalScaling = "$$([ifThenElse(isUndefined(RalScaling), 1.00, RalScaling)])"
SUBMIT_EXPRS = $(SUBMIT_EXPRS) MachineRalScaling

This gets hold of the RalScaling variable on the WN, then passes it through via the SUBMIT_EXPRS parameter. It winds up in the “errors” file, which is then used in a normalisation script. Note that the scaling factor is applied to the workernode at build time by the set_node_parameters.pl script described in the Files section above.
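
The normalisation step can be sketched in Python as follows. This is not the scaling_factors_plugin.py script itself. The file formats are assumptions made for the example: I assume the “errors” file contains a line of the form MachineRalScaling = "2.14" and that the diag file holds run-times as lines such as WallTime=100s. The function names are hypothetical.

```python
import re

# Assumed diag line format: KEY=<seconds>s for these (assumed) run-time keys
TIME_LINE = re.compile(r"^(WallTime|UserTime|KernelTime)=([0-9]+(?:\.[0-9]+)?)s$")


def find_scaling(errors_text, default=1.0):
    """Pull the first MachineRalScaling value out of the text, if present."""
    match = re.search(r'MachineRalScaling\s*=\s*"?([0-9]+(?:\.[0-9]+)?)', errors_text)
    return float(match.group(1)) if match else default


def scale_diag(diag_text, factor):
    """Multiply the run-time fields by the factor; leave other lines alone."""
    out = []
    for line in diag_text.splitlines():
        match = TIME_LINE.match(line)
        if match:
            seconds = float(match.group(2)) * factor
            line = "%s=%ds" % (match.group(1), round(seconds))
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    factor = find_scaling('MachineRalScaling = "2.14"')
    print(scale_diag("WallTime=100s\nnodename=r21-n01", factor))
```

A missing scaling value falls back to 1.00, which matches the default used in the MachineRalScaling expression above.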


Transmit Total Power of a Site

At present, there is no mechanism for that as far as I know.

Tests and Testing

The following URL lists some critical tests for ATLAS, and the Liverpool site. You'll have to modify the site name.

http://dashb-atlas-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=serviceavl&time[]=last48&granularity[]=default&profile=ATLAS_CRITICAL&group=All+sites&site[]=UKI-NORTHGRID-LIV-HEP&flavour[]=All+Service+Flavours&flavour[]=ARC-CE&disabledFlavours=true

To check the UK job submission status:

http://bigpanda.cern.ch/dash/production/?cloudview=region&computingsite=*MCORE*#cloud_UK 

Defragmentation for multicore jobs

In this section, I discuss two approaches to defragmenting a cluster to make room for multi-core jobs.

DrainBoss

DrainBoss is a relatively new development. For details on the traditional set-up, see the next section.

Introduction to DrainBoss

If all jobs on a cluster require the same number of CPUs, e.g. all need one, or all need two etc., then you can simply load up each node with jobs until it is full. When one job ends, another can use its slot. But a problem occurs when you try to run jobs which vary in the number of CPUs they require. Consider a node that has (say) eight cores and is running eight single core jobs. One job ends first, and a slot becomes free. But let us say that the highest priority job in the queue is an eight core job. The newly freed slot is not wide enough to take it, so it has to wait. Should the scheduler use the slot for a waiting single core job, or hold it back until the other seven jobs end? If it holds jobs back, then resources are wasted. If it pops another single core job into running, then the multicore job has no prospect of ever running.

The solution that Condor provides to the problem has two rules: start multicore jobs in preference to single core jobs, and periodically drain down nodes so that a multicore job can fit on them. This is implemented using the Condor DEFRAG daemon. This has parameters, described in the section below, which control the way nodes are selected and drained for multicore jobs.

DrainBoss provides functionality for a similar approach but has the additional feature of a process controller, which senses the condition of the cluster and adjusts the way nodes are drained and put back into service in a way that provides a certain amount of predictability.

Process controller principles

A process controller provides a feedback control system. It measures some variable and compares it to some ideal value, called a setpoint, to find the error. It then corrects the process to try to bring the measured variable to the setpoint, eliminating the error. There are a large number of algorithms used to compute the correction, but DrainBoss makes use of the first two terms of the well-known Proportional Integral Derivative (PID) control algorithm, i.e. it's a PI controller. The proportional term sets the correction proportionally to the size of the error; the constant of proportionality is sometimes called the gain of the controller. This is sufficient for many fast acting processes, but any process involving the draining of compute nodes is likely to have a period of some hours or days. In this application, pure proportional control is too sensitive to time lags and the control would be very poor. Thus, the proportional term is used, but with a very low gain to damp down its sensitivity. The second term, integral action, is more important in this application. Integral action sums (i.e. integrates, hence the name) the size of the error over time and feeds that into the controller output as well. Thus, as the area under the error curve builds over time, the control output grows to offset it. This eventually overcomes the offset and returns the measured variable to the setpoint.
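
A generic PI controller can be sketched in a few lines of Python. This is a textbook sketch and not the code in drainBoss.py; the class name, the gains and the choice of measured variable are assumptions made for the example.

```python
class PIController(object):
    """Minimal proportional-integral controller (no derivative term)."""

    def __init__(self, setpoint, kp, ki):
        self.setpoint = setpoint
        self.kp = kp          # proportional gain, kept low for slow processes
        self.ki = ki          # integral gain
        self.integral = 0.0   # running sum of error * time

    def update(self, measured, dt):
        """Return the control output for one sample taken dt time units apart."""
        error = self.setpoint - measured
        self.integral += error * dt
        return self.kp * error + self.ki * self.integral


if __name__ == "__main__":
    # Hypothetical use: hold some measured quantity at a setpoint of 10
    controller = PIController(setpoint=10, kp=0.1, ki=0.5)
    for measured in (6, 6, 8, 10):
        print(controller.update(measured, dt=1.0))
```

With a persistent error, the integral term keeps growing from sample to sample even though the proportional term stays constant, which is the behaviour that eventually removes the offset.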

Application

There are a few particulars to this application that affect the design of the controller.

First, the prime objectives of the system are to maximise the usage of the cluster and get good throughput of both single-core and multicore jobs. A good controller might be able to achieve this but there are a few problems to deal with.

  • Minimal negative corrections: To achieve control, the controller usually only puts more nodes into the drain state. With one exception, it never stops nodes draining; once a drain starts, it usually completes. The reason for this policy is that drains represent a cost to the system, and cancelling throws away any achievement made from the draining. Just because there are few multicore jobs in the queue at present doesn't mean some might not crop up at any time, so cancelling drains could easily be premature. Instead, the nodes are left to drain out and are put back into service, just in case a multicore job comes along and needs the slot. The only exception to this rule is when there are no multicore jobs but some single core jobs in the queue. In this case, the single core jobs are potentially being held back for no reason. In this unique case, all draining is immediately cancelled to allow the single core jobs to run.
  • Traffic problems: On a cluster, there is no guarantee that a constant supply of multicore jobs and single core jobs is available. There could be periods when the queue is depleted of one or both types of work. The controller deals with these issues in the best way it can using these rules. If there are no multicore jobs queued, then it's pointless to start draining any systems because there are no jobs to fill the resulting wide slots. Also, if there are no multicore jobs but some single core jobs are queued, then the controller cancels the on-going drains to let the single core jobs run, otherwise the jobs would be held back for no valid reason. The truth table below shows the simple picture.


Queue state
mc jobs queued           no    yes   no    yes
sc jobs queued           no    no    yes   yes
Actions
start drain if nec.      no    yes   no    yes
cancel on-going drains   no    no    yes   no
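
The truth table reduces to two boolean expressions, sketched here in Python. The function name is hypothetical and this is not taken from drainBoss.py.

```python
def drain_actions(mc_queued, sc_queued):
    """Return (start_drain_if_needed, cancel_ongoing_drains) for a queue state."""
    # Only start draining when there are multicore jobs to fill the wide slots
    start_drain = mc_queued
    # Only cancel drains when single core jobs wait and no multicore jobs do
    cancel_drains = sc_queued and not mc_queued
    return start_drain, cancel_drains


if __name__ == "__main__":
    for mc in (False, True):
        for sc in (False, True):
            print(mc, sc, drain_actions(mc, sc))
```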

Tuning

Tuning was done entirely by hand, although there are technical ways to tune the system more accurately that I hope to research in future.

Current status

blah


Download

The DrainBoss controller is available here:

drainBoss.py

The DEFRAG daemon

This is the traditional approach to defragmentation, used in the initial version of the example build of an ARC/Condor cluster. It uses the DEFRAG daemon that comes with Condor. To configure this set-up, you need to edit the condor_config.local file on the server, and create a script, set_defrag_parameters.sh, to control the amount of defragging. The script is operated by a cron job. Full details on this configuration are given in the section on server files, above. The meaning of some important parameters used to control the DEFRAG daemon is discussed next.

  • DEFRAG_INTERVAL – How often the daemon evaluates defrag status and sets systems draining.
  • DEFRAG_REQUIREMENTS – Only machines that fit these requirements will start to drain.
  • DEFRAG_DRAINING_MACHINES_PER_HOUR – Only this many machines will be set off draining each hour.
  • DEFRAG_MAX_WHOLE_MACHINES – Don't start any draining if you already have this many whole machines.
  • DEFRAG_MAX_CONCURRENT_DRAINING – Never drain more than this many machines at once.
  • DEFRAG_RANK – This allows you to prefer some machines over others to drain.
  • DEFRAG_WHOLE_MACHINE_EXPR – This defines whether a certain machine is whole or not.
  • DEFRAG_CANCEL_REQUIREMENTS – Draining will be stopped when a draining machine matches these requirements.
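
To show how these limits interact, here is a simplified Python sketch of one evaluation cycle. It is my own reading of the parameter descriptions above and is not the algorithm used inside the DEFRAG daemon; the function name and arguments are hypothetical.

```python
def machines_to_drain(candidates, whole, draining, max_whole,
                      max_concurrent, per_hour, interval_seconds):
    """Pick machines to start draining in one evaluation cycle.

    candidates -- list of (name, rank) pairs for machines that already
                  satisfy the requirements and are not yet draining
    whole      -- number of machines that currently count as whole
    draining   -- number of machines that are currently draining
    """
    if whole >= max_whole:
        return []  # enough whole machines already, so start no new drains
    # Share of the hourly budget that falls within one interval
    budget = int(per_hour * interval_seconds / 3600.0)
    # Room left under the limit on concurrent drains
    room = max(0, max_concurrent - draining)
    limit = min(budget, room)
    # Prefer the machines with the highest rank
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return [name for name, _ in ranked[:limit]]


if __name__ == "__main__":
    nodes = [("r21-n01", 1), ("r21-n02", 5), ("r21-n03", 3)]
    print(machines_to_drain(nodes, whole=0, draining=1, max_whole=4,
                            max_concurrent=2, per_hour=6, interval_seconds=600))
```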

Note: The meaning of the ClassAds and parameters used to judge the fragmentation state of a machine is Byzantine in its complexity. The following definitions have been learned from experience.

The multicore set-up in CONDOR makes use of the idea of an abstract Partitionable Slot (PSlot) that can't run jobs but contains real slots of various sizes that can. In our set-up, every node has a single PSlot on it. Smaller "real" slots are made from it, each with either a single simultaneous thread of execution (a unislot) or 8 unislots for multicore jobs. The table below shows the meaning of some ClassAds used to express the usage of a node that is currently running seven single core jobs (I think it's taken from an E5620 CPU).

The ClassAds in the first column (PSlot) have the following meanings. DetectedCpus shows that the node has 16 hyper-threads in total; this is the hardware limit for truly concurrent threads. The next row, TotalSlotCpus, shows the size of the PSlot on this node. In this case, only 10 unislots can ever be used for jobs, leaving 6 unislots unused (note: it has been found that total throughput does not increase even if all the unislots are used, so it is not inefficient to leave 6 unislots unused). Next, TotalSlots is equal to 8 in this case, which represents the total of all the used unislots in the sub slots, plus 1 to represent the PSlot. A value of 8 shows that this PSlot currently has seven of its unislots used by sub slots, and three unused. These could be used to make new sub slots to run jobs in. The last ClassAd, Cpus, represents the usable unislots in the PSlot that are left over (i.e. 3).

With respect to the sub slot columns, the DetectedCpus and TotalSlotCpus values can be ignored as they are always the same. Both TotalSlots and Cpus in the sub slot columns represent how many unislots are in this sub slot.

It's as clear as mud, isn't it? But my experiments show it is consistent.

ClassAd         PSlot                                       Sub slot (each of the seven)                 Empty (3 unislots)
DetectedCpus    How many hyper-threads, e.g. 16             Ignore                                       Empty
TotalSlotCpus   How many CPUs can be used, e.g. 10          Ignore                                       Empty
TotalSlots      Total of main plus all sub slots, e.g. 8    How many unislots in this sub slot, e.g. 1   Empty
Cpus            Usable unislots left over, e.g. 3           As above. Always the same.                   Empty
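
The bookkeeping for the PSlot can be checked with a small Python sketch. The function name is hypothetical, and the code only encodes my interpretation of the ClassAds as given above; it does not query Condor.

```python
def pslot_classads(detected_cpus, total_slot_cpus, sub_slot_sizes):
    """Return the PSlot ClassAd values as interpreted in the text above."""
    used = sum(sub_slot_sizes)
    return {
        "DetectedCpus": detected_cpus,
        "TotalSlotCpus": total_slot_cpus,
        # One entry for the PSlot itself plus one per sub slot; with
        # single core sub slots this equals the used unislots plus 1
        "TotalSlots": 1 + len(sub_slot_sizes),
        # Unislots still available for making new sub slots
        "Cpus": total_slot_cpus - used,
    }


if __name__ == "__main__":
    # The node in the table: 16 hyper-threads, 10 usable, 7 single core jobs
    print(pslot_classads(16, 10, [1] * 7))
```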


Further Work

blah blah blah


Also see