Example Build of an ARC/Condor Cluster
Introduction
A multi-core job is one which needs to use more than one processor on a node. Until recently, multi-core jobs have not been used much on the grid infrastructure. This has all changed because ATLAS and other large users have now asked sites to enable multi-core on their clusters.
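From the batch system's point of view, a multi-core job is simply one that requests more than one CPU. As a concrete illustration, a minimal HTCondor submit description for an 8-core job might look like the following sketch (the payload script and file names here are invented for illustration, not part of the Liverpool setup):

```shell
# Illustrative sketch: write a submit file for an 8-core job.
# "mc_payload.sh" and the log file names are invented examples.
cat > mc-test.sub <<'EOF'
universe     = vanilla
executable   = mc_payload.sh
request_cpus = 8
log          = mc-test.log
output       = mc-test.out
error        = mc-test.err
queue
EOF

# On a real cluster you would then submit it with:
# condor_submit mc-test.sub
```

The `request_cpus = 8` line is what makes this a multi-core job; everything discussed below exists to let the batch system find 8 free cores on one node for such requests.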
Unfortunately, it is not just a simple task of setting some parameter on the head node and sitting back while jobs arrive. Different grid systems have varying levels of support for multi-core, ranging from non-existent to virtually full support.
This report discusses the multi-core configuration at Liverpool. We decided to build a test cluster using one of the most capable batch systems currently available, called HTCondor (or condor for short). We also decided to front the system with an ARC/Condor CE.
I thank Andrew Lahiff at RAL for the initial configuration and for much help and many suggestions. Links to some of Andrew's material are in the “See Also” section.
Infrastructure/Fabric
The multicore test cluster consists of an SL6 headnode to run the ARC CE and the Condor batch system. The headnode has a dedicated set of N workernodes of various types, providing a total of 352 single threads of execution.
Head Node
The headnode is a virtual system running on KVM.
Host Name | OS | CPUs | RAM | Disk Space
---|---|---|---|---
hepgrid2.ph.liv.ac.uk | SL6.4 | 5 | 3 GB | 35 GB
Worker nodes
The physical workernodes are described below.
Node names | CPU type | OS | RAM | Disk Space | CPUs per node | Slots used per CPU | Slots used per node | Total nodes | Total CPUs | Total slots | HEPSPEC per slot | Total HEPSPEC
---|---|---|---|---|---|---|---|---|---|---|---|---
r21-n01 to n04 | E5620 | SL6.4 | 24 GB | 1.5 TB | 2 | 5 | 10 | 4 | 8 | 40 | 12.05 | 482
r26-n05 to n11 | L5420 | SL6.4 | 16 GB | 1.7 TB | 2 | 4 | 8 | 7 | 14 | 56 | 8.86 | 502
Software Builds and Configuration
There are a few particulars of the Liverpool site that I want to get out of the way to start with. For the initial installation of an operating system on our head nodes and worker nodes, we use tools developed at Liverpool (BuildTools) based on Kickstart, NFS, TFTP and DHCP. The source (synctool.pl and linktool.pl) can be obtained from sjones@hep.ph.liv.ac.uk. Alternatively, similar functionality is said to exist in the Cobbler suite, which is released as open source, and some sites have based their initial install on that. Once the OS is installed, the first reboot starts Puppet to give a personality to the node. Puppet is becoming something of a de-facto standard in its own right, so I'll use Puppet terminology within this document where some explanation of a particular feature is needed.
Special Software Control Measures
The software for the installation is all contained in various yum repositories. Here at Liverpool, we maintain two mirrored copies of the yum material. One of them, the online repository, is mirrored daily from the internet; it is not used for any installation. The other copy, termed the local repository, is used to take a snapshot of the online repository when necessary. Installations are done from the local repository. Thus we maintain precise control over the software we use.
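The snapshot step itself can be very cheap. As one possible sketch (the paths and package name below are invented for illustration, not our actual layout), a hard-linked copy freezes the package set of the online mirror at a moment of your choosing without duplicating disk space:

```shell
# Illustrative sketch only: snapshot an "online" mirror into a "local"
# repository using hard links. All paths and file names are invented.
ONLINE=./online-repo      # refreshed daily from the internet, never installed from
LOCAL=./local-repo        # frozen copy that installations actually use

# Stand-in for the daily mirror step
mkdir -p "$ONLINE"
echo "dummy" > "$ONLINE/dummy-package-1.0.rpm"

# Take the snapshot: remove the old local copy and hard-link the new one.
# Hard links cost almost no space, and the local tree no longer changes
# when the online mirror is refreshed.
rm -rf "$LOCAL"
cp -al "$ONLINE" "$LOCAL"
```

After a snapshot like this, pointing the node yum configuration at the local tree (and re-running `createrepo` if metadata needs regenerating) gives the precise control described above.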
We'll start with the headnode and "work down" so to speak.
Head Node
Yum repos
Notwithstanding the special measures at Liverpool for software control, this table shows the origin of the software release via yum repositories.
Standard build
The basis for the initial build follows the standard model for any grid node at Liverpool. I won't explain that in detail – each site is likely to have its own standard, which generally comprises all the components used to build any grid node (such as a CE, ARGUS, BDII, TORQUE etc.) prior to any middleware. Such a baseline build might include networking, iptables, nagios scripts, ganglia, ssh etc. On the Liverpool build, I disable automatic yum updates.
Extra Directories
I had to make these specific ones myself:

/etc/arc/runtime/ENV
/etc/lcmaps/
/root/scripts
/root/glitecfg/services
/var/spool/arc/debugging
/var/spool/arc/jobstatus
/var/spool/arc/grid
Additional Packages
These packages were needed to add the required middleware, i.e. ARC, Condor and ancillary material.
Package | Description |
---|---|
nordugrid-arc-compute-element | The ARC CE middleware |
condor | HTCondor, the main batch server package, version 8.2.2 |
apel-client | Accounting; ARC/Condor bypasses the APEL server and publishes directly |
ca_policy_igtf-classic | Certificates |
lcas-plugins-basic | Security |
lcas-plugins-voms | Security |
lcas | Security |
lcmaps | Security |
lcmaps-plugins-basic | Security |
lcmaps-plugins-c-pep | Security |
lcmaps-plugins-verify-proxy | Security |
lcmaps-plugins-voms | Security |
globus-ftp-control | Extra packages for Globus |
globus-gsi-callback | Extra packages for Globus |
VomsSnooper | VOMS helper, used to set up the LSC (list of certificates) files |
glite-yaim-core | YAIM; only used to make accounts |
yum-plugin-priorities.noarch | Helpers for yum |
yum-plugin-protectbase.noarch | Helpers for yum |
yum-utils | Helpers for yum |
Files
The following set of files was additionally installed. Some of them are empty. Some can be used as they are; others have to be edited to fit your site. Any that is a script must have executable permissions (e.g. 755).
- File: /root/scripts/set_defrag_parameters.sh
- Notes: This script senses changes to the running and queued job load, and sets parameters related to defragmentation. This allows the cluster to support a load consisting of both multicore and singlecore jobs. It also has a section to cancel draining nodes if enough CPUs have been obtained. This may not be needed with Condor 8.2.2.
- Customise: Yes. You'll need to edit it to suit your site.
- Content:
#!/bin/bash

# Change condor_defrag daemon parameters depending on how many
# queued and running multicore jobs there are.

############################
# PART 1 - STARTING DRAINS #
############################

function setDefrag () {
  # Get the address of the defrag daemon
  defrag_address=$(condor_status -any -autoformat MyAddress -constraint 'MyType =?= "Defrag"')

  # Log
  echo "Setting DEFRAG_MAX_CONCURRENT_DRAINING=$3, DEFRAG_DRAINING_MACHINES_PER_HOUR=$4, DEFRAG_MAX_WHOLE_MACHINES=$5 (queued multicore=$1, running multicore=$2)"

  # Set configuration
  /usr/bin/condor_config_val -address "$defrag_address" -rset "DEFRAG_MAX_CONCURRENT_DRAINING = $3" >& /dev/null
  /usr/bin/condor_config_val -address "$defrag_address" -rset "DEFRAG_DRAINING_MACHINES_PER_HOUR = $4" >& /dev/null
  /usr/bin/condor_config_val -address "$defrag_address" -rset "DEFRAG_MAX_WHOLE_MACHINES = $5" >& /dev/null
  /usr/sbin/condor_reconfig -daemon defrag >& /dev/null
}

queued_mc_jobs=$(condor_q -global -constraint 'RequestCpus == 8 && JobStatus == 1' -autoformat ClusterId | wc -l)
queued_sc_jobs=$(condor_q -global -constraint 'RequestCpus == 1 && JobStatus == 1' -autoformat ClusterId | wc -l)
running_mc_jobs=$(condor_q -global -constraint 'RequestCpus == 8 && JobStatus == 2' -autoformat ClusterId | wc -l)
running_sc_jobs=$(condor_q -global -constraint 'RequestCpus == 1 && JobStatus == 2' -autoformat ClusterId | wc -l)

# Rules in English
# If there are no single core jobs queued, then defrag like mad.
#   (reason: only possibility is multi core, so defrag as much
#   as you can.)
#
# If only single core jobs, then defrag nothing.
#   (reason: why bother?)
#
# If both single core and multi core queued, then let's
# apply some logic. If I have less than 6 multis in the
# queue, don't do much defragging.
#
# If I have more than 6 multis, and less than 5 singles,
# defrag a lot.
#
# If I have more than 6 multis and more than 5 singles,
# defrag a bit.
#
# Note DMH * DEFRAG_INTERVAL / 3600 _must_ equate to more than 1

if [ $queued_sc_jobs -eq 0 ]
then
  # No sc jobs, so total defrag!            MCD DMH MWM
  setDefrag $queued_mc_jobs $running_mc_jobs 11 11 11
else
  if [ $queued_sc_jobs -gt 0 ] && [ $queued_mc_jobs -eq 0 ]
  then
    # Hardly defrag at all (as close to zero as poss)
    setDefrag $queued_mc_jobs $running_mc_jobs 1 1 1
  else
    # There are some sc and some mc jobs; a mix
    if [ $queued_mc_jobs -gt 6 ]
    then
      if [ $running_mc_jobs -lt 5 ]
      then
        # Lots of mc, few sc so defrag a lot
        setDefrag $queued_mc_jobs $running_mc_jobs 3 2 6
      else
        # Lots of mc, lots of sc too, so defrag a bit
        setDefrag $queued_mc_jobs $running_mc_jobs 2 1 4
      fi
    else
      # Less than 6 mc, so hardly defrag at all
      setDefrag $queued_mc_jobs $running_mc_jobs 1 1 1
    fi
  fi
fi

# Original, for AL, RAL
# <=20 queued mc jobs                          MCD=1 , DMH=1 , MWM=4
# > 20 queued mc jobs, <  190 running mc jobs, MCD=60, DMH=40, MWM=300
# > 20 queued mc jobs, >= 190 running mc jobs, MCD=8 , DMH=8 , MWM=300
# English:
#   If there are hardly any mc jobs waiting, then
#     Hardly do any defragging
#   else
#     If there are only a few mc jobs already running then
#       Defrag a hell of a lot
#     Else
#       Defrag quite a lot
#
# Note for future reference; DEFRAG_INTERVAL = 3601
# Note that there is some seriously stupid code around here.
# 1) Bug https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3199
#    DEFRAG_DRAINING_MACHINES_PER_HOUR * DEFRAG_INTERVAL / 3600 _must_ equate to more than 1, else no defrag occurs.
# 2) New bug
#    DEFRAG_DRAINING_MACHINES_PER_HOUR must be more than 1.0. If I set it to (say) 0.5, 12 defrags occur at midnight.

############################
# PART 2 - STOPPING DRAINS #
############################

# Get draining nodes
for dn in `condor_status | grep Drained | sed -e "s/.*@//" -e "s/\..*//"`; do
  slot1=0
  condor_status -long $dn | while read line; do
    # Toggle if slot1@ (not slot1_...). slot1@ lists the empty (i.e. drained) total
    if [[ $line =~ ^Name.*slot1@.*$ ]]; then
      slot1=1
    fi
    if [[ $line =~ ^Name.*slot1_.*$ ]]; then
      slot1=0
    fi
    if [ $slot1 == 1 ]; then
      if [[ $line =~ ^Cpus\ \=\ (.*)$ ]]; then
        # We must capture empty/drained total
        cpus="${BASH_REMATCH[1]}"
        if [ $cpus -ge 8 ]; then
          # We have enough already. Pointless waiting longer.
          echo Cancel drain of $dn, as we have $cpus free already
          condor_drain -cancel $dn
        fi
      fi
    fi
  done
done
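The script only adjusts the defrag daemon at the moment it runs, so it needs to be invoked periodically. The schedule is a site choice; a cron entry along the following lines (the five-minute interval and log path are assumptions for illustration, not mandated by the script) would do:

```
# /etc/cron.d/set_defrag_parameters (illustrative)
*/5 * * * * root /root/scripts/set_defrag_parameters.sh >> /var/log/set_defrag_parameters.log 2>&1
```

Whatever interval is chosen, it should be short relative to DEFRAG_INTERVAL so that the defrag daemon's parameters track the queue mix reasonably closely.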
- File: /etc/arc.conf
- Notes: The main configuration file of the ARC CE. It adds support for scaling factors, APEL reporting, ARGUS Mapping, BDII publishing (power and scaling), multiple VO support, and default limits.
- Customise: Yes. You'll need to edit it to suit your site.
- Content:
[common]
x509_user_key="/etc/grid-security/hostkey.pem"
x509_user_cert="/etc/grid-security/hostcert.pem"
x509_cert_dir="/etc/grid-security/certificates"
gridmap="/etc/grid-security/grid-mapfile"
lrms="condor"

[grid-manager]
enable_emies_interface="yes"
arex_mount_point="https://hepgrid2.ph.liv.ac.uk:443/arex"
user="root"
controldir="/var/spool/arc/jobstatus"
sessiondir="/var/spool/arc/grid"
runtimedir="/etc/arc/runtime"
logfile="/var/log/arc/grid-manager.log"
pidfile="/var/run/grid-manager.pid"
joblog="/var/log/arc/gm-jobs.log"
shared_filesystem="no"
authplugin="PREPARING timeout=60,onfailure=pass,onsuccess=pass /usr/local/bin/default_rte_plugin.py %S %C %I ENV/GLITE"
#authplugin="PREPARING timeout=60,onfailure=pass,onsuccess=pass /bin/sleep 30 "
authplugin="FINISHING timeout=60,onfailure=pass,onsuccess=pass /usr/local/bin/scaling_factors_plugin.py %S %C %I"
# This copies the files containing useful output from completed jobs into a directory /var/spool/arc/debugging
#authplugin="FINISHED timeout=60,onfailure=pass,onsuccess=pass /usr/local/bin/debugging_rte_plugin.py %S %C %I"
mail="root@hep.ph.liv.ac.uk"
jobreport="APEL:http://mq.cro-ngi.hr:6162"
jobreport_options="urbatch:1000,archiving:/var/run/arc/urs,topic:/queue/global.accounting.cpu.central,gocdb_name:UKI-NORTHGRID-LIV-HEP,use_ssl:true,Network:PROD,benchmark_type:Si2k,benchmark_value:2500.00"
jobreport_credentials="/etc/grid-security/hostkey.pem /etc/grid-security/hostcert.pem /etc/grid-security/certificates"
jobreport_publisher="jura"
# Disable (4 year period!)
jobreport_period=126144000

[gridftpd]
user="root"
debug="1"
logfile="/var/log/arc/gridftpd.log"
pidfile="/var/run/gridftpd.pid"
port="2811"
allowunknown="yes"
globus_tcp_port_range="20000,24999"
globus_udp_port_range="20000,24999"
#
# Notes:
#
# The first two args are implicitly given to arc-lcmaps, and are
#   argv[1] - the subject/DN
#   argv[2] - the proxy file
#
# The remaining attributes are explicit, after the "lcmaps" field in the examples below.
#   argv[3] - lcmaps_library
#   argv[4] - lcmaps_dir
#   argv[5] - lcmaps_db_file
#   argv[6 etc.] - policynames
#
# lcmaps_dir and/or lcmaps_db_file may be '*', in which case they are
# fully truncated (placeholders).
#
# Some logic is applied. If the lcmaps_library is not specified with a
# full path, it is given the path of the lcmaps_dir. We have to assume that
# the lcmaps_dir is a poor name for that field, as discussed in the following
# examples.
#
# Examples:
# In this example, used at RAL, the liblcmaps.so is given no
# path, so it is assumed to exist in /usr/lib64 (note the poorly
# named field - the lcmaps_dir is populated by a library path.)
#
# Fieldnames:      lcmaps_lib lcmaps_dir lcmaps_db_file policy
#unixmap="* lcmaps liblcmaps.so /usr/lib64 /usr/etc/lcmaps/lcmaps.db arc"
#
# In the next example, used at Liverpool, lcmaps_lib is fully qualified. Thus
# the lcmaps_dir is not used (although it does set the LCMAPS_DIR env var).
# In this case, the lcmaps_dir really does contain the lcmaps dir location.
#
# Fieldnames:      lcmaps_lib lcmaps_dir lcmaps_db_file policy
unixmap="* lcmaps /usr/lib64/liblcmaps.so /etc/lcmaps lcmaps.db arc"
unixmap="arcfailnonexistentaccount:arcfailnonexistentaccount all"

[gridftpd/jobs]
path="/jobs"
plugin="jobplugin.so"
allownew="yes"

[infosys]
user="root"
overwrite_config="yes"
port="2135"
debug="1"
registrationlog="/var/log/arc/inforegistration.log"
providerlog="/var/log/arc/infoprovider.log"
provider_loglevel="2"
infosys_glue12="enable"
infosys_glue2_ldap="enable"

[infosys/glue12]
resource_location="Liverpool, UK"
resource_longitude="-2.964"
resource_latitude="53.4035"
glue_site_web="http://www.gridpp.ac.uk/northgrid/liverpool"
glue_site_unique_id="UKI-NORTHGRID-LIV-HEP"
cpu_scaling_reference_si00="2562"
processor_other_description="Cores=4.36,Benchmark=10.25-HEP-SPEC06"
provide_glue_site_info="false"

[infosys/admindomain]
name="UKI-NORTHGRID-LIV-HEP"

# infosys view of the computing cluster (service)
[cluster]
name="hepgrid2.ph.liv.ac.uk"
localse="hepgrid11.ph.liv.ac.uk"
cluster_alias="hepgrid2 (UKI-NORTHGRID-LIV-HEP)"
comment="UKI-NORTHGRID-LIV-HEP Main Grid Cluster"
homogeneity="True"
nodecpu="Xeon"
architecture="x86_64"
nodeaccess="inbound"
nodeaccess="outbound"
#opsys="SL64"
opsys="ScientificSL"
nodememory="16000"
authorizedvo="alice"
authorizedvo="atlas"
authorizedvo="lhcb"
authorizedvo="gridpp"
authorizedvo="cms"
authorizedvo="ops"
authorizedvo="dteam"
benchmark="SPECINT2000 2240"
benchmark="SPECFP2000 2240"
totalcpus=56

[queue/grid]
name="grid"
homogeneity="True"
comment="Default queue"
nodecpu="adotf"
architecture="adotf"
defaultmemory=2000
maxrunning=1400
totalcpus=56
maxuserrun=1400
maxqueuable=2800
- File: /etc/arc/runtime/ENV/GLITE
- Notes: Implements the GLITE run time environment.
- Customise: Yes. You'll need to edit it to suit your site.
- Content:
#!/bin/sh
if [ "x$1" = "x0" ]; then
  # Set environment variable containing queue name
  env_idx=0
  env_var="joboption_env_$env_idx"
  while [ -n "${!env_var}" ]; do
    env_idx=$((env_idx+1))
    env_var="joboption_env_$env_idx"
  done
  eval joboption_env_$env_idx="NORDUGRID_ARC_QUEUE=$joboption_queue"
  export RUNTIME_ENABLE_MULTICORE_SCRATCH=1
fi

if [ "x$1" = "x1" ]; then
  # Set grid environment
  if [ -e /etc/profile.d/env.sh ]; then
    source /etc/profile.d/env.sh
  fi
  if [ -e /etc/profile.d/zz-env.sh ]; then
    source /etc/profile.d/zz-env.sh
  fi
  export LD_LIBRARY_PATH=/opt/xrootd/lib

  # Set basic environment variables
  export GLOBUS_LOCATION=/usr
  HOME=`pwd`
  export HOME
  USER=`whoami`
  export USER
  HOSTNAME=`hostname -f`
  export HOSTNAME
fi

export VO_ALICE_SW_DIR=/opt/exp_soft_sl5/alice
export VO_ATLAS_SW_DIR=/cvmfs/atlas.cern.ch/repo/sw
export VO_LHCB_SW_DIR=/cvmfs/lhcb.cern.ch
export VO_GRIDPP_SW_DIR=/opt/exp_soft_sl5/gridpp
export VO_CMS_SW_DIR=/opt/exp_soft_sl5/cms
export VO_OPS_SW_DIR=/opt/exp_soft_sl5/ops
export VO_DTEAM_SW_DIR=/opt/exp_soft_sl5/dteam
export RUCIO_HOME=/cvmfs/atlas.cern.ch/repo/sw/ddm/rucio-clients/0.1.12
export RUCIO_AUTH_TYPE=x509_proxy
- File: /etc/condor/config.d/14accounting-groups-map.config
- Notes: Implements accounting groups, so that fairshares can be used that refer to whole groups of users, instead of individual ones.
- Customise: Yes. You'll need to edit it to suit your site.
- Content:
# Liverpool Tier-2 HTCondor configuration: accounting groups

# Primary group
# Assign individual test submitters into the HIGHPRIO group,
# else just assign job into primary group of its VO
LivAcctGroup = ifThenElse(regexp("sgmatl34",Owner), "group_HIGHPRIO", \
               ifThenElse(regexp("sgmops11",Owner), "group_HIGHPRIO", \
               strcat("group_",toUpper(x509UserProxyVOName))))

# Subgroup
# For the subgroup, just assign job to the group of the owner
# (i.e. owner name less all those digits at the end).
# Also show whether multi or single core.
LivAcctSubGroup = strcat(regexps("([A-Za-z0-9]+[A-Za-z])\d+", Owner, "\1"),ifThenElse(RequestCpus > 1,"_mcore","_score"))

# Now build up the whole accounting group
AccountingGroup = strcat(LivAcctGroup, ".", LivAcctSubGroup, ".", Owner)

# Add these ClassAd specifications to the submission expressions
SUBMIT_EXPRS = $(SUBMIT_EXPRS) LivAcctGroup, LivAcctSubGroup, AccountingGroup
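To see what these expressions actually produce, it helps to trace one hypothetical job through them. The following sketch mimics the ClassAd logic in plain shell for an invented owner "pilatl03" in VO "atlas" running an 8-core job (the owner and VO names are made up for illustration):

```shell
# Illustrative only: reproduce the accounting-group construction above
# for a hypothetical owner/VO/core-count. None of these names are real.
owner="pilatl03"
vo="atlas"
request_cpus=8

# LivAcctGroup: "group_" + upper-cased VO name (ignoring the HIGHPRIO cases)
group="group_$(echo "$vo" | tr '[:lower:]' '[:upper:]')"

# LivAcctSubGroup: owner name minus trailing digits, plus _mcore/_score
stem=$(echo "$owner" | sed -E 's/([A-Za-z0-9]+[A-Za-z])[0-9]+$/\1/')
if [ "$request_cpus" -gt 1 ]; then suffix="_mcore"; else suffix="_score"; fi

# AccountingGroup: group.subgroup.owner
echo "${group}.${stem}${suffix}.${owner}"   # → group_ATLAS.pilatl_mcore.pilatl03
```

So all pilot accounts of one VO land in the same subgroup, with multicore and singlecore usage accounted separately.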
- File: /etc/condor/config.d/11fairshares.config
- Notes: Implements fair share settings, relying on groups of users.
- Customise: Yes. You'll need to edit it to suit your site.
- Content:
# Liverpool Tier-2 HTCondor configuration: fairshares

# use this to stop jobs from starting.
# CONCURRENCY_LIMIT_DEFAULT = 0

# Half-life of user priorities
PRIORITY_HALFLIFE = 259200

# Handle surplus
GROUP_ACCEPT_SURPLUS = True
GROUP_AUTOREGROUP = True

# Weight slots using CPUs
#NEGOTIATOR_USE_SLOT_WEIGHTS = True

# See: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3271
NEGOTIATOR_ALLOW_QUOTA_OVERSUBSCRIPTION = False

# Calculate the surplus allocated to each group correctly
NEGOTIATOR_USE_WEIGHTED_DEMAND = True

# Group names
GROUP_NAMES = \
  group_HIGHPRIO, \
  group_ALICE, \
  group_ATLAS, \
  group_HYPERK_ORG, \
  group_BIOMED, \
  group_CALICE, \
  group_CAMONT, \
  group_CDF, \
  group_CMS, \
  group_DTEAM, \
  group_DZERO, \
  group_ESR, \
  group_FUSION, \
  group_GEANT4, \
  group_HONE, \
  group_GRIDPP, \
  group_ILC, \
  group_LHCB, \
  group_MAGIC, \
  group_EPIC_VO_GRIDPP_AC_UK, \
  group_MICE, \
  group_OPS, \
  group_PHENO, \
  group_PLANCK, \
  group_CERNATSCHOOL_ORG, \
  group_T2K_ORG, \
  group_NEISS_ORG_UK, \
  group_ZEUS, \
  group_VO_NORTHGRID_AC_UK, \
  group_VO_SIXT_CERN_CH, \
  group_SNOPLUS_SNOLAB_CA, \
  group_NA62_VO_GRIDPP_AC_UK

# Fairshares
GROUP_QUOTA_DYNAMIC_group_HIGHPRIO = 0.05
#GROUP_QUOTA_DYNAMIC_group_DTEAM = 0.90
#GROUP_QUOTA_DYNAMIC_group_GRIDPP = 0.10
GROUP_QUOTA_DYNAMIC_group_ALICE = 0.05
GROUP_QUOTA_DYNAMIC_group_ATLAS = 0.65
GROUP_QUOTA_DYNAMIC_group_HYPERK_ORG = 0.01
GROUP_QUOTA_DYNAMIC_group_BIOMED = 0.01
GROUP_QUOTA_DYNAMIC_group_CALICE = 0.01
GROUP_QUOTA_DYNAMIC_group_CAMONT = 0.01
GROUP_QUOTA_DYNAMIC_group_CDF = 0.01
GROUP_QUOTA_DYNAMIC_group_CMS = 0.01
GROUP_QUOTA_DYNAMIC_group_DTEAM = 0.01
GROUP_QUOTA_DYNAMIC_group_DZERO = 0.01
GROUP_QUOTA_DYNAMIC_group_ESR = 0.01
GROUP_QUOTA_DYNAMIC_group_FUSION = 0.01
GROUP_QUOTA_DYNAMIC_group_GEANT4 = 0.01
GROUP_QUOTA_DYNAMIC_group_HONE = 0.01
GROUP_QUOTA_DYNAMIC_group_GRIDPP = 0.01
GROUP_QUOTA_DYNAMIC_group_ILC = 0.01
GROUP_QUOTA_DYNAMIC_group_LHCB = 0.20
GROUP_QUOTA_DYNAMIC_group_MAGIC = 0.01
GROUP_QUOTA_DYNAMIC_group_EPIC_VO_GRIDPP_AC_UK = 0.01
GROUP_QUOTA_DYNAMIC_group_MICE = 0.01
GROUP_QUOTA_DYNAMIC_group_OPS = 0.01
GROUP_QUOTA_DYNAMIC_group_PHENO = 0.01
GROUP_QUOTA_DYNAMIC_group_PLANCK = 0.01
GROUP_QUOTA_DYNAMIC_group_CERNATSCHOOL_ORG = 0.01
GROUP_QUOTA_DYNAMIC_group_T2K_ORG = 0.01
GROUP_QUOTA_DYNAMIC_group_NEISS_ORG_UK = 0.01
GROUP_QUOTA_DYNAMIC_group_ZEUS = 0.01
GROUP_QUOTA_DYNAMIC_group_VO_NORTHGRID_AC_UK = 0.01
GROUP_QUOTA_DYNAMIC_group_VO_SIXT_CERN_CH = 0.01
GROUP_QUOTA_DYNAMIC_group_SNOPLUS_SNOLAB_CA = 0.01
GROUP_QUOTA_DYNAMIC_group_NA62_VO_GRIDPP_AC_UK = 0.01
- File: /etc/condor/pool_password
- Notes: Will have its own section (TBD)
- Customise: Yes.
- Content: the section below is taken from the HTCondor system administrator's guide and explains how to generate this file.

Password Authentication. The password method provides mutual authentication through the use of a shared secret. This is often a good choice when strong security is desired, but an existing Kerberos or X.509 infrastructure is not in place. Password authentication is available on both Unix and Windows. It currently can only be used for daemon-to-daemon authentication. The shared secret in this context is referred to as the pool password. Before a daemon can use password authentication, the pool password must be stored on the daemon's local machine. On Unix, the password will be placed in a file defined by the configuration variable SEC_PASSWORD_FILE. This file will be accessible only by the UID that HTCondor is started as. On Windows, the same secure password store that is used for user passwords will be used for the pool password (see section 7.2.3). Under Unix, the password file can be generated by using the following command to write directly to the password file: condor_store_cred -f /path/to/password/file
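On this cluster, with SEC_PASSWORD_FILE set to /etc/condor/pool_password (see condor_config.local below), the steps reduce to the following transcript sketch. The chmod is a belt-and-braces precaution of mine; condor_store_cred should already leave the file readable only by the invoking UID:

```
condor_store_cred -f /etc/condor/pool_password
chmod 600 /etc/condor/pool_password
```

The same pool password must be installed on every machine in the pool (head node and worker nodes) for daemon-to-daemon PASSWORD authentication to succeed.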
- File: /etc/condor/condor_config.local
- Notes: The main condor configuration custom file. Implements part of the scaling factor logic, reasonable security parameters, and defragmentation logic (to allow both multicore and singlecore job loads).
- Customise: Yes. You'll need to edit it to suit your site.
- Content:
## What machine is your central manager?
CONDOR_HOST = $(FULL_HOSTNAME)

## Pool's short description
COLLECTOR_NAME = Condor at $(FULL_HOSTNAME)

## When is this machine willing to start a job?
START = FALSE

## When to suspend a job?
SUSPEND = FALSE

## When to nicely stop a job?
# When a job is running and the PREEMPT expression evaluates to True, the
# condor_startd will evict the job. The PREEMPT expression should reflect the
# requirements under which the machine owner will not permit a job to continue to run.
# For example, a policy to evict a currently running job when a key is hit or when
# it is the 9:00am work arrival time, would be expressed in the PREEMPT expression
# and enforced by the condor_startd.
PREEMPT = FALSE

# If there is a job from a higher priority user sitting idle, the
# condor_negotiator daemon may evict a currently running job submitted
# from a lower priority user if PREEMPTION_REQUIREMENTS is True.
PREEMPTION_REQUIREMENTS = FALSE

# No job has pref over any other
#RANK = FALSE

## When to instantaneously kill a preempting job
## (e.g. if a job is in the pre-empting stage for too long)
KILL = FALSE

## This macro determines what daemons the condor_master will start and keep its watchful eyes on.
## The list is a comma or space separated list of subsystem names
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD

#######################################
# Andrew Lahiff's scaling
MachineRalScaling = "$$([ifThenElse(isUndefined(RalScaling), 1.00, RalScaling)])"
SUBMIT_EXPRS = $(SUBMIT_EXPRS) MachineRalScaling

#######################################
# Andrew Lahiff's security
ALLOW_WRITE =

UID_DOMAIN = ph.liv.ac.uk
CENTRAL_MANAGER1 = hepgrid2.ph.liv.ac.uk
COLLECTOR_HOST = $(CENTRAL_MANAGER1)

# Central managers
CMS = condor_pool@$(UID_DOMAIN)/hepgrid2.ph.liv.ac.uk
# CEs
CES = condor_pool@$(UID_DOMAIN)/hepgrid2.ph.liv.ac.uk
# Worker nodes
WNS = condor_pool@$(UID_DOMAIN)/192.168.*
# Users
USERS = *@$(UID_DOMAIN)
USERS = *

# Required for HA
HOSTALLOW_NEGOTIATOR = $(COLLECTOR_HOST)
HOSTALLOW_ADMINISTRATOR = $(COLLECTOR_HOST)
HOSTALLOW_NEGOTIATOR_SCHEDD = $(COLLECTOR_HOST)

# Authorization
HOSTALLOW_WRITE =
ALLOW_READ = */*.ph.liv.ac.uk
NEGOTIATOR.ALLOW_WRITE = $(CES), $(CMS)
COLLECTOR.ALLOW_ADVERTISE_MASTER = $(CES), $(CMS), $(WNS)
COLLECTOR.ALLOW_ADVERTISE_SCHEDD = $(CES)
COLLECTOR.ALLOW_ADVERTISE_STARTD = $(WNS)
SCHEDD.ALLOW_WRITE = $(USERS)
SHADOW.ALLOW_WRITE = $(WNS), $(CES)
ALLOW_DAEMON = condor_pool@$(UID_DOMAIN)/*.ph.liv.ac.uk, $(FULL_HOSTNAME)
ALLOW_ADMINISTRATOR = root@$(UID_DOMAIN)/$(IP_ADDRESS), condor_pool@$(UID_DOMAIN)/$(IP_ADDRESS), $(CMS)
ALLOW_CONFIG = root@$(FULL_HOSTNAME)

# Don't allow nobody to run jobs
SCHEDD.DENY_WRITE = nobody@$(UID_DOMAIN)

# Authentication
SEC_PASSWORD_FILE = /etc/condor/pool_password
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_READ_AUTHENTICATION = OPTIONAL
SEC_CLIENT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = PASSWORD,FS
SCHEDD.SEC_WRITE_AUTHENTICATION_METHODS = FS,PASSWORD
SCHEDD.SEC_DAEMON_AUTHENTICATION_METHODS = FS,PASSWORD
SEC_CLIENT_AUTHENTICATION_METHODS = FS,PASSWORD,CLAIMTOBE
SEC_READ_AUTHENTICATION_METHODS = FS,PASSWORD,CLAIMTOBE

# Integrity
SEC_DEFAULT_INTEGRITY = REQUIRED
SEC_DAEMON_INTEGRITY = REQUIRED
SEC_NEGOTIATOR_INTEGRITY = REQUIRED

# Multicore
DAEMON_LIST = $(DAEMON_LIST) DEFRAG
DEFRAG_SCHEDULE = graceful

# Note that there is some seriously stupid code around here.
# 1) Bug https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3199
#    DEFRAG_DRAINING_MACHINES_PER_HOUR * DEFRAG_INTERVAL / 3600 _must_ equate to more than 1, else no defrag occurs.
# 2) New bug
#    DEFRAG_DRAINING_MACHINES_PER_HOUR must be more than 1.0. If I set it to (say) 0.5, 12 defrags occur at midnight.
DEFRAG_INTERVAL = 3602
DEFRAG_MAX_CONCURRENT_DRAINING = 1
DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0
DEFRAG_MAX_WHOLE_MACHINES = 4

## Allow some defrag configuration to be settable
DEFRAG.SETTABLE_ATTRS_ADMINISTRATOR = DEFRAG_MAX_CONCURRENT_DRAINING,DEFRAG_DRAINING_MACHINES_PER_HOUR,DEFRAG_MAX_WHOLE_MACHINES
ENABLE_RUNTIME_CONFIG = TRUE

## Which machines are more desirable to drain
# If free cpus already 8 or more, give it a very low rank - i.e. do no defrag, pointless.
# Otherwise, assuming a machine with 16 TotalCpus, if it has fewer than 8 free Cpus
# e.g. say it has 4 free : (16 - 4)/(8 - 4) = 3
# e.g. say it has 5 free : (16 - 5)/(8 - 5) = 3.66
# Ergo, the nearer the machine is to having the full 8, the higher its rank.
DEFRAG_RANK = ifThenElse(Cpus >= 8, -10, (TotalCpus - Cpus)/(8.0 - Cpus))

# Definition of a "whole" machine:
# - anything with 8 cores (since multicore jobs only need 8 cores, don't need
#   to drain whole machines with > 8 cores)
# - must be configured to actually start new jobs (otherwise machines which are
#   deliberately being drained will be included)
DEFRAG_WHOLE_MACHINE_EXPR = ((Cpus == TotalCpus) || (Cpus >= 8)) && StartJobs =?= True

# Cancel once we have 8
DEFRAG_CANCEL_REQUIREMENTS = ((Cpus == TotalCpus) || (Cpus >= 8))

# Decide which machines to drain
# - must not be cloud machines (n/a at Liverpool)
# - must be healthy (n/a at Liverpool)
# - must be configured to actually start new jobs
#DEFRAG_REQUIREMENTS = PartitionableSlot && Offline =!= True && StartJobs =?= True
## Failed every time
DEFRAG_REQUIREMENTS = PartitionableSlot && Offline =!= True

## Logs
MAX_DEFRAG_LOG = 104857600
MAX_NUM_DEFRAG_LOG = 10
#DEFRAG_DEBUG = D_FULLDEBUG
#NEGOTIATOR_DEBUG = D_FULLDEBUG

# Port limits
HIGHPORT = 65000
LOWPORT = 20000

# History
HISTORY = $(SPOOL)/history
- File: /etc/ld.so.conf.d/condor.conf
- Notes: Condor needed this to access its libraries. I had to run “ldconfig” to make it take hold.
- Customise: Maybe not.
- Content:
/usr/lib64/condor/
- File: /usr/local/bin/scaling_factors_plugin.py
- Notes: This implements another part of the scaling factor logic.
- Customise: It should be generic.
- Content:
#!/usr/bin/python
# Copyright 2014 Science and Technology Facilities Council
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import re
from os.path import isfile

"""Usage: scaling_factors_plugin.py <status> <control dir> <jobid>

Authplugin for FINISHING STATE

Example:

  authplugin="FINISHING timeout=60,onfailure=pass,onsuccess=pass /usr/local/bin/scaling_factors_plugin.py %S %C %I"
"""

def ExitError(msg,code):
    """Print error message and exit"""
    from sys import exit
    print(msg)
    exit(code)

def GetScalingFactor(control_dir, jobid):
    errors_file = '%s/job.%s.errors' %(control_dir,jobid)
    if not isfile(errors_file):
        ExitError("No such errors file: %s"%errors_file,1)
    f = open(errors_file)
    errors = f.read()
    f.close()
    scaling = -1
    m = re.search('MATCH_EXP_MachineRalScaling = \"([\dE\+\-\.]+)\"', errors)
    if m:
        scaling = float(m.group(1))
    return scaling

def SetScaledTimes(control_dir, jobid):
    scaling_factor = GetScalingFactor(control_dir, jobid)
    diag_file = '%s/job.%s.diag' %(control_dir,jobid)
    if not isfile(diag_file):
        ExitError("No such errors file: %s"%diag_file,1)
    f = open(diag_file)
    lines = f.readlines()
    f.close()
    newlines = []
    types = ['WallTime=', 'UserTime=', 'KernelTime=']
    for line in lines:
        for type in types:
            if type in line and scaling_factor > 0:
                m = re.search('=(\d+)s', line)
                if m:
                    scaled_time = int(float(m.group(1))*scaling_factor)
                    line = type + str(scaled_time) + 's\n'
        newlines.append(line)
    fw = open(diag_file, "w")
    fw.writelines(newlines)
    fw.close()
    return 0

def main():
    """Main"""
    import sys
    # Parse arguments
    if len(sys.argv) == 4:
        (exe, status, control_dir, jobid) = sys.argv
    else:
        ExitError("Wrong number of arguments\n"+__doc__,1)
    if status == "FINISHING":
        SetScaledTimes(control_dir, jobid)
        sys.exit(0)
    sys.exit(1)

if __name__ == "__main__":
    main()
- File: /usr/local/bin/debugging_rte_plugin.py
- Notes: Useful for capturing debug output.
- Customise: It should be generic.
- Content:
#!/usr/bin/python
# This copies the files containing useful output from completed jobs into a directory

import shutil

"""Usage: debugging_rte_plugin.py <status> <control dir> <jobid>

Authplugin for FINISHED STATE

Example:

  authplugin="FINISHED timeout=60,onfailure=pass,onsuccess=pass /usr/local/bin/debugging_rte_plugin.py %S %C %I"
"""

def ExitError(msg, code):
    """Print error message and exit"""
    from sys import exit
    print(msg)
    exit(code)

def ArcDebuggingL(control_dir, jobid):

    from os.path import isfile

    try:
        m = open("/var/spool/arc/debugging/msgs", 'a')
    except IOError, err:
        print err.errno
        print err.strerror

    local_file = '%s/job.%s.local' % (control_dir, jobid)
    grami_file = '%s/job.%s.grami' % (control_dir, jobid)

    if not isfile(local_file):
        ExitError("No such description file: %s" % local_file, 1)
    if not isfile(grami_file):
        ExitError("No such description file: %s" % grami_file, 1)

    lf = open(local_file)
    local = lf.read()
    lf.close()

    if 'Organic Units' in local or 'stephen jones' in local:
        shutil.copy2(grami_file, '/var/spool/arc/debugging')
        f = open(grami_file)
        grami = f.readlines()
        f.close()
        for line in grami:
            m.write(line)
            if 'joboption_directory' in line:
                # Take the value between the first pair of single quotes
                comment = line[line.find("'")+1:line.find("'", line.find("'")+1)] + '.comment'
                shutil.copy2(comment, '/var/spool/arc/debugging')
            if 'joboption_stdout' in line:
                mystdout = line[line.find("'")+1:line.find("'", line.find("'")+1)]
                m.write("Try Copy mystdout - " + mystdout + "\n")
                if isfile(mystdout):
                    m.write("Copy mystdout - " + mystdout + "\n")
                    shutil.copy2(mystdout, '/var/spool/arc/debugging')
                else:
                    m.write("mystdout gone - " + mystdout + "\n")
            if 'joboption_stderr' in line:
                mystderr = line[line.find("'")+1:line.find("'", line.find("'")+1)]
                m.write("Try Copy mystderr - " + mystderr + "\n")
                if isfile(mystderr):
                    m.write("Copy mystderr - " + mystderr + "\n")
                    shutil.copy2(mystderr, '/var/spool/arc/debugging')
                else:
                    m.write("mystderr gone - " + mystderr + "\n")

    m.close()   # was close(m), which would raise a NameError
    return 0

def main():
    """Main"""
    import sys

    # Parse arguments
    if len(sys.argv) == 4:
        (exe, status, control_dir, jobid) = sys.argv
    else:
        ExitError("Wrong number of arguments\n", 1)

    if status == "FINISHED":
        ArcDebuggingL(control_dir, jobid)
        sys.exit(0)

    sys.exit(1)

if __name__ == "__main__":
    main()
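The quoted-value extraction used above is terse; this standalone sketch shows the same idiom on its own (the sample `.grami` line and path are invented for illustration):

```python
def quoted_value(line):
    # Return the text between the first pair of single quotes on a line,
    # as the plugin does for joboption_* entries in the .grami file.
    start = line.find("'") + 1
    end = line.find("'", start)
    return line[start:end]

# Hypothetical .grami line:
line = "joboption_directory='/var/spool/arc/grid/abc123'"
print(quoted_value(line))  # /var/spool/arc/grid/abc123
```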
- File: /usr/local/bin/default_rte_plugin.py
- Notes: Sets up the default run time environment.
- Customise: It should be generic.
- Content:
#!/usr/bin/python
# Copyright 2014 Science and Technology Facilities Council
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Usage: default_rte_plugin.py <status> <control dir> <jobid> <runtime environment>

Authplugin for PREPARING STATE

Example:

  authplugin="PREPARING timeout=60,onfailure=pass,onsuccess=pass /usr/local/bin/default_rte_plugin.py %S %C %I <rte>"
"""

def ExitError(msg, code):
    """Print error message and exit"""
    from sys import exit
    print(msg)
    exit(code)

def SetDefaultRTE(control_dir, jobid, default_rte):

    from os.path import isfile

    desc_file = '%s/job.%s.description' % (control_dir, jobid)

    if not isfile(desc_file):
        ExitError("No such description file: %s" % desc_file, 1)

    f = open(desc_file)
    desc = f.read()
    f.close()

    # Append the default RTE only if the description does not already name it
    if default_rte not in desc:
        with open(desc_file, "a") as myfile:
            myfile.write("( runtimeenvironment = \"" + default_rte + "\" )")

    return 0

def main():
    """Main"""
    import sys

    # Parse arguments
    if len(sys.argv) == 5:
        (exe, status, control_dir, jobid, default_rte) = sys.argv
    else:
        ExitError("Wrong number of arguments\n" + __doc__, 1)

    if status == "PREPARING":
        SetDefaultRTE(control_dir, jobid, default_rte)
        sys.exit(0)

    sys.exit(1)

if __name__ == "__main__":
    main()
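The check-then-append step in SetDefaultRTE is easy to exercise in isolation. This sketch applies the same logic to a throwaway description file (the file contents and RTE name are invented for the test); note the second call is a no-op because the RTE is already present:

```python
import os
import tempfile

def set_default_rte(desc_file, default_rte):
    # Same logic as SetDefaultRTE: append the RTE only if absent.
    with open(desc_file) as f:
        desc = f.read()
    if default_rte not in desc:
        with open(desc_file, "a") as f:
            f.write('( runtimeenvironment = "%s" )' % default_rte)

# Exercise it on a temporary job description.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w") as f:
    f.write("( executable = /bin/true )\n")

set_default_rte(path, "ENV/PROXY")
set_default_rte(path, "ENV/PROXY")   # second call changes nothing

with open(path) as f:
    content = f.read()
os.remove(path)
print(content.count("ENV/PROXY"))  # 1
```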
- File: /etc/lcmaps/lcmaps.db
- Notes: Connects the authentication layer to an ARGUS server
- Customise: Yes. It must be changed to suit your site.
- Content:
path = /usr/lib64/lcmaps

verify_proxy = "lcmaps_verify_proxy.mod"
               "-certdir /etc/grid-security/certificates"
               "--discard_private_key_absence"
               "--allow-limited-proxy"

pepc = "lcmaps_c_pep.mod"
       "--pep-daemon-endpoint-url https://hepgrid9.ph.liv.ac.uk:8154/authz"
       "--resourceid http://authz-interop.org/xacml/resource/resource-type/arc"
       "--actionid http://glite.org/xacml/action/execute"
       "--capath /etc/grid-security/certificates/"
       "--certificate /etc/grid-security/hostcert.pem"
       "--key /etc/grid-security/hostkey.pem"

# Policies:
arc:
verify_proxy -> pepc
- File: /etc/profile.d/env.sh
- Notes: Sets up environment variables for specific VO jobs.
- Customise: Yes. It must be changed to suit your site.
- Content:
if [ "X${GLITE_ENV_SET+X}" = "X" ]; then
    . /usr/libexec/grid-env-funcs.sh
    if [ "x${GLITE_UI_ARCH:-$1}" = "x32BIT" ]; then arch_dir=lib; else arch_dir=lib64; fi
    gridpath_prepend "PATH" "/bin"
    gridpath_prepend "MANPATH" "/opt/glite/share/man"
    gridenv_set "VO_ZEUS_SW_DIR" "/opt/exp_soft_sl5/zeus"
    gridenv_set "VO_ZEUS_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_VO_SIXT_CERN_CH_SW_DIR" "/opt/exp_soft_sl5/sixt"
    gridenv_set "VO_VO_SIXT_CERN_CH_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_VO_NORTHGRID_AC_UK_SW_DIR" "/opt/exp_soft_sl5/northgrid"
    gridenv_set "VO_VO_NORTHGRID_AC_UK_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_T2K_ORG_SW_DIR" "/cvmfs/t2k.gridpp.ac.uk"
    gridenv_set "VO_T2K_ORG_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_SNOPLUS_SNOLAB_CA_SW_DIR" "/cvmfs/snoplus.gridpp.ac.uk"
    gridenv_set "VO_SNOPLUS_SNOLAB_CA_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_PLANCK_SW_DIR" "/opt/exp_soft_sl5/planck"
    gridenv_set "VO_PLANCK_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_PHENO_SW_DIR" "/opt/exp_soft_sl5/pheno"
    gridenv_set "VO_PHENO_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_OPS_SW_DIR" "/opt/exp_soft_sl5/ops"
    gridenv_set "VO_OPS_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_NEISS_ORG_UK_SW_DIR" "/opt/exp_soft_sl5/neiss"
    gridenv_set "VO_NEISS_ORG_UK_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_NA62_VO_GRIDPP_AC_UK_SW_DIR" "/cvmfs/na62.gridpp.ac.uk"
    gridenv_set "VO_NA62_VO_GRIDPP_AC_UK_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_MICE_SW_DIR" "/cvmfs/mice.gridpp.ac.uk"
    gridenv_set "VO_MICE_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_MAGIC_SW_DIR" "/opt/exp_soft_sl5/magic"
    gridenv_set "VO_MAGIC_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_LHCB_SW_DIR" "/cvmfs/lhcb.cern.ch"
    gridenv_set "VO_LHCB_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_ILC_SW_DIR" "/cvmfs/ilc.desy.de"
    gridenv_set "VO_ILC_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_HONE_SW_DIR" "/cvmfs/hone.gridpp.ac.uk"
    gridenv_set "VO_HONE_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_GRIDPP_SW_DIR" "/opt/exp_soft_sl5/gridpp"
    gridenv_set "VO_GRIDPP_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_GEANT4_SW_DIR" "/opt/exp_soft_sl5/geant4"
    gridenv_set "VO_GEANT4_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_FUSION_SW_DIR" "/opt/exp_soft_sl5/fusion"
    gridenv_set "VO_FUSION_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_ESR_SW_DIR" "/opt/exp_soft_sl5/esr"
    gridenv_set "VO_ESR_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_EPIC_VO_GRIDPP_AC_UK_SW_DIR" "/opt/exp_soft_sl5/epic"
    gridenv_set "VO_EPIC_VO_GRIDPP_AC_UK_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_DZERO_SW_DIR" "/opt/exp_soft_sl5/dzero"
    gridenv_set "VO_DZERO_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_DTEAM_SW_DIR" "/opt/exp_soft_sl5/dteam"
    gridenv_set "VO_DTEAM_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_CMS_SW_DIR" "/opt/exp_soft_sl5/cms"
    gridenv_set "VO_CMS_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_CERNATSCHOOL_ORG_SW_DIR" "/cvmfs/cernatschool.gridpp.ac.uk"
    gridenv_set "VO_CERNATSCHOOL_ORG_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_CDF_SW_DIR" "/opt/exp_soft_sl5/cdf"
    gridenv_set "VO_CDF_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_CAMONT_SW_DIR" "/opt/exp_soft_sl5/camont"
    gridenv_set "VO_CAMONT_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_CALICE_SW_DIR" "/opt/exp_soft_sl5/calice"
    gridenv_set "VO_CALICE_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_BIOMED_SW_DIR" "/opt/exp_soft_sl5/biomed"
    gridenv_set "VO_BIOMED_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_ATLAS_SW_DIR" "/cvmfs/atlas.cern.ch/repo/sw"
    gridenv_set "VO_ATLAS_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "VO_ALICE_SW_DIR" "/opt/exp_soft_sl5/alice"
    gridenv_set "VO_ALICE_DEFAULT_SE" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "SITE_NAME" "UKI-NORTHGRID-LIV-HEP"
    gridenv_set "SITE_GIIS_URL" "hepgrid4.ph.liv.ac.uk"
    gridenv_set "RFIO_PORT_RANGE" ""20000,25000""
    gridenv_set "MYPROXY_SERVER" "lcgrbp01.gridpp.rl.ac.uk"
    gridenv_set "LCG_LOCATION" "/usr"
    gridenv_set "LCG_GFAL_INFOSYS" "lcg-bdii.gridpp.ac.uk:2170,topbdii.grid.hep.ph.ic.ac.uk:2170"
    gridenv_set "GT_PROXY_MODE" "old"
    gridenv_set "GRID_ENV_LOCATION" "/usr/libexec"
    gridenv_set "GRIDMAPDIR" "/etc/grid-security/gridmapdir"
    gridenv_set "GLITE_LOCATION_VAR" "/var"
    gridenv_set "GLITE_LOCATION" "/usr"
    gridenv_set "GLITE_ENV_SET" "TRUE"
    gridenv_set "GLEXEC_LOCATION" "/usr"
    gridenv_set "DPNS_HOST" "hepgrid11.ph.liv.ac.uk"
    gridenv_set "DPM_HOST" "hepgrid11.ph.liv.ac.uk"
    . /usr/libexec/clean-grid-env-funcs.sh
fi
- File: /etc/grid-security/grid-mapfile
- Notes: Useful for directly mapping a user for testing. Superseded by ARGUS now, so optional.
- Customise: Yes. It must be changed to suit your site.
- Content:
"/C=UK/O=eScience/OU=Liverpool/L=CSD/CN=stephen jones" dteam184
- File: /root/glitecfg/site-info.def
- Notes: Just a copy of the site standard SID file. Used to make the accounts.
- Content: as per site standard
- File: /opt/glite/yaim/etc/users.conf
- Notes: Just a copy of the site standard users.conf file. Used to make the accounts.
- Content: as per site standard
- File: /opt/glite/yaim/etc/groups.conf
- Notes: Just a copy of the site standard groups.conf file. Used to make the accounts.
- Content: as per site standard
- File: /root/glitecfg/vo.d
- Notes: Just a copy of the site standard vo.d dir. Used to make the accounts.
- Content: as per site standard
- File: /etc/arc/runtime/ENV/PROXYD
- Notes: Stops error messages of one kind or another
- Content: empty
- File: /etc/init.d/nordugrid-arc-egiis
- Notes: Stops error messages of one kind or another
- Content: empty
- File: /usr/etc/globus-user-env.sh
- Notes: Stops error messages of one kind or another
- Customise: No.
- Content: empty
Cron jobs
I had to add these cron jobs, illustrated with puppet stanzas.
- Cron: jura
- Purpose: Run the jura APEL reporter now and again
- Puppet stanza:
cron { "jura":
    # DEBUG DEBUG DEBUG DEBUG
    #ensure  => absent,
    command => "/usr/libexec/arc/jura /var/spool/arc/jobstatus &>> /var/log/arc/jura.log",
    user    => root,
    hour    => 6,
    minute  => 16,
}
- Cron: defrag
- Purpose: Sets the defrag parameters dynamically
- Puppet stanza:
cron { "set_defrag_parameters.sh":
    command  => "/root/scripts/set_defrag_parameters.sh >> /var/log/set_defrag_parameters.log",
    require  => File["/root/scripts/set_defrag_parameters.sh"],
    user     => root,
    minute   => "*/5",
    hour     => "*",
    monthday => "*",
    month    => "*",
    weekday  => "*",
}
Special notes
After installing the APEL package, I had to make one change by hand: on line 136 of the /usr/libexec/arc/ssmsend file, I added the parameter use_ssl = _use_ssl.
Notes on HEPSPEC Publishing Parameters
blah blah
Install the LSC Files
I used VomsSnooper to do this as follows.
# cd /opt/GridDevel/vomssnooper/usecases/getLSCRecords
# sed -i -e "s/ vomsdir/ \/etc\/grid-security\/vomsdir/g" getLSCRecords.sh
# getLSCRecords.sh
Make the user accounts
I used Yaim to do this as follows.
# yaim -r -s /root/glitecfg/site-info.def -n ABC -f config_users
Services
I had to enable the following services:
- A-rex - the ARC CE service
- condor - the Condor batch system service
- nordugrid-arc-ldap-infosys - part of the BDII
- nordugrid-arc-slapd - part of the BDII
- nordugrid-arc-bdii - part of the BDII
- gridftpd - the gridftp service
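On SL6 these are SysV init services, so enabling them is a chkconfig/service loop. A sketch; the service names here follow the list above, but init script names can differ slightly between ARC releases (e.g. a-rex vs A-rex), so check /etc/init.d/ for the exact names on your system:

```shell
# Enable each service at boot and start it now (SysV init on SL6).
for svc in a-rex condor nordugrid-arc-ldap-infosys \
           nordugrid-arc-slapd nordugrid-arc-bdii gridftpd; do
    chkconfig "$svc" on
    service "$svc" start
done
```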
And that was it. That's all I did to get the server working, as far as I can recall.
Worker Node
blah blah blah
Performance/Tuning
blah blah blah
Further Work
blah blah blah