ARC HTCondor Basic Install
NOTE: This page is kept for historical reasons. There is a more complete example at Example_Build_of_an_ARC/Condor_Cluster.
This page explains how to set up a minimal ARC CE and HTCondor pool. To keep things as simple as possible, the CE, HTCondor central manager and worker node are all set up on a single machine. This is of course not suitable for a production system, but it allows people new to ARC and/or HTCondor to quickly have a fully functioning system for testing.
Prerequisites
Prepare an SL6 VM with a valid host certificate. EMI and UMD repositories should not be configured.
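A quick way to check that the host certificate is in place and has not expired is to inspect it with openssl (this assumes the certificate is in the standard location used later in this guide):

openssl x509 -in /etc/grid-security/hostcert.pem -noout -subject -dates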
ARC CE installation
YUM repository configuration for EPEL and NorduGrid:
rpm -Uvh https://anorien.csc.warwick.ac.uk/mirrors/epel/6/x86_64/epel-release-6-8.noarch.rpm
rpm -Uvh http://download.nordugrid.org/packages/nordugrid-release/releases/13.11/centos/el6/x86_64/nordugrid-release-13.11-1.el6.noarch.rpm
Install the ARC CE meta-package:
yum install nordugrid-arc-compute-element
This will install the NorduGrid 4.1.0 RPMs as well as any dependencies. For HTCondor support it is important to use NorduGrid ARC 4.1.0 or above.
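To see exactly which NorduGrid ARC packages and versions ended up on the machine, you can list them with rpm (a simple check, nothing more):

rpm -qa 'nordugrid-arc*'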
HTCondor installation
Set up the YUM repository:
cd /etc/yum.repos.d/
wget http://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rhel6.repo
Install the most recent stable version of HTCondor, currently 8.0.6:
yum install condor
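Once the installation has finished, you can confirm which version was actually installed:

condor_version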
HTCondor configuration
Configure HTCondor to use partitionable slots (by default, static slots are used). Create a file /etc/condor/config.d/00-slots containing the following:
NUM_SLOTS = 1
SLOT_TYPE_1 = cpus=100%,mem=100%,auto
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1_PARTITIONABLE = TRUE
Start HTCondor by running:
service condor start
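To confirm that the slot configuration has been picked up, you can query the configuration HTCondor is actually using; condor_config_val simply prints the value of the named parameter:

condor_config_val SLOT_TYPE_1
condor_config_val SLOT_TYPE_1_PARTITIONABLE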
ARC CE configuration
Create the required control and session directories:
mkdir -p /var/spool/arc/jobstatus
mkdir -p /var/spool/arc/grid
Create a simple grid-mapfile for testing, e.g. /etc/grid-security/grid-mapfile, containing a line such as:
"/C=UK/O=eScience/OU=CLRC/L=RAL/CN=andrew lahiff" pcms001
Replace the DN and user ID as necessary, and create the corresponding local user account.
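The mapped account has to exist locally. A minimal sketch, using the example account name pcms001 from the grid-mapfile above:

useradd -m pcms001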
Create a minimal configuration file /etc/arc.conf:
[common]
x509_user_key="/etc/grid-security/hostkey.pem"
x509_user_cert="/etc/grid-security/hostcert.pem"
x509_cert_dir="/etc/grid-security/certificates"
gridmap="/etc/grid-security/grid-mapfile"
lrms="condor"

[grid-manager]
user="root"
controldir="/var/spool/arc/jobstatus"
sessiondir="/var/spool/arc/grid"
runtimedir="/etc/arc/runtime"
logfile="/var/log/arc/grid-manager.log"
pidfile="/var/run/grid-manager.pid"
joblog="/var/log/arc/gm-jobs.log"
shared_filesystem="no"

[gridftpd]
user="root"
logfile="/var/log/arc/gridftpd.log"
pidfile="/var/run/gridftpd.pid"
port="2811"
allowunknown="no"

[gridftpd/jobs]
path="/jobs"
plugin="jobplugin.so"
allownew="yes"

[infosys]
user="root"
overwrite_config="yes"
port="2135"
registrationlog="/var/log/arc/inforegistration.log"
providerlog="/var/log/arc/infoprovider.log"

[cluster]
cluster_alias="MINIMAL Computing Element"
comment="This is a minimal out-of-box CE setup"
homogeneity="True"
architecture="adotf"
nodeaccess="outbound"
authorizedvo="cms"

[queue/grid]
name="grid"
homogeneity="True"
comment="Default queue"
nodecpu="adotf"
architecture="adotf"
defaultmemory="1000"
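One thing to note: the configuration above points runtimedir at /etc/arc/runtime, which none of the earlier steps create. If that directory does not already exist on your machine, create it before starting the services:

mkdir -p /etc/arc/runtime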
Start the GridFTP server, A-REX service and LDAP information system:
service gridftpd start
service a-rex start
service nordugrid-arc-ldap-infosys start
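Before moving on, it is worth checking that A-REX started without errors; its log file is the one configured by logfile in the [grid-manager] section above:

tail /var/log/arc/grid-manager.log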
The ARC CE and HTCondor pool are now ready.
Testing
First, check that HTCondor is working correctly:
[root@lcgvm21 ~]# condor_status -any
MyType             TargetType         Name
Collector          None               Personal Condor at lcgvm21.gridpp.rl.ac.u
Scheduler          None               lcgvm21.gridpp.rl.ac.uk
DaemonMaster       None               lcgvm21.gridpp.rl.ac.uk
Negotiator         None               lcgvm21.gridpp.rl.ac.uk
Machine            Job                slot1@lcgvm21.gridpp.rl.ac.uk
Usually the 'Collector' and 'Negotiator' would be running on a machine designated as the central manager, and the 'Scheduler' would be running on the CE. Here 'Machine' corresponds to a resource able to run jobs, i.e. a worker node. In addition, every machine running HTCondor has a 'Master' daemon which takes care of all the other HTCondor daemons running on it.
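Since everything runs on a single machine in this setup, all of the corresponding daemon processes should be visible locally:

ps ax | grep condor_

You should see condor_master, condor_collector, condor_negotiator, condor_schedd and condor_startd among the results.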
From a standard UI (the NorduGrid client RPMs are part of the standard EMI UI), check the status of the newly-installed ARC CE:
-bash-4.1$ arcinfo -c lcgvm21.gridpp.rl.ac.uk
Computing service: MINIMAL Computing Element (production)
Information endpoint: ldap://lcgvm21.gridpp.rl.ac.uk:2135/Mds-Vo-Name=local,o=grid
Information endpoint: ldap://lcgvm21.gridpp.rl.ac.uk:2135/o=glue
Submission endpoint: gsiftp://lcgvm21.gridpp.rl.ac.uk:2811/jobs (status: ok, interface: org.nordugrid.gridftpjob)
Note that it may take a few minutes for the information system to be available after services are started.
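If arcinfo returns nothing at first, you can check whether the LDAP information system is responding yet by querying it directly (this assumes ldapsearch from openldap-clients is available on the UI; replace the hostname with your own):

ldapsearch -x -H ldap://lcgvm21.gridpp.rl.ac.uk:2135 -b 'o=glue'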
Try submitting a test job using arctest, which creates and submits predefined test jobs:
-bash-4.1$ arctest -c lcgvm21.gridpp.rl.ac.uk -J 1
Test submitted with jobid: gsiftp://lcgvm21.gridpp.rl.ac.uk:2811/jobs/Um0NDmEkj2jnvMODjqAWcw5nABFKDmABFKDmOpOKDmABFKDmTCef4m
Check the status of the job. If you do this before the information system has been updated, you will see a response like this:
-bash-4.1$ arcstat gsiftp://lcgvm21.gridpp.rl.ac.uk:2811/jobs/Um0NDmEkj2jnvMODjqAWcw5nABFKDmABFKDmOpOKDmABFKDmTCef4m
WARNING: Job information not found in the information system: gsiftp://lcgvm21.gridpp.rl.ac.uk:2811/jobs/Um0NDmEkj2jnvMODjqAWcw5nABFKDmABFKDmOpOKDmABFKDmTCef4m
WARNING: This job was very recently submitted and might not yet have reached the information system
No jobs
When the job has finished running you should see this:
-bash-4.1$ arcstat gsiftp://lcgvm21.gridpp.rl.ac.uk:2811/jobs/Um0NDmEkj2jnvMODjqAWcw5nABFKDmABFKDmOpOKDmABFKDmTCef4m
Job: gsiftp://lcgvm21.gridpp.rl.ac.uk:2811/jobs/Um0NDmEkj2jnvMODjqAWcw5nABFKDmABFKDmOpOKDmABFKDmTCef4m
Name: arctest1
State: Finished (FINISHED)
Exit Code: 0
The job's output can also be obtained easily:
-bash-4.1$ arcget gsiftp://lcgvm21.gridpp.rl.ac.uk:2811/jobs/Um0NDmEkj2jnvMODjqAWcw5nABFKDmABFKDmOpOKDmABFKDmTCef4m
Results stored at: Um0NDmEkj2jnvMODjqAWcw5nABFKDmABFKDmOpOKDmABFKDmTCef4m
Jobs processed: 1, successfully retrieved: 1, successfully cleaned: 1
You can of course also create your own jobs. Create a file, e.g. test.xrsl, containing:
&(executable="test.sh")
(stdout="test.out")
(stderr="test.err")
(jobname="ARC-HTCondor test")
and create an executable test.sh, for example:
#!/bin/sh
printenv
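Before submitting, you can make the script executable and give it a quick local run as a sanity check (purely optional):

chmod +x test.sh
./test.sh | head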
Submit the job:
-bash-4.1$ arcsub -c lcgvm21.gridpp.rl.ac.uk test.xrsl
Job submitted with jobid: gsiftp://lcgvm21.gridpp.rl.ac.uk:2811/jobs/D7rNDmOKk2jnvMODjqAWcw5nABFKDmABFKDmoqQKDmABFKDmvdxuEn
The job should soon finish:
-bash-4.1$ arcstat gsiftp://lcgvm21.gridpp.rl.ac.uk:2811/jobs/D7rNDmOKk2jnvMODjqAWcw5nABFKDmABFKDmoqQKDmABFKDmvdxuEn
Job: gsiftp://lcgvm21.gridpp.rl.ac.uk:2811/jobs/D7rNDmOKk2jnvMODjqAWcw5nABFKDmABFKDmoqQKDmABFKDmvdxuEn
Name: ARC-HTCondor test
State: Finished (FINISHED)
Exit Code: 0
and the output can then be retrieved:
-bash-4.1$ arcget gsiftp://lcgvm21.gridpp.rl.ac.uk:2811/jobs/D7rNDmOKk2jnvMODjqAWcw5nABFKDmABFKDmoqQKDmABFKDmvdxuEn
Results stored at: D7rNDmOKk2jnvMODjqAWcw5nABFKDmABFKDmoqQKDmABFKDmvdxuEn
Jobs processed: 1, successfully retrieved: 1, successfully cleaned: 1
-bash-4.1$ ls D7rNDmOKk2jnvMODjqAWcw5nABFKDmABFKDmoqQKDmABFKDmvdxuEn
test.err  test.out
You can check the status of jobs by running condor_q on the CE. For example, after submitting a number of sleep jobs, you might see something like this:
[root@lcgvm21 ~]# condor_q

-- Submitter: lcgvm21.gridpp.rl.ac.uk : <130.246.181.102:52146> : lcgvm21.gridpp.rl.ac.uk
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   3.0   pcms001         5/8  16:50   0+00:00:59 R  0   0.0  (ARC_HTCondor_te )
   4.0   pcms001         5/8  16:51   0+00:00:00 I  0   0.0  (ARC_HTCondor_te )
   5.0   pcms001         5/8  16:51   0+00:00:00 I  0   0.0  (ARC_HTCondor_te )
   6.0   pcms001         5/8  16:51   0+00:00:00 I  0   0.0  (ARC_HTCondor_te )
   7.0   pcms001         5/8  16:51   0+00:00:00 I  0   0.0  (ARC_HTCondor_te )

5 jobs; 0 completed, 0 removed, 4 idle, 1 running, 0 held, 0 suspended
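To see which worker node each running job has been matched to, condor_q can also list only the running jobs together with their hosts:

condor_q -run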
You can see the status of worker nodes by running condor_status on the CE. For example:
[root@lcgvm21 ~]# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@lcgvm21.grid LINUX      X86_64 Unclaimed Idle      0.110  845  0+00:00:04
slot1_1@lcgvm21.gr LINUX      X86_64 Claimed   Busy      0.000 1024  0+00:00:03
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     2     0       1         1       0          0        0

               Total     2     0       1         1       0          0        0
In this case the worker node is of course the same machine as the CE. Note that here "slot1" is a partitionable slot, which contains all the resources (CPU, memory, etc.) of the worker node; this slot will always be idle. "slot1_1" is an example of a dynamic slot; dynamic slots are created automatically and are what actually run jobs. When there are no running jobs on the worker node you will see just the single partitionable slot:
[root@lcgvm21 ~]# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@lcgvm21.grid LINUX      X86_64 Unclaimed Idle      0.100  845  0+00:19:30
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     1     0       0         1       0          0        0

               Total     1     0       0         1       0          0        0
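If you want more detail on the slots, for example how much CPU and memory each dynamic slot was given, condor_status can print selected ClassAd attributes directly. This is a sketch and assumes the -autoformat option is available in your HTCondor version; the attribute names are standard startd attributes:

condor_status -autoformat Name SlotType Cpus Memory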