A quick guide to HTCondor
Latest revision as of 14:25, 26 February 2016
Basic commands
Firstly, some basic HTCondor commands are as follows. To submit a job type:
condor_submit <file>
To list running and idle jobs type:
condor_q
To list completed jobs type:
condor_history
Note that lcgui03 and lcgui04 each have their own HTCondor schedd daemon, i.e. their own job queue. This means that condor_q and condor_history will only show jobs which were submitted on the host where you ran the query. However, if you run this command:
condor_q -global <username>
it will show idle and running jobs submitted to any schedd. For example:
-bash-4.1$ condor_q -global alahiff

-- Schedd: lcgui04.gridpp.rl.ac.uk : <130.246.181.132:25365?...
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
8212.0   alahiff         2/26 14:11   0+00:00:00 I  0   0.0  script.sh 1000

1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended

-- Schedd: lcgui03.gridpp.rl.ac.uk : <130.246.180.41:33754?...
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
91961.0  alahiff         2/26 14:10   0+00:00:09 R  0   0.0  script.sh 1000

1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
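If you need to cancel a job, the standard HTCondor tool is condor_rm (not shown above); as a quick sketch, using a job ID from the example output:

condor_rm 8212.0

You can also remove every job in a cluster by giving just the cluster ID (e.g. condor_rm 8212), or all of your own jobs with condor_rm -all.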
Job submission
Create a file called simplejob.sub
containing:
executable=script.sh
arguments=10
output=job.$(cluster).$(process).out
error=job.$(cluster).$(process).err
log=job.$(cluster).$(process).log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
request_memory=100
queue
and a script called script.sh
containing:
#!/bin/sh
sleep $1
hostname
Submit the job:
-bash-4.1$ condor_submit simplejob.sub
Submitting job(s).
1 job(s) submitted to cluster 91959.
Checking the status of the job:
-bash-4.1$ condor_q

-- Schedd: lcgui03.gridpp.rl.ac.uk : <130.246.180.41:33754?...
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
91959.0  alahiff         2/26 13:08   0+00:00:07 R  0   0.0  script.sh 10

1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Explanation of the content of simplejob.sub:

- executable=script.sh: the job will execute the script script.sh
- arguments=10: the argument 10 will be passed to the executable when it is run
- output=job.$(cluster).$(process).out: tells HTCondor the path (and name) of the file on the submit machine which will contain the job's stdout
- error=job.$(cluster).$(process).err: tells HTCondor the path (and name) of the file on the submit machine which will contain the job's stderr
- log=job.$(cluster).$(process).log: tells HTCondor where on the submit machine to write the job event log
- should_transfer_files = YES: tells HTCondor to transfer files to/from the worker node
- when_to_transfer_output = ON_EXIT: tells HTCondor to transfer any output files back only once the job has completed
- request_memory=100: requests 100 MB of memory for the job
Once the job has completed there will be 3 files visible, containing the log, stdout and stderr:
-bash-4.1$ ls -lt *91959*
-rw-r--r-- 1 alahiff esc 1032 Feb 26 13:09 job.91959.0.log
-rw-r--r-- 1 alahiff esc   24 Feb 26 13:09 job.91959.0.out
-rw-r--r-- 1 alahiff esc    0 Feb 26 13:08 job.91959.0.err
The log file will in fact be generated as soon as the job is submitted to HTCondor and will be updated regularly.
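Because the log file exists from submission time, it can be used to wait for the job to finish: condor_wait, a standard HTCondor tool, blocks until all jobs recorded in a log file have completed. For example, using the log file from the job above:

condor_wait job.91959.0.log

You can also give a timeout in seconds, e.g. condor_wait -wait 3600 job.91959.0.log, which gives up after an hour if the job has not finished.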
Input & output files
We don't use a shared filesystem on the worker nodes. Instead we rely on HTCondor to transfer files between the submit machine (e.g. lcgui03 or lcgui04) and the worker node. By default the executable is always transferred to the worker node automatically.
Note that with the example job description file above, all files generated by the job will be automatically transferred back to the machine where you submitted the job. You can prevent this from happening by adding the following to the job description file:
+TransferOutput=""
If there are specific output files you want copied back to the submit machine you can specify these using transfer_output_files, for example:
transfer_output_files = outputfile1,outputfile2
If the job needs additional files, you can add a line something like this:
transfer_input_files = input1.dat,input2.dat
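Putting the file-transfer options together with the earlier example, a complete submit file might look like the following sketch (the input and output file names here are illustrative):

executable=script.sh
arguments=10
output=job.$(cluster).$(process).out
error=job.$(cluster).$(process).err
log=job.$(cluster).$(process).log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = input1.dat,input2.dat
transfer_output_files = outputfile1,outputfile2
request_memory=100
queue

With transfer_output_files set, only the listed files are copied back; any other files the job creates on the worker node are discarded when the job finishes.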
Official documentation
http://research.cs.wisc.edu/htcondor/manual/v8.4/index.html
Information about submitting jobs: http://research.cs.wisc.edu/htcondor/manual/v8.4/condor_submit.html#man-condor-submit