A quick guide to HTCondor
Revision as of 14:19, 26 February 2016

Basic commands

Firstly, some basic HTCondor commands are as follows. To submit a job type:

condor_submit <file>

To list running and idle jobs type:

condor_q

To list completed jobs type:

condor_history

Note that lcgui03 and lcgui04 each have their own HTCondor schedd daemon, i.e. their own job queue. This means that condor_q and condor_history will only show jobs which were submitted on the host where you ran the query. However, if you run this command:

condor_q -global <username>

it will show idle and running jobs submitted to any schedd. For example:

-bash-4.1$ condor_q -global alahiff


-- Schedd: lcgui04.gridpp.rl.ac.uk : <130.246.181.132:25365?...
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
8212.0   alahiff         2/26 14:11   0+00:00:00 I  0   0.0  script.sh 1000

1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended


-- Schedd: lcgui03.gridpp.rl.ac.uk : <130.246.180.41:33754?...
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
91961.0   alahiff         2/26 14:10   0+00:00:09 R  0   0.0  script.sh 1000

1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

Job submission

Create a file called simplejob.sub containing:

cmd=script.sh
arguments=10
output=job.$(cluster).$(process).out
error=job.$(cluster).$(process).err
log=job.$(cluster).$(process).log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
request_memory=100
queue

and a script called script.sh containing:

#!/bin/sh
sleep $1
hostname
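Before submitting, it can be worth checking that the script behaves as expected on its own. This is a suggestion rather than a step from the original guide; a quick local test might look like this:

```shell
# Recreate script.sh as given above: it sleeps for the number of
# seconds passed as its first argument, then prints the hostname of
# the machine it ran on.
cat > script.sh <<'EOF'
#!/bin/sh
sleep $1
hostname
EOF
chmod +x script.sh

# Run it locally with a short sleep to confirm it works before
# handing it to HTCondor. It should print this machine's hostname.
./script.sh 1
```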

Submit the job:

-bash-4.1$ condor_submit simplejob.sub
Submitting job(s).
1 job(s) submitted to cluster 91959.

Checking the status of the job:

-bash-4.1$ condor_q


-- Schedd: lcgui03.gridpp.rl.ac.uk : <130.246.180.41:33754?...
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
91959.0   alahiff         2/26 13:08   0+00:00:07 R  0   0.0  script.sh 10

1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

Explanation of the content of simplejob.sub:

  • cmd=script.sh: the job will execute the script script.sh
  • arguments=10: the argument 10 will be passed to the executable when it is run
  • request_memory=100: request 100MB memory for the job
  • should_transfer_files = YES: tells HTCondor to transfer files to/from the worker node
  • when_to_transfer_output = ON_EXIT: tells HTCondor to transfer any output files only once the job has completed
  • output=job.$(cluster).$(process).out: the path (and name) of the file on the submit machine which will contain the job's stdout
  • error=job.$(cluster).$(process).err: the path (and name) of the file on the submit machine which will contain the job's stderr
  • log=job.$(cluster).$(process).log: the path (and name) of the job event log file on the submit machine
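The $(cluster) and $(process) macros become particularly useful when a single submit file queues several jobs. As a sketch (not taken from the guide itself), changing the final line of simplejob.sub to queue 5 would submit five jobs in one cluster:

```
cmd=script.sh
arguments=10
output=job.$(cluster).$(process).out
error=job.$(cluster).$(process).err
log=job.$(cluster).$(process).log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
request_memory=100
queue 5
```

All five jobs share the same cluster ID, while $(process) runs from 0 to 4, so each job writes to its own output, error, and log files.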

Once the job has completed, three files will be visible, containing the job event log, stdout, and stderr:

-bash-4.1$ ls -lt *91959*
-rw-r--r-- 1 alahiff esc 1032 Feb 26 13:09 job.91959.0.log
-rw-r--r-- 1 alahiff esc   24 Feb 26 13:09 job.91959.0.out
-rw-r--r-- 1 alahiff esc    0 Feb 26 13:08 job.91959.0.err

Input & output files

Note that with the example job description file above, all files generated by the job will be automatically transferred back to the machine where you submitted the job (i.e. lcgui03 or lcgui04). You can prevent this from happening by adding the following to the job description file:

+TransferOutput=""

If there are specific output files you want copied back to the submit machine you can specify these using transfer_output_files, for example:

transfer_output_files = outputfile1,outputfile2

If the job needs additional input files, you can add a line like this:

transfer_input_files = input1.dat,input2.dat
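Putting these pieces together, a submit file using both input and output transfer might look like the sketch below (the .dat and outputfile names are illustrative, carried over from the examples above):

```
cmd=script.sh
arguments=10
output=job.$(cluster).$(process).out
error=job.$(cluster).$(process).err
log=job.$(cluster).$(process).log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = input1.dat,input2.dat
transfer_output_files = outputfile1,outputfile2
request_memory=100
queue
```

With this, HTCondor copies input1.dat and input2.dat to the worker node before the job starts, and on exit copies back only outputfile1 and outputfile2 rather than everything the job generated.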


Official documentation

http://research.cs.wisc.edu/htcondor/manual/v8.4/index.html

Information about submitting jobs: http://research.cs.wisc.edu/htcondor/manual/v8.4/condor_submit.html#man-condor-submit