BaBar: bbrbsub270

Aim

This project extends the bbrbsub command, already used to submit jobs to RAL, so that it can also submit jobs via the grid.

Name

For the moment the command is called bbrbsub270 (it is version 270 of the modified command; version numbers go up in steps of 10 rather than 1). In future it could be called _bbrbsub, gbbrbsub, bbrbsubg (where g would stand for grid), or simply bbrbsub like the original command.

Usage

bbrbsub270 can be used in exactly the same way as the normal, standard bbrbsub command and is compatible with the original version. To use it you therefore have to set up the same environment that the original bbrbsub command requires.

Example 1:

[lcgui01] ~/ana30/workdir > ~ > bbrbsub270 BetaMiniApp run-A0-Run5-OffPeak-R18b-1.tcl

In addition, you can now use the following new parameters:

--grid

--bring

--klog

The order of the parameters is not important.

Example 2:

[lcgui01] ~/ana30/workdir > ~ > bbrbsub270 --grid BetaMiniApp run-A0-Run5-OffPeak-R18b-1.tcl

Example 3:

[lcgui01] ~/ana30/workdir > ~ > bbrbsub270 --grid --bring BetaMiniApp run-A0-Run5-OffPeak-R18b-1.tcl

Example 4:

[lcgui01] ~/ana30/workdir > ~ > bbrbsub270 --grid --bring --klog BetaMiniApp run-A0-Run5-OffPeak-R18b-1.tcl

--grid parameter

With this parameter the job is sent to the grid. Remember that you must have a valid grid proxy for this to work: you have to be on a grid user interface machine and you must have run the grid-proxy-init command.
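
For example, before submitting you would typically do something like the following on the user interface machine (grid-proxy-info is only an optional check of the remaining proxy lifetime, not part of bbrbsub270):

grid-proxy-init          # create a grid proxy; you will be asked for your certificate pass phrase
grid-proxy-info          # optional: check that the proxy exists and how long it remains valid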

Instead of the standard output file (for instance BetaMiniApp.o2037286), the output file will have a name like BetaMiniApp.go_2037009_3OMjVC2uF7hJbj9dY4Ze3A@lcgrb01.gridpp.rl.ac.uk , but apart from the Prologue and Epilogue information its contents are the same.

To allow the job (which does not run under your userid) to write output to your directory you have to make the directory write-enabled (e.g. with the command chmod 775 .). The dangers of this are obvious - better to use a subdirectory you can afford to lose.
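
A safer pattern, sketched below, is to create a dedicated subdirectory, make only that one write-enabled, and submit from there (gridout is only an example name, and the tcl file is assumed to be reachable from that directory):

mkdir gridout                       # throwaway subdirectory for grid job output
chmod 775 gridout                   # let the grid job, which runs under a different userid, write into it
cd gridout
bbrbsub270 --grid BetaMiniApp run-A0-Run5-OffPeak-R18b-1.tcl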

--bring parameter

For now it only works if --grid is also set. It copies the working directory of the worker node where the job ran back to the directory from which the job was submitted, for instance /home/csf/castelli/ana30/workdir . The name of the copied directory will be something like dir_2037056_psCxCzejHwoksiamyjvxaw@lcgrb01.gridpp.rl.ac.uk , and the grid output (e.g. BetaMiniApp.go_2037009_3OMjVC2uF7hJbj9dY4Ze3A@lcgrb01.gridpp.rl.ac.uk ) will be put in this directory as well. In future we would like to extend this option to the pbs case too, i.e. to make it work even when the --grid parameter is not set.
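
After the job has finished, the copied-back directories can be found in the submission directory with something like the following (the pattern is only indicative):

ls -d dir_*                         # one returned worker-node directory per grid job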

--klog parameter

It obtains the AFS token on the worker node by running 'gssklog -cell rl.ac.uk'.

For this to work the map file has to link your grid certificate DN to your AFS account (otherwise anyone with a grid certificate could write to any AFS user directory!). You can arrange this by emailing support@gridpp.rl.ac.uk .
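
On the worker node this amounts to running something like the following (the tokens command is shown only as an optional check and is not part of bbrbsub270):

gssklog -cell rl.ac.uk              # obtain an AFS token for the rl.ac.uk cell from the grid proxy
tokens                              # optional: list the AFS tokens currently held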

--dest parameter

It selects the site to which the job will be sent via the grid.

For now there are only two possible options:

  1. ral (default)
  2. man

If you do not specify the --dest option the job is sent to ral by default; with '--dest ral' the job is again sent to ral, while with '--dest man' it is sent to the Manchester farm.

To do that you have to be in an AFS environment and type a command similar to:

[lcgui01] ~/afs/rl.ac.uk/user/c/castelli/ana31 > ~ > bbrbsub270 --grid --klog --dest man -f ls

-t parameter

The -t parameter was already present in the original bbrbsub version. Without the --grid parameter it has the same effect as before, but together with --grid it prints to the screen the two shell scripts used by bbrbsub270 and does not submit or run the job.
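
For instance, a command like the following should print the generated scripts without submitting anything:

[lcgui01] ~/ana30/workdir > ~ > bbrbsub270 --grid -t BetaMiniApp run-A0-Run5-OffPeak-R18b-1.tcl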

-o parameter

The -o parameter was already present in the original bbrbsub version. Without the --grid parameter it has the same effect as before, but together with --grid it copies the output of the grid job to the filename given after -o. The output of the edg-job-submit command, for any grid monitoring the user may want to do, is also copied into the filename.grid.job.txt file when the command is executed.
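
For instance, with an example output filename such as BetaMini.log:

[lcgui01] ~/ana30/workdir > ~ > bbrbsub270 --grid -o BetaMini.log BetaMiniApp run-A0-Run5-OffPeak-R18b-1.tcl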

grid.job.txt

For monitoring purposes, if the -o option is not given, when the job is launched with the --grid option the output from the grid submission is saved in the grid.job.txt file in the user interface working directory (e.g. /home/csf/castelli/ana30/workdir ).

The grid.job.txt file contains something like this:

*********************************************************************************************
                              JOB SUBMIT OUTCOME
The job has been successfully submitted to the Network Server.
Use edg-job-status command to check job current status. Your job identifier (edg_jobId) is:

- https://lcgrb01.gridpp.rl.ac.uk:9000/TMjkFFop8ppECWCxSwpxFA

*********************************************************************************************

You can now monitor the status of the job on the grid with the edg-job-status command, that is by typing:

[lcgui01] ~/ana30/workdir > ~ > edg-job-status https://lcgrb01.gridpp.rl.ac.uk:9000/TMjkFFop8ppECWCxSwpxFA

If the -o option is present, the same information is saved in the filename given after the -o parameter.

bbrbsub270 and SJM

The new bbrbsub270 command can be used in the SJM framework without any changes to the Python SJM code.

However, some care is needed when creating the related SJM input files. Three input files are needed:

SJMConfigFile.txt

SJMTestSnippet.tcl

Wrapper.sh

Every user has to change the directory paths used for the analysis, the file names, etc.; the following example files therefore have to be adapted to each specific case.

SJMConfigFile.txt :

# File to configure a Simple Job Manager
#

# define the name of the SJM 
SJMName = SJMTest

# name of the input dataset
#DatasetName = users-phnic-TwoPhotonPentaquarkSkim-BlackDiamond-Run1
#lappend inputList /store/PRskims/R18/18.6.0b/A0/89/A0_8998%selectEventSequence=1-200000
DatasetName = A0-Run5-OffPeak-R18b 

# max. number of events per tcl file (as used by the --tcl option in 
#                                     BbkDatasetTcl)
MaxEvents = 250000

# raw command options to be passed to BbkDatasetTcl
# WARNING: SJM adds the --tcl --basename and --splitruns options. Make sure
#          that adding options does not interfere with these defaults 
#          For the new database the options may be --dbname bbkr18
BbkDatasetTclRaw = 

# Template file for the creation of the Tcl Snippet files used 
#          for job configuration 
TclSnippet = SJMTestSnippet.tcl

# gowdy
# job wrapper script template
WrapperScript = Wrapper.sh

# Define the run directory - this is the directory where the logfile 
# and jobreport file will be written. The tag <ID> will be replaced by the 
# job specific job id. Please make sure that the run directory does not 
# overwrite already existing data.  
#RunDirectory = /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/<ID>
RunDirectory = /home/csf/castelli/ana30/workdir/SJMoutput/<ID>
# gowdy
TmpDirectory = /tmp/castelli_<NAME>_<ID>

# The name of the executable that should be run
Executable = BetaMiniApp

# This is the batch command that will be used to submit jobs to the 
# batch queue. The tags <LOG> and <TCL> will be replaced with the job 
# specific logfile name and tcl snippet name
#BatchCommand = bsub -q kanga -C 0 -o <LOG> <WRAPPER>
#BatchCommand = bbrbsub -o <LOG> <WRAPPER>
#BatchCommand = bbrbsub -l tmp -N <NAME>-<ID> -o <LOG> <WRAPPER>
# OK
#BatchCommand = bbrbsub -N <NAME>-<ID> -o <LOG> <WRAPPER>
#BatchCommand = ../../bbrbsub/bbrbsub270 --grid -N <NAME>-<ID> -o <LOG> <WRAPPER>
BatchCommand = bbrbsub270 --grid -N <NAME>-<ID> -o <LOG> <WRAPPER>

# some optional commands - useful for running at sites other than SLAC

# The following string at the end of the log file signals that the job 
# has completed in the batch queue 
# JobFinishedString = Resource usage summary
#JobFinishedString = Resources Used
JobFinishedString = Framework is exiting now.

# This string indicates that the job was executed successfully, i.e. exited
# with an exit code 0
# JobSuccessfulString = Successfully completed
#JobSuccessfulString = Resources Used
JobSuccessfulString  = Framework is exiting now.

# Options for running SJM with other users in a shared environment.
Share = 1

# Set the default mask. umask should be an integer. Overrides the
# default mask settings of 'umask 2' set by using Share
# umask = 2

# Options for use with SJMSprite: (defaults are values listed here)
#
# number of jobs to be running in the queue 
# SpriteMaxJobs = 50

# Sleep time in minutes between job checks (20-30 minutes is good enough
# for typical jobs)
# SpriteSleepTime = 20

# Notify by email when all jobs are done or SJMSprite terminates for 
# other reasons - give email address to this 
# SpriteEmailNotify =

# Check if SJMSprited still has a valid afs token. Suggested to set this
# at SLAC or wherever the token is needed to access files 
# SpriteCheckAfsToken = 0

SJMTestSnippet.tcl :

set ProdTclOnly true
source <INPUTTCL>
set levelOfDetail "cache"
set ConfigPatch   "Run2"
set BetaMiniTuple "root"
set jobReportName <JOBREPORT>
#set BetaOutputDir <TMPDIR>
set histFileName /home/csf/castelli/ana30/workdir/SJMoutput/<ID>/ntup_<ID>.root 
jobReport filename $jobReportName
sourceFoundFile BetaMiniUser/MyMiniAnalysis.tcl

Wrapper.sh :

#!/bin/bash

abort()
{
   echo "abort" > <EXITFILE>
   if [ <TMPDIR> ]; then
       rm -rf <TMPDIR>
   fi
   exit $EXITCODE
}
 
ulimit -c 0
if [ <TMPDIR> ]; then
    mkdir -p <TMPDIR>
    EXITCODE=$?
    if [ $EXITCODE != 0 ]; then 
	abort
    fi
fi 

<FRAMEAPP> <TCLSNIPPET>
EXITCODE=$?
if [ <TMPDIR> ]; then
    mv <TMPDIR>/* <RUNDIR>/.
    rm -rf <TMPDIR>
fi
echo "exit code $EXITCODE" > <EXITFILE>
exit $EXITCODE

Summary of things to do for using SJM with bbrbsub270

  1. Log in on a Grid User Interface and get a valid proxy certificate with the grid-proxy-init command.
  2. Create your own analysis environment (e.g. ana30/workdir).
  3. Do (or adapt) srtpath in ana30 and cond18boot in workdir.
  4. In the ana30 directory type addpkg SimpleJobManager ; then type gmake binscripts.
  5. Create the three SJM input files shown in the previous section and adapt them to your own analysis, directory paths and file names.
  6. When all this is done you can use the SJM commands in the standard way.
  7. The output from the edg-job-submit grid command is appended, for each grid job, to the grid.job.txt file in the workdir directory. This is only there so that, if something goes wrong, you can see afterwards what went wrong through the grid commands, as sketched below.
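
As a rough sketch, and assuming the example paths used above, the whole sequence looks something like this (the setup commands may need adapting to your release and site; the job id is the example one from earlier):

cd ~/ana30
srtpath                             # set up the release environment (adapt as needed)
addpkg SimpleJobManager
gmake binscripts
cd workdir
cond18boot                          # set up the conditions in the working directory (adapt as needed)
# create or adapt SJMConfigFile.txt, SJMTestSnippet.tcl and Wrapper.sh, then run the usual SJM commands
grep https grid.job.txt             # shows the edg_jobId of every grid job submitted so far
edg-job-status https://lcgrb01.gridpp.rl.ac.uk:9000/TMjkFFop8ppECWCxSwpxFA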

Giuliano Castelli 13:59, 8 Jun 2006 (BST)