BaBar Job Submission Example
Latest revision as of 19:44, 9 January 2006
This example shows, verbatim, the commands used to set up, build, and run a simple analysis-30 Beta job at RAL. Use analysis-30 for reading R18 data and Monte Carlo (run5 and reprocessed run1 to run4). With minor updates, these instructions may also work with later analysis releases.
In the following examples, user input follows the default tcsh shell prompt (for example, [csfd] ~ >).
Log on
Log into one of the RAL Tier A front-end machines. It is best to use the alias babar.rl.ac.uk, which will connect to one of the machines randomly (thus distributing the load). Most users will have the same username at RAL as at SLAC, but if the username is different, you can specify it as we do here for user babartst.
adye@yakut01 $ ssh babartst@babar.rl.ac.uk
RAL Tier1/A SL 3.0.3 - Configured by PXE/Kickstart
RAL Tier1/A SL 3.0.3 standard update installed Tue Dec 21 16:38:12 GMT 2004
RAL Tier1/A SL 3.0.3 frontend config installed Tue Dec 21 16:42:43 GMT 2004
...
For BaBar-specific problems, please post to the RAL Tier A HyperNews Forum (or
contact Tim Adye <T.J.Adye@rl.ac.uk> or Emmanuel Olaiya <E.O.Olaiya@rl.ac.uk>).
For more general problems contact RAL User Support <support@gridpp.rl.ac.uk>.
[csfd] ~ >
In this case we were connected to csfd.rl.ac.uk.
Build
Create an analysis-30 test release in your home directory. (Your normal home directory at RAL is in NFS. You can use RAL AFS if you prefer, but beware that write access to AFS from batch jobs is problematic since your AFS token is not passed to the jobs.)
To save space in your home area, it is best to put the binaries in /stage/babar-user1/USER (where USER is your username, in this example babartst). The /stage/babar-user1/USER area has a lot more space than your home area, but is not backed up. The newrel -s option can be used to set up the release in your home area and binaries in /stage/babar-user1/USER.
[csfd] ~ > newrel -t -s /stage/babar-user1/babartst analysis-30 ana30
newrel version: 1.13
GNU Make version 3.79.1, Build OPTIONS = Linux24SL3_i386_gcc323-Debug-native-Objy-Optimize-Fastbuild-Ldlink2-SkipSlaclog-Static-Lstatic
Linux csfd.rl.ac.uk 2.4.21-37.ELsmp #1 SMP Wed Sep 28 12:13:44 CDT 2005 i686 i686 i386 GNU/Linux [uname -a]
[Warning]: ./bin/Linux24SL3_i386_gcc323 (and/or) /afs/rl.ac.uk/bfactory/dist/releases/18.6.2a/bin/Linux24SL3_i386_gcc323 is not in PATH, type 'srtpath' to fix PATH.
-> installdirs:
Creating database/GNUmakefile from release 18.6.2a
next, addpkg, checkout or ln -s to your packages, then gmake installdirs
remember to run srtpath. (see man page of srtpath about setting it up)
[csfd] ~ > cd ana30
[csfd] ~/ana30 > srtpath
enter release number (CR=18.6.2a): <Return>
Select/enter BFARCH (CR=1):
1) Linux24SL3_i386_gcc323 [prod][test][active][default]
2) Linux24RHEL3_i386_gcc323 [default2]
<Return>
Check out analysis packages:-
[csfd] ~/ana30 > addpkg BetaMiniUser
Offline Release 18.6.2a uses BetaMiniUser version V00-04-00, will check that out
...
[csfd] ~/ana30 > addpkg workdir
Offline Release 18.6.2a uses workdir version V00-04-20, will check that out
...
Note that it may be necessary to check out some additional fixes. See the Extra Tags page for the latest.
Setup, compile, and link:-
[csfd] ~/ana30 > gmake installdirs
GNU Make version 3.79.1, Build OPTIONS = Linux24SL3_i386_gcc323-Debug-native-Objy-Optimize-Fastbuild-Ldlink2-SkipSlaclog-Static-Lstatic
Linux csfd.rl.ac.uk 2.4.21-37.ELsmp #1 SMP Wed Sep 28 12:13:44 CDT 2005 i686 i686 i386 GNU/Linux [uname -a]
-> installdirs:
[csfd] ~/ana30 > gmake BetaMiniUser.all
GNU Make version 3.79.1, Build OPTIONS = Linux24SL3_i386_gcc323-Debug-native-Objy-Optimize-Fastbuild-Ldlink2-SkipSlaclog-Static-Lstatic
Linux csfd.rl.ac.uk 2.4.21-37.ELsmp #1 SMP Wed Sep 28 12:13:44 CDT 2005 i686 i686 i386 GNU/Linux [uname -a]
-> BetaMiniUser.all: (Mon Jan 9 13:31:00 GMT 2006)
...
Linking BetaMiniApp in BetaMiniUser [link-1]
bin stage done in /home/csf/babartst/ana30/BetaMiniUser
The link step may take some time (10-30 minutes). We are investigating ways to speed this up.
Setup
[csfd] ~/ana30 > cd workdir
[csfd] ~/ana30/workdir > gmake setup
RELDIR not specified. Defaults to ../
Release directory set to ../
Next we create tcl files listing the collections we will run over.
[csfd] ~/ana30/workdir > BbkDatasetTcl A0-Run5-OffPeak-R18b --tcl 200k --splitruns
babartst@gromit.slac.stanford.edu's password: *******
BbkDatasetTcl: wrote A0-Run5-OffPeak-R18b-1.tcl (1 collections, 200000 events)
BbkDatasetTcl: wrote A0-Run5-OffPeak-R18b-2.tcl (1 collections, 200000 events)
BbkDatasetTcl: wrote A0-Run5-OffPeak-R18b-3.tcl (2 collections, 200000 events)
BbkDatasetTcl: wrote A0-Run5-OffPeak-R18b-4.tcl (1 collections, 200000 events)
BbkDatasetTcl: wrote A0-Run5-OffPeak-R18b-5.tcl (1 collections, 200000 events)
BbkDatasetTcl: wrote A0-Run5-OffPeak-R18b-6.tcl (2 collections, 200000 events)
BbkDatasetTcl: wrote A0-Run5-OffPeak-R18b-7.tcl (2 collections, 200000 events)
BbkDatasetTcl: wrote A0-Run5-OffPeak-R18b-8.tcl (1 collections, 198980 events)
Selected 4 collections, 1598980/53931309 events, ~4012.2/pb, from bbkr18 at ral
You may be prompted for a password to connect to gromit.slac.stanford.edu. You should enter your normal SLAC password. This will only be done the first time - thereafter the database connection information will be cached (in ~/.bbk/sites). See the Bookkeeping FAQ for help if this doesn't work.
Each job will run over one of these tcl files. We limited each job to 200000 events so that it finishes within the batch time limit. The right split depends on your particular analysis job, so you should experiment with different splittings: longer jobs are more efficient and simpler to manage, but for a reasonable turnaround it is best not to have a job that runs longer than a day or so (the default job limit is 24 hours of CPU time). For this simple example, a job takes about an hour of CPU time.
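The number of jobs follows from the event totals reported by BbkDatasetTcl. As a small sketch (the njobs helper is a hypothetical name introduced here, not a BaBar tool), the rounding-up division can be done in Bourne shell arithmetic:

```shell
# Number of jobs needed to cover TOTAL events at PER_JOB events per job,
# rounded up so that the final, shorter job is counted too.
# Usage: njobs TOTAL PER_JOB
njobs() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Totals from the BbkDatasetTcl output above: 1598980 events, 200000 per job.
njobs 1598980 200000   # prints 8
```

This matches the 8 tcl files written above; the same helper can be used to see how many jobs a different --tcl limit would produce.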
Each job will need a tcl snippet to configure how it is to be run. Here is an example for the first job. Let's call that run-A0-Run5-OffPeak-R18b-1.tcl (to match the data specification file, A0-Run5-OffPeak-R18b-1.tcl):-
source A0-Run5-OffPeak-R18b-1.tcl
set levelOfDetail "cache"
set ConfigPatch "Run2"
set BetaMiniTuple "root"
set histFileName "hist-A0-Run5-OffPeak-R18b-1.root"
sourceFoundFile BetaMiniUser/MyMiniAnalysis.tcl
For a quick test, you may want to add
set NEvent 500
at the top (or anywhere before the sourceFoundFile line) to just run over the first 500 events.
Make sure that each job's output files (hist-A0-Run5-OffPeak-R18b-1.root in this example) have unique names, otherwise one job will overwrite another's output. ConfigPatch should be set to "Run2" (for real data) or "MC" (for Monte Carlo).
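Writing one snippet per data specification file by hand becomes tedious, so the pattern above can be generated. The following is a minimal sketch in Bourne shell (run it with sh, not the interactive tcsh), assuming it is run from the workdir and that every job uses the same settings as the hand-written example for real data. The function name make_run_tcl is hypothetical.

```shell
# Write a run-<name>.tcl snippet for each data specification file given.
# Assumptions: the current directory is the workdir, and all jobs read
# real data (ConfigPatch "Run2"); change that line to "MC" for Monte Carlo.
make_run_tcl() {
    for data in "$@"; do
        name=$(basename "$data" .tcl)
        # Deriving the histogram name from the data file name keeps the
        # output files unique.
        cat > "run-$name.tcl" <<EOF
source $name.tcl
set levelOfDetail "cache"
set ConfigPatch "Run2"
set BetaMiniTuple "root"
set histFileName "hist-$name.root"
sourceFoundFile BetaMiniUser/MyMiniAnalysis.tcl
EOF
    done
}

# Usage, from the workdir:
#   make_run_tcl A0-Run5-OffPeak-R18b-*.tcl
```

Because the generated files start with run-, the glob in the usage line matches only the data specification files, so rerunning the function does not produce snippets of snippets.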
Run
Set up the conditions database:-
[csfd] ~/ana30/workdir > cond18boot
Setting OO_FD_BOOT to babarams1.rl.ac.uk::/raid/objy/databases/conditions/current/194/BaBar.BOOT
and unsetting OO_AMS_USAGE
Remember to redo the srtpath and cond18boot steps each time you log in.
From the workdir, we submit the first job like this:-
[csfd] ~/ana30/workdir > bbrbsub BetaMiniApp run-A0-Run5-OffPeak-R18b-1.tcl
1821979.csflnx353.rl.ac.uk
[csfd] ~/ana30/workdir > qstat -u babartst

csflnx353.rl.ac.uk:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1821979.csflnx3 babartst sl3p     BetaMiniAp    --    1  --    --  24:00 Q   --
You can use the bbrbsub -mae option to e-mail you when the job aborts or exits. The e-mail is sent to your RAL Tier A account, so you should keep an eye on that, or else set up a .forward file in your home directory.
When the job is done, the logfile is available in a file BetaMiniApp.o1821979 (where the number is the job number displayed by the bbrbsub command). There may also be a BetaMiniApp.e1821979 file, which contains any error messages. The logfile contains far more detail than you are likely to be interested in but, after a lot of setup, you should see messages like
EvtCounter: processing event # 1 [ 7f:4fff7fff:39d838/d95339ab:Y ]
finishing with
EvtCounter: processing event # 200000 [ 7f:4fff7fff:39dc0a/516c1407:X ]
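Rather than reading the whole logfile, the last event reported can be pulled out with standard tools. This is a sketch in Bourne shell that assumes the EvtCounter lines have exactly the form shown above. The function name last_event is hypothetical.

```shell
# Print the number of the last event that EvtCounter reported in a logfile.
# Prints nothing if the logfile contains no EvtCounter lines.
# Usage: last_event LOGFILE
last_event() {
    grep 'EvtCounter: processing event #' "$1" | tail -n 1 |
        sed 's/.*processing event # *\([0-9][0-9]*\).*/\1/'
}

# Usage, for the job above:
#   last_event BetaMiniApp.o1821979
```

Comparing the result with the event count that BbkDatasetTcl reported for the corresponding tcl file gives a quick indication of whether the job ran over all of its events.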
See the section on BaBar RAL Tier A documentation for details of job submission at RAL. Job submission is the only major difference between running at RAL and running at SLAC.
Comments
That's just the first of 8 jobs done. Of course many people set up scripts to create the tcl snippets and submit the jobs. A general-purpose framework for doing this is the Simple Job Manager.
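For a handful of jobs, a short loop may be enough. The sketch below, in Bourne shell, assumes that a run-*.tcl snippet already exists for each job and that it is run from the workdir after the srtpath and cond18boot steps. The function name submit_all and the SUBMIT variable are conventions invented for this example, not features of bbrbsub.

```shell
# Submit one batch job per tcl snippet given on the command line.
# SUBMIT is the submission command; it defaults to bbrbsub.
# Setting SUBMIT=echo prints each command instead of running it (a dry run).
# Stops at the first submission that fails.
submit_all() {
    for snippet in "$@"; do
        ${SUBMIT:-bbrbsub} BetaMiniApp "$snippet" || return 1
    done
}

# Usage, from the workdir:
#   SUBMIT=echo submit_all run-A0-Run5-OffPeak-R18b-*.tcl   # dry run
#   submit_all run-A0-Run5-OffPeak-R18b-*.tcl               # real submission
```

SUBMIT is deliberately left unquoted so that it can carry options, for example SUBMIT="bbrbsub -mae" to request the e-mail notification described earlier.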
Further Information
- BaBar RAL Tier A documentation
- BaBar Offline workbook (mostly examples for SLAC, though RAL usage is very similar).
- RAL Tier A User Guide (general documentation for the farm - not BaBar-specific).