BaBar Job Submission Example

From GridPP Wiki

This example shows, verbatim, the commands used to set up, build, and run a simple analysis-30 Beta job at RAL. analysis-30 should be used for reading R18 data and Monte Carlo (run5 and reprocessed run1 to run4). With minor updates, these instructions may also work with later analysis releases.

In the following examples, user input is the text typed after the default tcsh shell prompt.

Log on

Log into one of the RAL Tier A front-end machines. It is best to use the alias babar.rl.ac.uk, which will connect to one of the machines randomly (thus distributing the load). Most users will have the same username at RAL as at SLAC, but if the username is different, you can specify it as we do here for user babartst.

adye@yakut01 $ ssh babartst@babar.rl.ac.uk

RAL Tier1/A SL 3.0.3 - Configured by PXE/Kickstart
RAL Tier1/A SL 3.0.3 standard update installed Tue Dec 21 16:38:12 GMT  2004
RAL Tier1/A SL 3.0.3 frontend config installed Tue Dec 21 16:42:43 GMT 2004

...

For BaBar-specific problems, please post to the RAL Tier A HyperNews Forum (or
contact Tim Adye <T.J.Adye@rl.ac.uk> or Emmanuel Olaiya <E.O.Olaiya@rl.ac.uk>).
For more general problems contact RAL User Support <support@gridpp.rl.ac.uk>.
[csfd] ~ >

In this case we were connected to csfd.rl.ac.uk.

Build

Create an analysis-30 test release in your home directory. (Your normal home directory at RAL is in NFS. You can use RAL AFS if you prefer, but beware that write access to AFS from batch jobs is problematic since your AFS token is not passed to the jobs.)

To save space in your home area, it is best to put the binaries in /stage/babar-user1/USER (where USER is your username, in this example babartst). The /stage/babar-user1/USER area has a lot more space than your home area, but is not backed up. The newrel -s option can be used to set up the release in your home area and binaries in /stage/babar-user1/USER.

[csfd] ~ > newrel -t -s /stage/babar-user1/babartst analysis-30 ana30
newrel version: 1.13 
GNU Make version 3.79.1,
Build OPTIONS = Linux24SL3_i386_gcc323-Debug-native-Objy-Optimize-Fastbuild-Ldlink2-SkipSlaclog-Static-Lstatic
Linux csfd.rl.ac.uk 2.4.21-37.ELsmp #1 SMP Wed Sep 28 12:13:44 CDT 2005 i686 i686 i386 GNU/Linux  [uname -a]
[Warning]: ./bin/Linux24SL3_i386_gcc323 (and/or) /afs/rl.ac.uk/bfactory/dist/releases/18.6.2a/bin/Linux24SL3_i386_gcc323 is not in PATH, type 'srtpath' to fix PATH.
-> installdirs:
Creating database/GNUmakefile from release 18.6.2a
next, addpkg, checkout or ln -s to your packages, then gmake installdirs
      remember to run srtpath. (see man page of srtpath about setting it up)

[csfd] ~ > cd ana30

[csfd] ~/ana30 > srtpath
enter release number (CR=18.6.2a):
<Return>
Select/enter BFARCH (CR=1):
1) Linux24SL3_i386_gcc323      [prod][test][active][default]
2) Linux24RHEL3_i386_gcc323    [default2]
<Return>

Check out analysis packages:-

[csfd] ~/ana30 > addpkg BetaMiniUser
Offline Release 18.6.2a uses BetaMiniUser version V00-04-00, will check that out
...
[csfd] ~/ana30 > addpkg workdir
Offline Release 18.6.2a uses workdir version V00-04-20, will check that out
...

Note that it may be necessary to check out some additional fixes. See the Extra Tags page for the latest.

Setup, compile, and link:-

[csfd] ~/ana30 > gmake installdirs
GNU Make version 3.79.1,
Build OPTIONS = Linux24SL3_i386_gcc323-Debug-native-Objy-Optimize-Fastbuild-Ldlink2-SkipSlaclog-Static-Lstatic
Linux csfd.rl.ac.uk 2.4.21-37.ELsmp #1 SMP Wed Sep 28 12:13:44 CDT 2005 i686 i686 i386 GNU/Linux  [uname -a]
-> installdirs:

[csfd] ~/ana30 > gmake BetaMiniUser.all
GNU Make version 3.79.1,
Build OPTIONS = Linux24SL3_i386_gcc323-Debug-native-Objy-Optimize-Fastbuild-Ldlink2-SkipSlaclog-Static-Lstatic
Linux csfd.rl.ac.uk 2.4.21-37.ELsmp #1 SMP Wed Sep 28 12:13:44 CDT 2005 i686 i686 i386 GNU/Linux  [uname -a]
-> BetaMiniUser.all:   (Mon Jan  9 13:31:00 GMT 2006)
...
Linking BetaMiniApp in BetaMiniUser [link-1]
bin stage done in /home/csf/babartst/ana30/BetaMiniUser

The link step may take some time (10-30 minutes). We are investigating ways to speed this up.

Setup

[csfd] ~/ana30 > cd workdir
[csfd] ~/ana30/workdir > gmake setup
RELDIR not specified. Defaults to ../
Release directory set to ../

Next we create the tcl files listing the collections we will run over.

[csfd] ~/ana30/workdir > BbkDatasetTcl A0-Run5-OffPeak-R18b --tcl 200k --splitruns
babartst@gromit.slac.stanford.edu's password: *******
BbkDatasetTcl: wrote A0-Run5-OffPeak-R18b-1.tcl (1 collections, 200000 events)
BbkDatasetTcl: wrote A0-Run5-OffPeak-R18b-2.tcl (1 collections, 200000 events)
BbkDatasetTcl: wrote A0-Run5-OffPeak-R18b-3.tcl (2 collections, 200000 events)
BbkDatasetTcl: wrote A0-Run5-OffPeak-R18b-4.tcl (1 collections, 200000 events)
BbkDatasetTcl: wrote A0-Run5-OffPeak-R18b-5.tcl (1 collections, 200000 events)
BbkDatasetTcl: wrote A0-Run5-OffPeak-R18b-6.tcl (2 collections, 200000 events)
BbkDatasetTcl: wrote A0-Run5-OffPeak-R18b-7.tcl (2 collections, 200000 events)
BbkDatasetTcl: wrote A0-Run5-OffPeak-R18b-8.tcl (1 collections, 198980 events)
Selected 4 collections, 1598980/53931309 events, ~4012.2/pb, from bbkr18 at ral

You may be prompted for a password to connect to gromit.slac.stanford.edu. You should enter your normal SLAC password. This will only be done the first time - thereafter the database connection information will be cached (in ~/.bbk/sites). See the Bookkeeping FAQ for help if this doesn't work.

Each job will run over one of these tcl files. We limited the number of events to 200000 per job to ensure that no job ran over the time limit. The right limit depends upon your particular analysis job, so you should experiment with different splittings. Longer jobs are more efficient and simpler to manage, but you want a reasonable turnaround, so it's best not to have a job that runs longer than a day or so (the default job limit is 24 hours of CPU time). For this simple example, each job takes about an hour of CPU time.
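The splitting arithmetic can be checked with a short sketch. This is only an illustration in Python 3 (which is an assumption; it is not part of the BaBar release), and the function names split_events and estimated_cpu_hours are hypothetical. The processing rate of roughly 200000 events per CPU hour is taken from this simple example and will differ for a real analysis.

```python
def split_events(total_events, events_per_job):
    """Return the number of events each job would process."""
    full_jobs, remainder = divmod(total_events, events_per_job)
    sizes = [events_per_job] * full_jobs
    if remainder:
        sizes.append(remainder)
    return sizes


def estimated_cpu_hours(events, events_per_cpu_hour):
    """Scale a measured processing rate to a given number of events."""
    return events / events_per_cpu_hour


# Totals taken from the BbkDatasetTcl output above.
sizes = split_events(1598980, 200000)
print(len(sizes), "jobs; last job has", sizes[-1], "events")

# Assumption: roughly 200000 events per CPU hour, as in this simple example.
print(estimated_cpu_hours(2000000, 200000), "CPU hours for a 2M-event job")
```

For the numbers above this gives eight jobs, the last with 198980 events, matching the files written by BbkDatasetTcl. Measure the rate of your own job on a short test before choosing a larger split.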

Each job will need a tcl snippet to configure how it is to be run. Here is an example for the first job. Let's call that run-A0-Run5-OffPeak-R18b-1.tcl (to match the data specification file, A0-Run5-OffPeak-R18b-1.tcl):-

source A0-Run5-OffPeak-R18b-1.tcl
set levelOfDetail "cache"
set ConfigPatch   "Run2"
set BetaMiniTuple "root"
set histFileName  "hist-A0-Run5-OffPeak-R18b-1.root"
sourceFoundFile BetaMiniUser/MyMiniAnalysis.tcl

For a quick test, you may want to add

set NEvent 500

at the top (or anywhere before the sourceFoundFile line) to just run over the first 500 events.

Make sure that each job's output files (hist-A0-Run5-OffPeak-R18b-1.root in this example) have unique names, otherwise one job will overwrite another's output. ConfigPatch should be set to "Run2" (for real data) or "MC" (for Monte Carlo).
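Since the snippets differ only in the file names, they can be generated rather than typed. The following is a minimal sketch in Python 3 (an assumption; any scripting language would do). The template is copied from the example snippet above, while the helper names run_snippet and write_snippets are hypothetical. Deriving the histogram name from the collection file name keeps the output names unique.

```python
from pathlib import Path

# Template copied from the example snippet above; only the file names and
# ConfigPatch vary from job to job.
SNIPPET = """source {collection_tcl}
set levelOfDetail "cache"
set ConfigPatch   "{config_patch}"
set BetaMiniTuple "root"
set histFileName  "hist-{stem}.root"
sourceFoundFile BetaMiniUser/MyMiniAnalysis.tcl
"""


def run_snippet(collection_tcl, config_patch="Run2", n_event=None):
    """Return (file name, contents) of the run snippet for one collection file."""
    stem = Path(collection_tcl).stem  # e.g. A0-Run5-OffPeak-R18b-1
    text = SNIPPET.format(collection_tcl=collection_tcl,
                          config_patch=config_patch, stem=stem)
    if n_event is not None:
        # NEvent must be set before the sourceFoundFile line.
        text = f"set NEvent {n_event}\n" + text
    return f"run-{stem}.tcl", text


def write_snippets(workdir, pattern="A0-Run5-OffPeak-R18b-*.tcl", **options):
    """Write one run snippet per collection file found in workdir."""
    written = []
    for collection in sorted(Path(workdir).glob(pattern)):
        name, text = run_snippet(collection.name, **options)
        (Path(workdir) / name).write_text(text)
        written.append(name)
    return written
```

Calling write_snippets(".") from the workdir would write run-A0-Run5-OffPeak-R18b-1.tcl and its siblings next to the collection files. Pass config_patch="MC" for Monte Carlo, or n_event=500 for a quick test.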

Run

Set up the conditions database:-

[csfd] ~/ana30/workdir > cond18boot
Setting OO_FD_BOOT to babarams1.rl.ac.uk::/raid/objy/databases/conditions/current/194/BaBar.BOOT
and unsetting OO_AMS_USAGE

Remember to redo the srtpath and cond18boot steps each time you log in.

From the workdir, we submit the first job like this:-

[csfd] ~/ana30/workdir > bbrbsub BetaMiniApp run-A0-Run5-OffPeak-R18b-1.tcl
1821979.csflnx353.rl.ac.uk

[csfd] ~/ana30/workdir > qstat -u babartst

csflnx353.rl.ac.uk: 
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1821979.csflnx3 babartst sl3p     BetaMiniAp    --    1  --    --  24:00 Q   -- 

You can use the bbrbsub -mae option to e-mail you when the job aborts or exits. The e-mail is sent to your RAL Tier A account, so you should keep an eye on that, or else set up a .forward file in your home directory.

When the job is done, the logfile is available in a file BetaMiniApp.o1821979 (where the number is the job number displayed by the bbrbsub command). There may also be a BetaMiniApp.e1821979 file, which contains any error messages. The logfile contains far more detail than you are likely to be interested in but, after a lot of setup, you should see messages like

EvtCounter: processing event # 1 [ 7f:4fff7fff:39d838/d95339ab:Y ]

finishing with

EvtCounter: processing event # 200000 [ 7f:4fff7fff:39dc0a/516c1407:X ]
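With several jobs it is tedious to read each logfile by eye. The check can be sketched as below, in Python 3 (an assumption). It relies only on the EvtCounter line format shown above and assumes that the last event processed is reported in the log, as it is in this example. The function names are hypothetical.

```python
import re

# Matches lines such as "EvtCounter: processing event # 200000 [ ... ]".
EVENT_LINE = re.compile(r"EvtCounter: processing event # (\d+)")


def last_event_number(log_text):
    """Return the highest event number reported by EvtCounter, or None."""
    numbers = [int(m.group(1)) for m in EVENT_LINE.finditer(log_text)]
    return max(numbers) if numbers else None


def job_looks_complete(log_text, expected_events):
    """True if the log reports an event number at or beyond expected_events."""
    last = last_event_number(log_text)
    return last is not None and last >= expected_events


# The two log lines quoted above, used here as a small self-check.
sample = """EvtCounter: processing event # 1 [ 7f:4fff7fff:39d838/d95339ab:Y ]
EvtCounter: processing event # 200000 [ 7f:4fff7fff:39dc0a/516c1407:X ]
"""
print(job_looks_complete(sample, 200000))
```

In practice the text would come from the logfile, for example open("BetaMiniApp.o1821979").read(). A job that passes this check may still have reported problems, so look at the corresponding .e file as well.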

See the section on BaBar RAL Tier A documentation for details of job submission at RAL. Job submission is the only major difference between running at RAL and running at SLAC.

Comments

That's just the first of 8 jobs done. Of course many people set up scripts to create the tcl snippets and submit the jobs. A general-purpose framework for doing this is the Simple Job Manager.
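As an illustration of such a script, the sketch below builds one bbrbsub command per job, following the run-A0-Run5-OffPeak-R18b-N.tcl naming used in this example. It is written in Python 3 (an assumption) and the function names are hypothetical. It is not the Simple Job Manager. By default it only prints the commands. To submit for real it would have to be started from the workdir, in a shell where srtpath and cond18boot have already been run.

```python
import subprocess


def submit_commands(n_jobs, dataset="A0-Run5-OffPeak-R18b", app="BetaMiniApp"):
    """Build one bbrbsub command line per run snippet."""
    return [["bbrbsub", app, f"run-{dataset}-{i}.tcl"]
            for i in range(1, n_jobs + 1)]


def submit_all(commands, dry_run=True):
    """Print each command; run it as well unless dry_run is set."""
    job_ids = []
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            result = subprocess.run(command, check=True,
                                    capture_output=True, text=True)
            # bbrbsub prints the batch job ID, e.g. 1821979.csflnx353.rl.ac.uk
            job_ids.append(result.stdout.strip())
    return job_ids


commands = submit_commands(8)
submit_all(commands)  # dry run: only prints the eight command lines
```

Calling submit_all(commands, dry_run=False) would submit the jobs and return the job IDs, which can then be matched against the qstat listing and the logfile names.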

Further Information