BaBar Skimming Framework Outline

Overview

Skimming takes an input dataset and runs physics selection code against it, producing a number of output streams, each containing the subset of events that matches certain selection criteria. Each event in the input sample (real data or Monte Carlo) can be selected zero or more times, depending on the event itself.

Splitting the input sample into chunks that can be processed in a reasonable time results in the output streams being split into many small files. For each stream these files are gathered together and merged into larger files, which are more efficient for storage and analysis.

There are three distinct stages needed to skim BaBar data on the grid:

  1. Preloading the data at the execution site
  2. Creating, running and recovering the jobs
  3. Merging and exporting the output data

Requirements

  • Central Submission Host.
  • Grid worker nodes.
  • Grid storage close to and accessible from the worker nodes, for unskimmed input data. Accessible via xrootd or dcap.
  • Merge nodes; these must accept non-grid submission so that merge jobs can be placed on them.
  • Interim skim data storage: X TB of working space accessible to the central submission host and the merge nodes via NFS or xrootd.

Submission/Merging Site

  • Full install of BaBar Software
  • Sufficient CPUs to merge the returned data
  • Enough disk to hold the returned data before merging and the merged data before export; this storage should be performant enough to serve the data to the merging CPUs and receive the output
  • Bandwidth to export the output data to SLAC

Processing Site

  • Bandwidth to import the input data
  • Storage to hold the input data and serve it to the CPUs
  • The Skimming code installed and available to all CPUs
  • CPUs to run the skimming code

Input Data Management

Process Outline

Check out from CVS

You can try out the task manager database setup.

You need to check out:

  • addpkg BbkTaskManager gc20070417b
  • addpkg BbkJobWrappers cajb20070312a
  • addpkg BbkSqlAbstractor tja070321a
  • addpkg BbkTools tja070108a
  • addpkg BbkUserTools tja070109a
  • addpkg BbkTMGridTools gc20070424a

You should do this in an appropriate release.

Then type

  • gmake binscripts
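
The whole checkout and build sequence can also be scripted; a minimal shell sketch, assuming you are already inside a suitable test release directory:

# Check out the task manager packages at the tags listed above
for PKG_TAG in "BbkTaskManager gc20070417b" \
               "BbkJobWrappers cajb20070312a" \
               "BbkSqlAbstractor tja070321a" \
               "BbkTools tja070108a" \
               "BbkUserTools tja070109a" \
               "BbkTMGridTools gc20070424a"; do
    addpkg $PKG_TAG    # word-splitting gives: addpkg <package> <tag>
done

# Build the scripts
gmake binscripts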

Now copy configFileTaskTemplate.txt and configFileTask.txt from BbkTaskManager to your workdir and edit configFileTask.txt to set

  • dbname = stm1
  • dbsite = ral

Then run:

BbkCreateTask --dbsite=ral --dbname=stm1 --dbuser=bbruser configFileTask.txt

You can check that it is OK with:

BbkTaskInfo --dbsite=ral --dbname=stm1 --dbuser=bbruser --listtasks

Task Definition

To create a task, two configuration files have to be created in the workdir directory:

  • configFileTask.txt
  • configFileTaskTemplate.txt

The command for creating a task is:

BbkCreateTask --dbsite=ral --dbname=stm1 --dbuser=bbruser configFileTask.txt

where the values of dbsite, dbname and dbuser have to be set properly.

Task Info

The command for getting the information about a task is:

BbkTaskInfo --dbsite=ral --dbname=stm1 --dbuser=bbruser --listtasks

Other useful commands...

BbkTMUser

BbkTMUser --dbname stm1 --dbsite ral --dbuser bbruser --taskname <taskname> --sjob_status 1 sjob_id rundir
BbkTMUser --dbsite ral --dbname stm1 --dbuser bbruser --taskname <taskname> 'MIN(sjob_id)' --sjob_status 1 -q --style=list
BbkTMUser --dbsite ral --dbname stm1 --dbuser bbruser --taskname <taskname> 'MAX(sjob_id)' --sjob_status 1 -q --style=list


You can then use BbkTMUser to look at the jobs created in the database.

Super job -> sub-job relation (to scan through the database): in the skim production setup every super job corresponds to exactly one full input collection (skim or PR). This input collection is in turn split into several separate sub-jobs, with the maximum number of events per job defined by the SkimMaxEvents option in the configuration file. Every sub-job also gets a sequence number, so that it can be determined later where that sub-job sits in the overall sequence of events when all the small collections are merged back together again.
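
As a purely illustrative shell sketch of that splitting (this is not the BbkCreateSkimJobs logic; the event counts used here are invented):

NEVENTS=250000        # events in one input collection, i.e. one super job (made-up number)
SKIMMAXEVENTS=100000  # value of the SkimMaxEvents config option (made-up number)

# Number of sub-jobs needed, rounding up
NSUBJOBS=$(( (NEVENTS + SKIMMAXEVENTS - 1) / SKIMMAXEVENTS ))
echo "Super job splits into $NSUBJOBS sub-jobs"

# Each sub-job gets a sequence number used later to reassemble the merged output in order
for SEQNO in $(seq 1 $NSUBJOBS); do
    echo "  sub-job with sequence number $SEQNO"
done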

BbkTMUser allows you to query the details of what is in the database. You can get a list of the available columns with BbkTMUser --list-columns.

You can look at the table layout in the BbkTaskManager package and a graph of it at http://www.cwroethel.net/Projects/projects.jsp?project=taskmanager2

I don't think it's too hard to understand.

To get a list of collections used as input to a task run:

BbkTMUser --dbsite ... --taskname <taskname> ssin_coll ss_id

To get a list of all jobs for a task, displaying the relationship between super jobs (ss_id), skim jobs (sjob_id), sequence no (sjob_seqno) and input for each individual job (sjin_coll) run this:

BbkTMUser --dbsite ... --taskname <taskname> ss_id sjob_id sjob_seqno sjin_coll

To get a list of collections from a dataset that have so far been used to create jobs in your task, run:

BbkTMUser --dbsite <...> --dbname <...> --dbuser <...> --taskname <taskname> ssin_coll

e.g.:

roethel@noric03> BbkTMUser --dbsite local/sparky --dbname bbrora --dbuser anyuser --taskname <taskname> ssin_coll
: SSIN_COLL                                               :
: /store/PRskims/R18/18.6.0b/AllEvents/88/AllEvents_8897  :
: /store/PRskims/R18/18.6.0b/AllEvents/90/AllEvents_9007  :
: /store/PRskims/R18/18.6.0b/AllEvents/88/AllEvents_8847  :
: /store/PRskims/R18/18.6.0b/AllEvents/88/AllEvents_8851  :
: /store/PRskims/R18/18.6.0b/AllEvents/90/AllEvents_9002  :
: /store/PRskims/R18/18.6.1b/AllEvents/32/AllEvents_13289 :
: /store/PRskims/R18/18.6.0b/AllEvents/88/AllEvents_8848  :
: /store/PRskims/R18/18.6.0b/AllEvents/88/AllEvents_8861  :
8 rows returned from bbrora at local/sparky

To find submitted jobs, run e.g.:

BbkTMUser --dbname stm1 --dbsite ral --dbuser bbruser --taskname <taskname> --sjob_status 1 sjob_id rundir

An example with the BBK database:

BbkUser --dbname=bbkr18 'MAX(dse_id)' -s
SELECT    MAX(bbkr18.bbk_dsentities.id) AS "max_dse_id_" FROM bbkr18.bbk_dsentities;
MAX_DSE_ID
672192
1 rows returned from bbkr18 at ral

(you need the quotes, otherwise the shell interprets the brackets). You can use any SQL function, applied to the logical column name (it might not always do what you want for more complicated aggregate expressions, but this is a simple case).
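
For example, counting the submitted jobs of a task with the standard SQL COUNT aggregate should work in the same way (this particular query is an illustration built from the options shown above, not taken from the original documentation):

BbkTMUser --dbsite ral --dbname stm1 --dbuser bbruser --taskname <taskname> 'COUNT(sjob_id)' --sjob_status 1 -q --style=list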

The easiest way to get the minimum and maximum job IDs is then:

 BbkTMUser --dbsite ral --dbname stm1 --dbuser bbruser --taskname <taskname> 'MIN(sjob_id)' 'MAX(sjob_id)'

(Replace <taskname> with the correct task name if it is not.) Also use the desired output formatting and possibly the '-q' option. If you are only looking at submitted jobs, add the option --sjob_status 1, i.e.:

roethel@noric03> BbkTMUser --dbsite ral --dbname stm1 --dbuser bbruser --taskname <taskname> 'MIN(sjob_id)' 'MAX(sjob_id)' --sjob_status 1
: MIN_SJOB_ID : MAX_SJOB_ID :
: 78792       : 78796       :


Adding --style=list -q you get back just the two numbers you want without all the extraneous output.
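
A sketch of how those two queries might be used from a shell script to capture the ID range (the variable names are illustrative, and <taskname> is a placeholder as elsewhere):

MINJOB=$(BbkTMUser --dbsite ral --dbname stm1 --dbuser bbruser --taskname <taskname> 'MIN(sjob_id)' --sjob_status 1 -q --style=list)
MAXJOB=$(BbkTMUser --dbsite ral --dbname stm1 --dbuser bbruser --taskname <taskname> 'MAX(sjob_id)' --sjob_status 1 -q --style=list)
echo "Submitted job IDs run from $MINJOB to $MAXJOB"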

BbkEditTask

BbkEditTask -c <config-update-file> <taskname>
BbkEditTask -w <current-config-file> <taskname>

You can re-use your existing task by simply creating a config update file containing the parameter(s) you would like to change (anything except the database connection parameters, which you cannot update) and running BbkEditTask -c <config-update-file> <taskname>

You can always check the current configuration parameters of a task with BbkEditTask -w <current-config-file> <taskname> and the current task configuration will be dumped into <current-config-file>.

When the task is created or edited the contents of the template files are read in and stored in the database. So if you want to update the configuration you not only need to change the template files, but also make the task manager read in and store the new contents of the files. Trying

BbkEditTask --writeall ....

you will see what I mean.

(Just put these two lines into a file, e.g. one called configUpdates.txt, and run

BbkEditTask -c configUpdates.txt <taskname> 

to update the task. You can check if the update was successful by looking at the new configuration with

BbkEditTask --writeall configDetails.txt <taskname> 

where the newly created file configDetails.txt will contain all the details on the current configuration including the currently stored script templates).

But if you just run BbkEditTask as suggested you don't need it. (Try using BbkEditTask if possible: it helps you avoid creating all the jobs again; you just keep what you have already created, and newly submitted jobs will use the new configuration.)
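
As an illustration, a config update file usually only needs the parameter being changed. SkimMaxEvents is mentioned above as a task configuration option, but the value and exact syntax here are assumptions, so check configFileTaskTemplate.txt for the real format:

SkimMaxEvents = 20000

Applied with:

BbkEditTask -c configUpdates.txt <taskname>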

Skim Job Processing

Job Definition

Job Creation

BbkCreateSkimJobs -n 2 <taskname>


BbkCreateSkimJobs is now validated to create skim jobs. For test purposes there is currently a command line option available to limit the number of super jobs created, e.g.

BbkCreateSkimJobs -n 2 <taskname>

will create 2 super jobs for _each_ dataset available for the task (yes, that's somewhat of a bug, but whatever).

The -t option will create the objects but not write them to the database (good for debugging and playing around).
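
A possible way to combine these options when testing, assuming -n and -t can be used together (the task name is a placeholder):

# Dry run: build 2 super jobs per dataset but write nothing to the database
BbkCreateSkimJobs -n 2 -t <taskname>

# Once that looks sensible, create the jobs for real
BbkCreateSkimJobs -n 2 <taskname>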

Job Submission

BbkSubmitSkims -v --njobs 1 <taskname>

The optional -t option creates the whole directory tree without submitting the jobs.
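
A possible test sequence using only the options described above (the task name is a placeholder):

# Build the run directories without actually submitting anything
BbkSubmitSkims -v -t <taskname>

# Submit a single job once the directory tree looks correct
BbkSubmitSkims -v --njobs 1 <taskname>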

Job Execution

Where: Grid Worker Node

Job Checking

Where: Grid Worker Node

Once the skim job execution has completed, a check script is run on the worker node which parses the logs and checks the output data.

BbkCheckSkims <taskname>


The signature for checking jobs is as follows (a shell sketch of this logic is given after the list):

job finished:

  • job wrapper report file exists
  • log file exists
  • job wrapper report file contains the entry "Bbk::TM.wrapperExitCode"

job status unknown (appears to be hung or crashed):

  • log file exists
  • job wrapper file does not exist or exists but does not contain the entry "Bbk::TM.wrapperExitCode"
  • log file has last been updated > 1 hour ago

job still pending or running:

  • any other configuration.
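
A minimal shell sketch of that classification (this is not the actual BbkCheckSkims implementation; the file names wrapper.report and skim.log are made up):

RUNDIR=<rundir>                  # placeholder for the job's run directory
REPORT=$RUNDIR/wrapper.report    # hypothetical job wrapper report file
LOG=$RUNDIR/skim.log             # hypothetical log file

if [ -f "$REPORT" ] && [ -f "$LOG" ] && grep -q "Bbk::TM.wrapperExitCode" "$REPORT"; then
    echo "job finished"
elif [ -f "$LOG" ] && [ -z "$(find "$LOG" -mmin -60)" ]; then
    # Log exists but has not been updated for over an hour and no wrapper exit code was written
    echo "job status unknown (hung or crashed?)"
else
    echo "job still pending or running"
fi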

Output Data Upload

Where: Grid Worker Node

Once the output has been checked, the data and log files are gathered together into a tarball and copied up to a grid storage element.
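
A rough sketch of that step; the file patterns, the use of globus-url-copy and the destination URL are all assumptions, since the actual transfer tool and storage element are site-dependent:

# Bundle the skim output and log files from the run directory (patterns are hypothetical)
tar czf skim_output.tar.gz *.root *.log

# Copy the tarball up to the site's storage element (endpoint and path are placeholders)
globus-url-copy file://$PWD/skim_output.tar.gz gsiftp://<storage-element>/<path>/skim_output.tar.gz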

Job Recovery

When the grid job monitoring reports the job as "Done", the output data is recovered from the grid storage element. The log files are copied back into the run directory and the skimmed data is copied to the interim skim storage.

Output Data Merging

Once enough skimmed data has been accumulated on the interim storage, merge jobs are defined which take the output of many jobs and merge the files from each of the streams to produce a larger file that is easier to manage.
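
As an illustration of the grouping only (the interim directory layout is invented, and the actual merging is done by the BaBar merge application, which is not shown here):

INTERIM=/path/to/interim/skim/storage    # placeholder for the interim skim storage
# Count the small per-job files waiting to be merged for each output stream
for STREAM in $(ls "$INTERIM"); do
    echo "Stream $STREAM: $(ls "$INTERIM/$STREAM" | wc -l) files ready to merge"
done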

Merged Data Export

Job Monitoring

Chris brew & Giuliano Castelli 12:10, 21 Aug 2006 (BST)