BaBar Skimming Operational Tasks

Clearing skim jobs stuck in state aborted

First get yourself a list of the aborted skim jobs:

[bbrprod@pc20 workdir]$ BbkTMGridJobStatus --list --skim MAN-Task01-funny04-R24a1 | grep aborted
435460 V05 as https://lcgrb02.gridpp.rl.ac.uk:9000/FuJL7rhFAtVeco389y41PA aborted Sat Aug  9 18:05:15 2008
435461 V05 as https://lcgrb02.gridpp.rl.ac.uk:9000/NXp58RMmIun5DgwUa8rjDQ aborted Sat Aug  9 18:05:13 2008
<...>
441059 V01 as https://lcgrb02.gridpp.rl.ac.uk:9000/uorwKaMe6P1J-ONWkHuIVA aborted Tue Aug 12 17:11:12 2008

Then use BbkCheckSkims to reset them back to pending:

[bbrprod@pc20 workdir]$ BbkCheckSkims --pend --recover --jobid=435460 MAN-Task01-funny04-R24a1 
Setting job status 435460:
    Resetting job to status 'prepared'.

Now if there are quite a lot of aborted jobs, rather than doing them one at a time you can give BbkCheckSkims a comma-separated list:

[bbrprod@pc20 workdir]$ BbkTMGridJobStatus --list --skim MAN-Task01-funny04-R24a1 | awk '/aborted/{print $1}' | xargs | tr ' ' , 
435461,439069,439079,439142,439185,439186,439187,439188,439189,439243,439365,439390,439393,439523,439557,439617,441059 
[bbrprod@pc20 workdir]$ BbkCheckSkims --pend --recover --jobid=435461,439069,439079,439142,439185,439186,439187,439188,439189,439243,439365,439390,439393,439523,439557,439617,441059 MAN-Task01-funny04-R24a1
Setting job status 435461:
    Resetting job to status 'prepared'.
Setting job status 439069:
    Resetting job to status 'prepared'.
<...>

Clearing skim jobs stuck in state cleared

Mostly these should be treated like aborted jobs, since in general it means a problem on the worker node caused the job payload to fail to run the skim wrapper even though the "grid" portion of the job ran successfully. However, sometimes these are caused by a glitch in the retrieval system, so it's worth first retrying the BbkRetrieveGridJob command with the --force option.

[bbrprod@pc20 workdir]$ BbkTMGridJobStatus --list --skim MAN-Task01-funny04-R24a1 | grep cleared
435968 V03 as https://lcgrb02.gridpp.rl.ac.uk:9000/AbpHQhLOVxSym4ElzuVdWg cleared Mon Aug 11 22:49:34 2008
<...>
[bbrprod@pc20 workdir]$ BbkRetrieveGridJob --force --temp /var/tmp MAN-Task01-funny04-R24a1 435968
BbkRetrieveGridJob error: Could not get output for BbkTMGridTools::BbkTMGridSkimJob=HASH(0xa841508)->gridJobId
BbkRetrieveGridJob warning: /skims_435968.out not returned
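
If the forced retrieve still fails like this, treat the cleared jobs like aborted ones and reset them back to pending; a sketch reusing the commands from the previous section:

JOBS=`BbkTMGridJobStatus --list --skim MAN-Task01-funny04-R24a1 | awk '/cleared/{print $1}' | xargs | tr ' ' ,`
BbkCheckSkims --pend --recover --jobid=$JOBS MAN-Task01-funny04-R24a1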

Abandoning merges that will never complete

If you have a merge job that is never going to run successfully, for instance because some of the input files are missing, then it needs to be abandoned and the input jobs reset.

Unlike the skim jobs, the merge jobs do not have a hold placed on them after three failures (which would stop them being resubmitted without manual intervention), so you should keep an eye on the version numbers of the built and submitted merge jobs. For example, to check the versions of the submitted jobs:

[bbrprod@pc20 workdir]$ BbkShowMergeJobs --status 1 MAN-Task01-R24a3

   merge job id   job version       rundirectory   
-------------------------------------------------------------------------------
       1028         1  /mnt/data/export/tasks/merge/tasks/MAN-Task01-R24a3/01/1028/MergeV01

The status number of the prepared jobs is '0' (zero); not all prepared jobs will be listed, just those that have already been submitted at least once.
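
So to list those prepared jobs you would presumably run:

BbkShowMergeJobs --status 0 MAN-Task01-R24a3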

If the version number gets above two or three you should look at the log files of the previous versions and try to work out what the problem is. If it looks as though the merge job will never succeed (say the input files are missing or corrupt) then you need to reset the merge super job.
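
Assuming each attempt keeps its own MergeVnn run directory alongside the one shown by BbkShowMergeJobs above, you can find the logs of the earlier attempts with something like:

ls -lrt /mnt/data/export/tasks/merge/tasks/MAN-Task01-R24a3/01/1028/MergeV*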

To get the merge super id for a merge job run:

[bbrprod@pc20 workdir]$ BbkTMUser --dbsite man --dbname stm1 --ms_taskname=MAN-Task01-R24a3 --mjob_id=1028 ms_id
MS_ID
26
1 rows returned from stm1 at man

Then to abandon that merge super job and reset the skim jobs:

BbkCheckMerges --abandon --mergesuper=26 MAN-Task01-R24a3
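
Since BbkTMUser with --quiet prints just the bare value (as in the file-check loop below), the two steps can be combined; a sketch:

MSID=`BbkTMUser --dbsite man --dbname stm1 --quiet --ms_taskname=MAN-Task01-R24a3 --mjob_id=1028 ms_id`
BbkCheckMerges --abandon --mergesuper=$MSID MAN-Task01-R24a3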

Checking the input files for a merge job

[bbrprod@pc20 workdir]$ for file in `BbkTMUser --dbsite man --dbname stm1 --quiet --ms_taskname=MAN-Task01-R24a3 --mjob_id=1028 sfile_name`
> do
> KanFileCheck $file 2>/dev/null
> done
/prod/skims/MAN-Task01-R24a3/09/9711/SkimV01/BToPhi3K.01.root exists
/prod/skims/MAN-Task01-R24a3/09/9712/SkimV01/BToPhi3K.01.root exists
/prod/skims/MAN-Task01-R24a3/09/9713/SkimV01/BToPhi3K.01.root exists
/prod/skims/MAN-Task01-R24a3/09/9714/SkimV01/BToPhi3K.01.root exists
...
/prod/skims/MAN-Task01-R24a3/10/10197/SkimV01/BToPhi3K.01.root exists
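
If you only want to see the problem files, the same loop can be filtered to drop the ones KanFileCheck reports as existing:

for file in `BbkTMUser --dbsite man --dbname stm1 --quiet --ms_taskname=MAN-Task01-R24a3 --mjob_id=1028 sfile_name`
do
  KanFileCheck $file 2>/dev/null | grep -v ' exists$'
done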

Stopping sprite from performing an action for a period

Sometimes when you hit a problem it is useful to stop sprite from performing a specific action for a period while you look into it or put a fix in place.

Sprite keeps track of its previous actions by touching files in the skim task directory:

[bbrprod@pc20 workdir]$ ls -1 ~/skimming/tasks/MAN-Task01-R24a3/.sprite_*
/home/bbrprod/skimming/tasks/MAN-Task01-R24a3/.sprite_check_merges
/home/bbrprod/skimming/tasks/MAN-Task01-R24a3/.sprite_check_skims
/home/bbrprod/skimming/tasks/MAN-Task01-R24a3/.sprite_create_merges
/home/bbrprod/skimming/tasks/MAN-Task01-R24a3/.sprite_create_skims
/home/bbrprod/skimming/tasks/MAN-Task01-R24a3/.sprite_export_data
/home/bbrprod/skimming/tasks/MAN-Task01-R24a3/.sprite_fix_hanging
/home/bbrprod/skimming/tasks/MAN-Task01-R24a3/.sprite_submit_merge_jobs
/home/bbrprod/skimming/tasks/MAN-Task01-R24a3/.sprite_submit_skim_jobs

(the directory is defined in the config file as SkimRunPrefix) and then looking at the timestamps on the files. It is therefore quite easy to touch one or more of the files with a date in the future to prevent that action being performed until that time.
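
You can see the current timestamps (and so when each action will next be allowed to run) with ls -l:

ls -l ~/skimming/tasks/MAN-Task01-R24a3/.sprite_*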

So to stop Sprite from trying to export any data for the next four hours you would run:

touch --date "`date --date 'now + 4 hours'`" ~/skimming/tasks/MAN-Task01-R24a3/.sprite_export_data

or to stop anything happening for the next day:

touch --date "`date --date 'now + 24 hours'`" ~/skimming/tasks/MAN-Task01-R24a3/.sprite_*
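
To let the actions run again straight away, just touch the files back to the current time:

touch ~/skimming/tasks/MAN-Task01-R24a3/.sprite_*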

Checking the existence of a file or collection in the xrootd system

You can use KanCollUtil and KanFileCheck to see if a collection or file is good in the xrootd system.

KanFileCheck takes the file name (and probably produces a load of useless warnings):

[bbrprod@pc20 workdir]$ KanFileCheck /prod/skims/MAN-Task01-R24a3/10/10111/SkimV01/BToPhi3K.01.root
Warning in <TClass::TClass>: no dictionary for class KanArray_String is available
Warning in <TClass::TClass>: no dictionary for class KanArray is available
Warning in <TClass::TClass>: no dictionary for class KanBranch is available
...
Warning in <TClass::TClass>: no dictionary for class BtaCandVtxI is available
/prod/skims/MAN-Task01-R24a3/10/10111/SkimV01/BToPhi3K.01.root exists

KanCollUtil takes the collection name (the file name with the .XXxxx.root stripped off):

[bbrprod@pc20 workdir]$ KanCollUtil /prod/skims/MAN-Task01-R24a3/10/10111/SkimV01/BToPhi3K
/prod/skims/MAN-Task01-R24a3/10/10111/SkimV01/BToPhi3K (299 events)

These will work for the input collections and the output from the skim and merge jobs. The only thing to note is that the output files of the merge jobs have /prod prefixed to the file and collection names in xrootd, to keep them in a separate area, so:

[bbrprod@pc20 workdir]$ BbkTMUser --dbsite man --dbname stm1 --ms_taskname=MAN-Task01-R24a3 --mjob_id=1027 mfile_name
MFILE_NAME
/store/PRskims/R24/24.3.3b/BToPXX/59/BToPXX_135983.01.root
1 rows returned from stm1 at man
[bbrprod@pc20 workdir]$ KanFileCheck /store/PRskims/R24/24.3.3b/BToPXX/59/BToPXX_135983.01.root
Error in <TXNetFile::CreateXClient>: open attempt failed on root://xrootd01.tier2.hep.manchester.ac.uk//store/PRskims/R24/24.3.3b/BToPXX/59/BToPXX_135983.01.root
/store/PRskims/R24/24.3.3b/BToPXX/59/BToPXX_135983.01.root does not exist
[bbrprod@pc20 workdir]$ KanFileCheck /prod/store/PRskims/R24/24.3.3b/BToPXX/59/BToPXX_135983.01.root 2>/dev/null
/prod/store/PRskims/R24/24.3.3b/BToPXX/59/BToPXX_135983.01.root exists
[bbrprod@pc20 workdir]$ KanCollUtil /prod/store/PRskims/R24/24.3.3b/BToPXX/59/BToPXX_135983
/prod/store/PRskims/R24/24.3.3b/BToPXX/59/BToPXX_135983 (220463 events)

Cleaning up the exported files from the export disk

At the moment the export command does not clean up files that have been successfully exported to SLAC, so we need to keep an eye on the space on the export disk.

To check the disk space use df:

[bbrprod@pc20 workdir]$ df -h /mnt/data
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb1             917G  554G  317G  64% /mnt/data

To actually delete files that have been exported and are good at SLAC:

for file in `BbkTMUser --dbsite=man --dbname stm1 --ms_taskname 'MAN-Task01-R24a?' --ms_proc_status=3 --distinct --quiet mfile_name`
do
  if [ -e /mnt/data/export$file ]
  then 
    rm /mnt/data/export$file
  fi
done

(if you just want to produce a list replace the rm with echo).
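
To get an idea of how much space a cleanup would free, you can feed the echoed list to du (a sketch; fine for a modest number of files, though xargs may split a very long list into several totals):

for file in `BbkTMUser --dbsite=man --dbname stm1 --ms_taskname 'MAN-Task01-R24a?' --ms_proc_status=3 --distinct --quiet mfile_name`
do
  if [ -e /mnt/data/export$file ]
  then
    echo /mnt/data/export$file
  fi
done | xargs -r du -ch | tail -1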

Another source of disk usage is merge files that have been retrieved off xrootd for export but have then been abandoned because another merge job has missing input data.

for file in `BbkTMUser --dbsite=man --dbname stm1 --ms_taskname 'MAN-Task01-R24a?' --mjob_status=6 --distinct --quiet mfile_name`
do
  if [ -e /mnt/data/export$file ]
  then 
    rm /mnt/data/export$file
  fi
done

I suspect that we should also be running something like this to remove the exported files from xrootd, since I think that step also fails.
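
Assuming the xrd client's rm command works against the redirector seen in the KanFileCheck error above, a sketch for the xrootd side (remembering the /prod prefix on the merge output names) might look like:

for file in `BbkTMUser --dbsite=man --dbname stm1 --ms_taskname 'MAN-Task01-R24a?' --mjob_status=6 --distinct --quiet mfile_name`
do
  xrd xrootd01.tier2.hep.manchester.ac.uk rm /prod$file
done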

Changing the Resource Broker the Grid Jobs are Submitted Via

Which Resource Broker is used is controlled by an entry in a file passed to the job submission command edg-job-submit via the --config-vo option. For production running (and other running, unless there is a need to change it) this should be the file /home/bbrprod/ui.conf, which should contain two uncommented lines like:

NSAddresses = {"lcgrb02.gridpp.rl.ac.uk:7772"};
LBAddresses = {{"lcgrb02.gridpp.rl.ac.uk:9000"}};

Both of these lines need to be edited to change the RB and should point to the same server (though the port numbers [the bit after the ':'] are different).
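
For example, to switch from lcgrb02 to lcgrb01 (the two port numbers stay as they are), something like this sed should do:

sed -i -e 's/lcgrb02\.gridpp\.rl\.ac\.uk/lcgrb01.gridpp.rl.ac.uk/' /home/bbrprod/ui.conf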

You can get a list of possible RBs using the lcg-infosites command:

[bbrprod@pc20 workdir]$ lcg-infosites --vo babar rb
lcgrb02.gridpp.rl.ac.uk:7772
grid014.ct.infn.it:7772
gridrb.fe.infn.it:7772
rb-fzk.gridka.de:7772
gram://prod-rb-01.pd.infn.it:7772
gfe01.hep.ph.ic.ac.uk:7772
rb-2-fzk.gridka.de:7772
lcgrb01.gridpp.rl.ac.uk:7772
egee-rb-01.cnaf.infn.it:7772
rb-1-fzk.gridka.de:7772

We've generally used lcgrb01 and lcgrb02 at RAL and egee-rb-01 at CNAF.