BaBar Skimming Operational Tasks
Chris brew (Talk | contribs)
Latest revision as of 11:10, 22 October 2008
Contents
- 1 Clearing skim jobs stuck in state aborted
- 2 Clearing skim jobs stuck in state cleared
- 3 Abandoning merges that will never complete
- 4 Checking the input files for a merge job
- 5 Stopping sprite from performing an action for a period
- 6 Checking the existence of a file or collection in the xrootd system
- 7 Cleaning up the exported files from the export disk
- 8 Changing the Resource Broker the Grid Jobs are Submitted Via
Clearing skim jobs stuck in state aborted
First get yourself a list of the aborted skim jobs:
[bbrprod@pc20 workdir]$ BbkTMGridJobStatus --list --skim MAN-Task01-funny04-R24a1 | grep aborted
435460 V05 as https://lcgrb02.gridpp.rl.ac.uk:9000/FuJL7rhFAtVeco389y41PA aborted Sat Aug 9 18:05:15 2008
435461 V05 as https://lcgrb02.gridpp.rl.ac.uk:9000/NXp58RMmIun5DgwUa8rjDQ aborted Sat Aug 9 18:05:13 2008
<...>
441059 V01 as https://lcgrb02.gridpp.rl.ac.uk:9000/uorwKaMe6P1J-ONWkHuIVA aborted Tue Aug 12 17:11:12 2008
Then use BbkCheckSkims to reset them back to pending:
[bbrprod@pc20 workdir]$ BbkCheckSkims --pend --recover --jobid=435460 MAN-Task01-funny04-R24a1
Setting job status 435460: Resetting job to status 'prepared'.
Now if there are quite a lot of aborted jobs, rather than doing them one at a time, you can give BbkCheckSkims a comma-separated list:
[bbrprod@pc20 workdir]$ BbkTMGridJobStatus --list --skim MAN-Task01-funny04-R24a1 | awk '/aborted/{print $1}' | xargs | tr ' ' ,
435461,439069,439079,439142,439185,439186,439187,439188,439189,439243,439365,439390,439393,439523,439557,439617,441059
[bbrprod@pc20 workdir]$ BbkCheckSkims --pend --recover --jobid=435461,439069,439079,439142,439185,439186,439187,439188,439189,439243,439365,439390,439393,439523,439557,439617,441059 MAN-Task01-funny04-R24a1
Setting job status 435461: Resetting job to status 'prepared'.
Setting job status 439069: Resetting job to status 'prepared'.
<...>
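The two steps can also be rolled into one small script. This is only a sketch: it assumes the BbkTMGridJobStatus output format shown above (job id in the first column, state further along the line), and it echoes the BbkCheckSkims command rather than running it until you remove the echo.

```shell
#!/bin/sh
# Sketch only: collect the ids of aborted jobs and build one BbkCheckSkims
# call from them.  Assumes the BbkTMGridJobStatus output format shown above.
TASK=MAN-Task01-funny04-R24a1

# Turn "one job id per line" into a comma-separated list.
join_ids () {
    awk '/aborted/{print $1}' | xargs | tr ' ' ,
}

IDS=`BbkTMGridJobStatus --list --skim $TASK 2>/dev/null | join_ids`
if [ -n "$IDS" ]
then
    # Dry run: drop the echo once the job list looks right.
    echo BbkCheckSkims --pend --recover --jobid=$IDS $TASK
fi
```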
Clearing skim jobs stuck in state cleared
Mostly these should be treated like aborted jobs, since in general it means a problem on the worker node meant that the job payload failed to run the skim wrapper, even though the "grid" portion of the job ran successfully. However, sometimes these are caused by a glitch in the retrieval system, so it's worth first retrying the retrieve job command with the --force option.
[bbrprod@pc20 workdir]$ BbkTMGridJobStatus --list --skim MAN-Task01-funny04-R24a1 | grep cleared
435968 V03 as https://lcgrb02.gridpp.rl.ac.uk:9000/AbpHQhLOVxSym4ElzuVdWg cleared Mon Aug 11 22:49:34 2008
<...>
[bbrprod@pc20 workdir]$ BbkRetrieveGridJob --force --temp /var/tmp MAN-Task01-funny04-R24a1 435968
BbkRetrieveGridJob error: Could not get output for BbkTMGridTools::BbkTMGridSkimJob=HASH(0xa841508)->gridJobId
BbkRetrieveGridJob warning: /skims_435968.out not returned
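Since the --force retrieval is worth trying before re-pending, one can loop it over every cleared job. A sketch, again assuming the transcript formats above; the echo keeps it a dry run.

```shell
#!/bin/sh
# Sketch: retry retrieval for every job stuck in 'cleared' before treating
# them as aborted.  Assumes the BbkTMGridJobStatus output format shown above.
TASK=MAN-Task01-funny04-R24a1

list_cleared () {
    awk '/cleared/{print $1}'
}

BbkTMGridJobStatus --list --skim $TASK 2>/dev/null | list_cleared |
while read jobid
do
    # Dry run: drop the echo to really retry the retrieval.
    echo BbkRetrieveGridJob --force --temp /var/tmp $TASK $jobid
done
```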
Abandoning merges that will never complete
If you get a merge job that never runs successfully, for instance because some of the input files are missing, then it needs to be abandoned and the input jobs reset.
Unlike the skim jobs, merge jobs do not have a hold on them that stops resubmission after three failures, so failed merges will keep being resubmitted without manual intervention. You should therefore keep an eye on the version numbers of the built and submitted merge jobs. For example, to check the versions of the submitted jobs:
[bbrprod@pc20 workdir]$ BbkShowMergeJobs --status 1 MAN-Task01-R24a3
merge job id  job version  rundirectory
-------------------------------------------------------------------------------
1028          1            /mnt/data/export/tasks/merge/tasks/MAN-Task01-R24a3/01/1028/MergeV01
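To avoid eyeballing the table, a small awk filter can flag jobs whose version has crept up. This is a sketch that assumes the BbkShowMergeJobs column layout shown above (merge job id first, then version); the threshold of three is an arbitrary choice.

```shell
#!/bin/sh
# Sketch: flag submitted merge jobs whose version number is three or more.
# Header and ruler lines are skipped by requiring a numeric first field.
TASK=MAN-Task01-R24a3

flag_high_versions () {
    awk '$1 ~ /^[0-9]+$/ && $2 >= 3 {print $1 " is at version " $2}'
}

BbkShowMergeJobs --status 1 $TASK 2>/dev/null | flag_high_versions
```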
The status number for prepared jobs is '0' (zero). Not all prepared jobs will be listed, just those that have already been submitted at least once.
If the version number gets above two or three you should look at the log files of the previous versions and try to work out what the problem is. If it looks as though the merge job will never succeed (say the input files are missing or corrupt) then you need to reset the merge super job.
To get the merge super id for a merge job run:
[bbrprod@pc20 workdir]$ BbkTMUser --dbsite man --dbname stm1 --ms_taskname=MAN-Task01-R24a3 --mjob_id=1028 ms_id
MS_ID
26
1 rows returned from stm1 at man
Then to abandon that merge super job and reset the skim jobs:
BbkCheckMerges --abandon --mergesuper=26 MAN-Task01-R24a3
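The lookup and the abandon can be chained. This sketch builds the BbkCheckMerges command from whatever ms_id BbkTMUser returns, and echoes it rather than running it.

```shell
#!/bin/sh
# Sketch: look up the merge super id for a merge job and build the matching
# abandon command.  Invocations are the ones shown above; the echo keeps
# this a dry run.
TASK=MAN-Task01-R24a3
MJOB=1028

abandon_cmd () {
    # $1 = merge super id, $2 = task name
    echo "BbkCheckMerges --abandon --mergesuper=$1 $2"
}

MSID=`BbkTMUser --dbsite man --dbname stm1 --quiet --ms_taskname=$TASK --mjob_id=$MJOB ms_id 2>/dev/null`
if [ -n "$MSID" ]
then
    abandon_cmd "$MSID" "$TASK"
fi
```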
Checking the input files for a merge job
[bbrprod@pc20 workdir]$ for file in `BbkTMUser --dbsite man --dbname stm1 --quiet --ms_taskname=MAN-Task01-R24a3 --mjob_id=1028 sfile_name`
> do
> KanFileCheck $file 2>/dev/null
> done
/prod/skims/MAN-Task01-R24a3/09/9711/SkimV01/BToPhi3K.01.root exists
/prod/skims/MAN-Task01-R24a3/09/9712/SkimV01/BToPhi3K.01.root exists
/prod/skims/MAN-Task01-R24a3/09/9713/SkimV01/BToPhi3K.01.root exists
/prod/skims/MAN-Task01-R24a3/09/9714/SkimV01/BToPhi3K.01.root exists
...
/prod/skims/MAN-Task01-R24a3/10/10197/SkimV01/BToPhi3K.01.root exists
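If the list is long it is easier to print only the missing files. A sketch, assuming KanFileCheck reports "&lt;file&gt; exists" or "&lt;file&gt; does not exist" as in the transcripts here:

```shell
#!/bin/sh
# Sketch: same loop as above, but only the missing input files are shown.
# Assumes KanFileCheck prints "<file> exists" / "<file> does not exist".
only_missing () {
    grep 'does not exist'
    true    # a clean run with no missing files is not an error
}

for file in `BbkTMUser --dbsite man --dbname stm1 --quiet --ms_taskname=MAN-Task01-R24a3 --mjob_id=1028 sfile_name 2>/dev/null`
do
    KanFileCheck $file 2>/dev/null
done | only_missing
```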
Stopping sprite from performing an action for a period
Sometimes when you hit a problem it is useful to stop sprite from performing a specific action for a period while you look into it or put a fix in place.
Sprite keeps track of its previous actions by touching files in the skim task directory:
[bbrprod@pc20 workdir]$ ls -1 ~/skimming/tasks/MAN-Task01-R24a3/.sprite_*
/home/bbrprod/skimming/tasks/MAN-Task01-R24a3/.sprite_check_merges
/home/bbrprod/skimming/tasks/MAN-Task01-R24a3/.sprite_check_skims
/home/bbrprod/skimming/tasks/MAN-Task01-R24a3/.sprite_create_merges
/home/bbrprod/skimming/tasks/MAN-Task01-R24a3/.sprite_create_skims
/home/bbrprod/skimming/tasks/MAN-Task01-R24a3/.sprite_export_data
/home/bbrprod/skimming/tasks/MAN-Task01-R24a3/.sprite_fix_hanging
/home/bbrprod/skimming/tasks/MAN-Task01-R24a3/.sprite_submit_merge_jobs
/home/bbrprod/skimming/tasks/MAN-Task01-R24a3/.sprite_submit_skim_jobs
(defined in the config file as SkimRunPrefix) and then looking at the timestamp on the files. It is therefore quite easy to touch one or more of the files with a date in the future to prevent that action being performed until that time.
So to stop Sprite from trying to export any data for the next four hours you would run:
touch --date "`date --date 'now + 4hours'`" ~/skimming/tasks/MAN-Task01-R24a3/.sprite_export_data
or to stop anything happening for the next day:
touch --date "`date --date 'now + 24hours'`" ~/skimming/tasks/MAN-Task01-R24a3/.sprite_*
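This can be wrapped in a small helper so that the date arithmetic, and the quoting (which matters because the date output contains spaces), live in one place. A sketch; the task directory and the action names are the ones shown above, and the helper itself is hypothetical:

```shell
#!/bin/sh
# Sketch: push a sprite timestamp file the given number of hours into the
# future.  Note the quotes around the date output, which contains spaces.
TASKDIR=${TASKDIR:-$HOME/skimming/tasks/MAN-Task01-R24a3}

pause_sprite () {
    # $1 = action (e.g. export_data), $2 = hours to pause it for
    touch --date "$(date --date "now + $2 hours")" "$TASKDIR/.sprite_$1"
}

# e.g. pause exports for four hours:
# pause_sprite export_data 4
```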
Checking the existence of a file or collection in the xrootd system
You can use KanCollUtil and KanFileCheck to see if a collection or file is good in the xrootd system.
KanFileCheck takes the file name (and probably produces a load of useless warnings):
[bbrprod@pc20 workdir]$ KanFileCheck /prod/skims/MAN-Task01-R24a3/10/10111/SkimV01/BToPhi3K.01.root
Warning in <TClass::TClass>: no dictionary for class KanArray_String is available
Warning in <TClass::TClass>: no dictionary for class KanArray is available
Warning in <TClass::TClass>: no dictionary for class KanBranch is available
...
Warning in <TClass::TClass>: no dictionary for class BtaCandVtxI is available
/prod/skims/MAN-Task01-R24a3/10/10111/SkimV01/BToPhi3K.01.root exists
KanCollUtil takes the collection name (the file name with the .XXxxx.root stripped off):
[bbrprod@pc20 workdir]$ KanCollUtil /prod/skims/MAN-Task01-R24a3/10/10111/SkimV01/BToPhi3K
/prod/skims/MAN-Task01-R24a3/10/10111/SkimV01/BToPhi3K (299 events)
These will work for the input collections and the output from the skim and merge jobs. The only thing to note is that the output files of the merge jobs have /prod prepended to the start of the file and collection names in xrootd, to keep them in a separate area, so:
[bbrprod@pc20 workdir]$ BbkTMUser --dbsite man --dbname stm1 --ms_taskname=MAN-Task01-R24a3 --mjob_id=1027 mfile_name
MFILE_NAME
/store/PRskims/R24/24.3.3b/BToPXX/59/BToPXX_135983.01.root
1 rows returned from stm1 at man
[bbrprod@pc20 workdir]$ KanFileCheck /store/PRskims/R24/24.3.3b/BToPXX/59/BToPXX_135983.01.root
Error in <TXNetFile::CreateXClient>: open attempt failed on root://xrootd01.tier2.hep.manchester.ac.uk//store/PRskims/R24/24.3.3b/BToPXX/59/BToPXX_135983.01.root
/store/PRskims/R24/24.3.3b/BToPXX/59/BToPXX_135983.01.root does not exist
[bbrprod@pc20 workdir]$ KanFileCheck /prod/store/PRskims/R24/24.3.3b/BToPXX/59/BToPXX_135983.01.root 2>/dev/null
/prod/store/PRskims/R24/24.3.3b/BToPXX/59/BToPXX_135983.01.root exists
[bbrprod@pc20 workdir]$ KanCollUtil /prod/store/PRskims/R24/24.3.3b/BToPXX/59/BToPXX_135983
/prod/store/PRskims/R24/24.3.3b/BToPXX/59/BToPXX_135983 (220463 events)
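The /prod mapping and the suffix stripping are easy to get wrong by hand, so here is a pair of tiny helpers. A sketch: the helpers are hypothetical, and the .NN.root suffix pattern is an assumption based on the examples above.

```shell
#!/bin/sh
# Sketch: map a merge output name from the database (mfile_name) to its
# xrootd path, and derive the collection name for KanCollUtil.
xrootd_path () {
    echo "/prod$1"
}

coll_name () {
    # strip a trailing .NN.root (assumed from the examples above)
    echo "$1" | sed 's/\.[0-9][0-9]*\.root$//'
}

# e.g.
# KanFileCheck "`xrootd_path /store/PRskims/R24/24.3.3b/BToPXX/59/BToPXX_135983.01.root`"
```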
Cleaning up the exported files from the export disk
At the moment the export command does not clean up files that have successfully been exported to SLAC, so we need to keep an eye on the space on the export disk.
To check the disk space use df:
[bbrprod@pc20 workdir]$ df -h /mnt/data
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb1             917G  554G  317G  64% /mnt/data
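The same check can be made cron-able with a small sketch that warns when the disk goes over a threshold; df -P is used for a stable one-line-per-filesystem format, and the 90% figure is an arbitrary choice.

```shell
#!/bin/sh
# Sketch: warn when the export disk is getting full.  The 90% threshold is
# an arbitrary choice; adjust to taste.
disk_use_pct () {
    # $1 = mount point; prints the Use% figure without the '%'
    df -P "$1" | awk 'NR==2 {sub(/%/, "", $5); print $5}'
}

MOUNT=${MOUNT:-/mnt/data}
PCT=`disk_use_pct "$MOUNT" 2>/dev/null`
if [ "${PCT:-0}" -gt 90 ]
then
    echo "$MOUNT is ${PCT}% full, time to clean up exported files"
fi
```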
To actually delete files that have been exported and are good at SLAC:
for file in `BbkTMUser --dbsite=man --dbname stm1 --ms_taskname 'MAN-Task01-R24a?' --ms_proc_status=3 --distinct --quiet mfile_name`
do
  if [ -e /mnt/data/export$file ]
  then
    rm /mnt/data/export$file
  fi
done
(If you just want to produce a list first, replace the rm with echo.)
Another source of disk usage is merge files that have been retrieved off xrootd for export but then abandoned because another merge job has missing input data.
for file in `BbkTMUser --dbsite=man --dbname stm1 --ms_taskname 'MAN-Task01-R24a?' --mjob_status=6 --distinct --quiet mfile_name`
do
  if [ -e /mnt/data/export$file ]
  then
    rm /mnt/data/export$file
  fi
done
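The two loops above differ only in the BbkTMUser selection, so they can share one helper. A sketch that reads file names on stdin and defaults to a dry run (set DRYRUN=no to really delete); the helper and the DRYRUN knob are additions, the BbkTMUser calls are the ones above.

```shell
#!/bin/sh
# Sketch: remove exported copies of the file names read on stdin.  DRYRUN
# defaults to yes, in which case the rm commands are only echoed.
EXPORT=${EXPORT:-/mnt/data/export}
DRYRUN=${DRYRUN:-yes}

clean_exports () {
    while read file
    do
        if [ -e "$EXPORT$file" ]
        then
            if [ "$DRYRUN" = yes ]
            then
                echo rm "$EXPORT$file"
            else
                rm "$EXPORT$file"
            fi
        fi
    done
}

# exported-and-good-at-SLAC files:
BbkTMUser --dbsite=man --dbname stm1 --ms_taskname 'MAN-Task01-R24a?' --ms_proc_status=3 --distinct --quiet mfile_name 2>/dev/null | clean_exports
# abandoned merge files:
BbkTMUser --dbsite=man --dbname stm1 --ms_taskname 'MAN-Task01-R24a?' --mjob_status=6 --distinct --quiet mfile_name 2>/dev/null | clean_exports
```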
I suspect we should also be running something like this to remove the exported files from xrootd as well, since I think that step also fails.
Changing the Resource Broker the Grid Jobs are Submitted Via
Which Resource Broker is used is controlled by entries in a file passed to edg-job-submit via the --config-vo option. For production running (and other running, unless there is a need to change it) this should be the file /home/bbrprod/ui.conf, which should contain two uncommented lines like:
NSAddresses = {"lcgrb02.gridpp.rl.ac.uk:7772"};
LBAddresses = {{"lcgrb02.gridpp.rl.ac.uk:9000"}};
Both of these lines need to be edited to change the RB and should point to the same server (the port numbers [the bit after the ':'] are different though).
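Editing both lines by hand invites typos, so a sed one-liner that rewrites just the host part of each address can help. A sketch with a hypothetical helper; try it on a copy of ui.conf before touching the real file.

```shell
#!/bin/sh
# Sketch: point both the NSAddresses and LBAddresses lines of a ui.conf at
# a new broker, leaving the :7772 and :9000 ports alone.
set_rb () {
    # $1 = new RB host name, $2 = config file
    sed -i "s/\"[^\":]*:7772\"/\"$1:7772\"/; s/\"[^\":]*:9000\"/\"$1:9000\"/" "$2"
}

# e.g.
# set_rb lcgrb01.gridpp.rl.ac.uk /home/bbrprod/ui.conf
```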
You can get a list of possible RBs using the lcg-infosites command:
[bbrprod@pc20 workdir]$ lcg-infosites --vo babar rb
lcgrb02.gridpp.rl.ac.uk:7772
grid014.ct.infn.it:7772
gridrb.fe.infn.it:7772
rb-fzk.gridka.de:7772
gram://prod-rb-01.pd.infn.it:7772
gfe01.hep.ph.ic.ac.uk:7772
rb-2-fzk.gridka.de:7772
lcgrb01.gridpp.rl.ac.uk:7772
egee-rb-01.cnaf.infn.it:7772
rb-1-fzk.gridka.de:7772
We've generally used lcgrb01 and lcgrb02 at RAL, and egee-rb-01 at CNAF.