Running BaBar SP Validation Jobs at RAL

From GridPP Wiki

Building the Validation Jobs

Build the validation runs as you would any other type of run but with the -j valid option:

spbuild-grid --grid CE=<ce-name> -j valid --user <username> -V <n> <start-run>-<end-run>

for the grid, or

spbuild -j valid --user <username> <start-run>-<end-run>

for local production.

Note that each grid site has to be validated separately, so you'll need to generate and run the full set of validation runs for each site.
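Since every grid site needs its own full set of validation runs, it can help to print the build commands for all sites before running any of them. The following is only a sketch: print_valid_builds is a hypothetical helper, not part of the SP tools, and it assumes one CE name per site. It uses only the spbuild-grid options shown above and does not execute anything.

```shell
# Hypothetical dry-run helper: print one spbuild-grid command per grid site.
# Usage: print_valid_builds <username> <n> <start-run> <end-run> <ce-name>...
# Assumption: each site is addressed by a single CE name.
print_valid_builds() {
    user=$1; n=$2; first=$3; last=$4
    shift 4
    for ce in "$@"; do
        # Only echo the command so it can be checked before it is run.
        echo "spbuild-grid --grid CE=$ce -j valid --user $user -V $n $first-$last"
    done
}
```

Check the printed commands, then run them one site at a time.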

Running the Validation Jobs

Running the validation jobs is identical to running the production jobs, except that there is no merge step. Once you have run spsub -y <start-run>-<end-run> (and recovered and unpacked the output if you are running on the grid), you are ready to copy the log and histogram files to SLAC for comparison.

Transferring the Log and Histogram files to SLAC

First use the saveDir command to bundle up the necessary files into tarballs:

saveDir --nodb -t $ALLRUNS/../validation/ <start-run>-<end-run>

This creates a number of ~100MB tar files in $ALLRUNS/../validation/ each containing the output of a number of runs.

Transfer these files to SLAC and unpack them in the validation directory: $BFROOT/prod/log/validation/ral for RAL, or $BFROOT/prod/log/validation/uk-spgrid/<GRIDSITE> when running on the grid.

Space on the SLAC validation disk is limited, so we don't keep any old versions of the runs on disk. Remove the run directories from your target before you start. We do keep the tarballs on the RAL validation disk just in case.

I usually scp the files to /tmp/<username> on one of the norics and then run tar -xzf <tarfile> -C $BFROOT/prod/log/validation/ral to extract them.

For running on the grid the following snippet is quite useful for extracting a single version:

cd /tmp/<username>
for file in *.tar
do
# -T - tells tar to read the member names to extract from the pipe
tar -tf $file | grep -E 'V0<n>|status.txt' | tar -xf $file -C $BFROOT/prod/log/validation/uk-spgrid/<GRIDSITE> -T -
done 
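The same selective extraction can be wrapped in a small function so that the pattern and target directory are passed explicitly. This is a sketch only: extract_matching is a hypothetical name, it assumes a tar that accepts "-T -" to read member names from standard input (GNU tar does), and it assumes the version appears in the member paths as in the snippet above.

```shell
# Hypothetical helper: unpack only the members of a tarball that match a pattern.
# Usage: extract_matching <tarfile> <egrep-pattern> <target-dir>
# Assumption: tar supports "-T -" (read member names from stdin), e.g. GNU tar.
extract_matching() {
    tarfile=$1; pattern=$2; target=$3
    mkdir -p "$target" || return 1
    # List the archive, keep the matching names, and extract just those names.
    tar -tf "$tarfile" | grep -E "$pattern" | tar -xf "$tarfile" -C "$target" -T -
}
```

For example, extract_matching "$file" 'V0<n>|status.txt' $BFROOT/prod/log/validation/uk-spgrid/<GRIDSITE> inside the loop, with the placeholders filled in.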

Comparing the validation runs with the reference runs

This is where things can get a little complex. Comparing against the reference runs is itself quite easy:

validate.csh <first run> <last run> <site1> <version1> <site2> <version2>

Here <siten> is actually a path, so to compare a Grid site under uk-spgrid, use uk-spgrid/<GRIDSITE>.

This should produce nice clean output like:

   Checking run 5990265 : Number of discrepancies: 0
   Checking run 5990266 : Number of discrepancies: 0
   Checking run 5990267 : Number of discrepancies: 0
   Checking run 5990268 : Number of discrepancies: 0
...

But there are two problems. First, RAL and most of the grid sites have mixed Intel and AMD CPUs, and you have to validate on the same CPU type (runs from the wrong CPU type will have approx 2000 discrepancies). The only way I've found to do this is to validate first against slac and then, for the runs with about 2000 discrepancies, revalidate against slac_intel.

The second problem is that there are now some runs which produce a small number (fewer than twenty) of discrepancies even when validated on the same machine, so even after revalidating the non-zero runs you will still have a few discrepancies.
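Sorting the runs into these groups by eye gets tedious for long run ranges, so a small filter over the validate.csh output can help. The sketch below assumes the output lines look exactly like the sample shown earlier. classify_runs, the category labels, and the two thresholds (20 and 1000) are illustrative assumptions, not something validate.csh defines.

```shell
# Hypothetical helper: classify runs from validate.csh output read on stdin.
# Assumes lines like "Checking run 5990265 : Number of discrepancies: 0".
# SMALL_MAX and CPU_MIN are assumed thresholds; adjust them to taste.
classify_runs() {
    awk -v small="${SMALL_MAX:-20}" -v big="${CPU_MIN:-1000}" '
        /Checking run/ && /Number of discrepancies:/ {
            run = $3        # third field is the run number
            n = $NF + 0     # last field is the discrepancy count
            if (n == 0)         label = "ok"
            else if (n < small) label = "minor"        # small residual count
            else if (n >= big)  label = "revalidate"   # likely wrong CPU type
            else                label = "investigate"  # fits neither pattern
            print run, label
        }'
}
```

Pipe the output of validate.csh into classify_runs. The runs labelled "revalidate" are the candidates for a second pass against slac_intel.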

Once you are down to a few runs with fewer than about twenty discrepancies, post the results to the SimuProd hypernews so that Peter can see them. He will assign you production runs if he agrees that the validation is successful.

Chris brew 16:16, 24 Jul 2007 (BST)