Monte Carlo production

This page documents the steps used to submit Monte Carlo production jobs to the Manchester Tier2 centre using the Grid. It uses simulation/digitisation of ExhumeBB files as an example. It assumes that event generation (evgen) files have been produced locally. Other production tasks (e.g., Wγγ) also continue through to reconstruction.

Checkout MC production code

First, check out the Monte Carlo production code. To do this, look at (or create) your $HOME/.ssh/config file and check that it contains the following lines:

host *
ForwardAgent yes

Host isscvs.cern.ch isscvs
User "Your CERN username, i.e the username you use to login to lxplus"
Protocol 2
ForwardX11 yes
PasswordAuthentication no

Then at the command prompt, execute the following commands. Change the mcprodwork directory to whatever you like, and replace miyagawa with your own CERN username.

mkdir mcprodwork 
cd mcprodwork
export CVSROOT=:ext:miyagawa@isscvs.cern.ch:/local/reps/atman
export CVS_RSH=ssh
ssh-add
cvs co MCProd

This will create a directory mcprodwork/MCProd which contains the scripts to generate and run the Grid jobs. This section needs to be executed just once.

Generate files for Grid submission

This section needs to be completed for each new sample you want to simulate. It is recommended that you generate the files for Grid submission in a dedicated directory; this example uses exmore20 (name it as you wish).

cd mcprodwork/MCProd 
mkdir exmore20
cd src

exmore20.sh (rename it as you wish) is the script which generates the required files. Edit its variables to match the parameters of your job. In this example, the event generation files are located in /afs/hep.man.ac.uk/d/atlas-fp420/BJetEnergyScale/Generation/FSRMore/20GeV, and the digitisation files will be written to /afs/hep.man.ac.uk/d/atlas-fp420/BJetEnergyScale/digi/FSRMore/20GeV. The files will be registered on the Grid in a Logical File Catalogue (LFC) under /grid/atlas/users/miyagawa/exhumebb/fsrmore/20GeV. The main variables you may wish to modify are listed below; a sketch of a typical variable block follows the list:

  • lfchost: Logical File Catalogue (LFC) in which you wish to register your files on the Grid.
  • username: Setting this to your CERN username is known to work.
  • griddir: Directory path below /grid/atlas/users/miyagawa in the LFC.
  • evgendir: Directory path below /afs/hep.man.ac.uk/d/atlas-fp420/BJetEnergyScale/Generation in which the event generation files are stored locally.
  • digidir: Directory path below /afs/hep.man.ac.uk/d/atlas-fp420/BJetEnergyScale/digi in which the digitisation files are to be stored locally.
  • jdldir: Local directory (exmore20 in this example) in which to save the files for Grid submission.
  • cename: Computing element on the Grid on which you want to run the jobs.
  • nevgen: Number of event generation files. The files should be numbered from 1.
  • evgenhead: Name of the dataset. In this example, the evgen files are assumed to be of the form exhume.bbbar.20Gev.fsrmore.5000.1.pool.root.
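As an illustration, the variable block at the top of exmore20.sh might look something like the following. The names match the list above, but the values shown here (including the number of evgen files) are assumptions for this example; check the script itself for the definitive settings.

# Illustrative sketch only -- check exmore20.sh for the actual settings.
lfchost=lfc0448.gridpp.rl.ac.uk           # LFC in which to register the files
username=miyagawa                         # your CERN username
griddir=exhumebb/fsrmore/20GeV            # path below /grid/atlas/users/$username in the LFC
evgendir=FSRMore/20GeV                    # path below .../BJetEnergyScale/Generation on AFS
digidir=FSRMore/20GeV                     # path below .../BJetEnergyScale/digi on AFS
jdldir=exmore20                           # local directory for the Grid submission files
cename=ce02.tier2.hep.manchester.ac.uk:2119/jobmanager-lcgpbs-atlas   # target computing element
nevgen=2000                               # number of evgen files, numbered from 1
evgenhead=exhume.bbbar.20Gev.fsrmore.5000 # dataset name; files are $evgenhead.<n>.pool.root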

Once you have modified the file as you wish, generate the submission files:

exmore20.sh 
cd ../exmore20

There should be the following files:

  • upload.sh: Uploads the event generation files to the Grid.
  • run_simul.sh: The 'executable' for the Grid jobs.
  • submit.sh: Script to submit the Grid jobs.
  • sub0.txt: File containing the IDs of the jobs to submit.
  • Lots (2000 in this example) of jobX.jdl files. These set the parameters for each Grid job.
  • getstatus.sh: Script to check the status of the Grid jobs.
  • checkhits.sh: Script to check which hit files have been produced.
  • checkdigits.sh: Script to check which digit files have been produced.
  • getdigits.sh: Script to retrieve the digit files from the Grid to the local datadisk.

Grid setup

Each time you log in to a different machine, you need to execute the commands in this section to set up your Grid proxy. It is best to submit jobs from linux8.hep.man.ac.uk. Once you have logged in, set up your Grid proxy with a sufficient lifetime. For example, to get 100 hours and 42 minutes:

voms-proxy-init -voms atlas -valid 100:42
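
You can check how much lifetime remains on your proxy at any point (useful before a long submission) with:

voms-proxy-info --all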

If the environment variable VO_ATLAS_DEFAULT_SE is not already defined, you will need to set it before you can submit jobs:

export VO_ATLAS_DEFAULT_SE=$VO_DTEAM_DEFAULT_SE

Uploading evgen files

The first step is to upload the event generation files onto the Grid and register the files in a Logical File Catalogue (LFC). This needs to be done only once; once the files are registered in the catalogue, they will remain there. To do so, from the exmore20 directory:

upload.sh

That is all that needs to be done, and you can progress to the next section. The following subsections are included for further understanding of what is happening.
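
In essence, upload.sh performs the steps described in the following subsections in a loop over the event generation files. A rough sketch of that loop, assuming the variable names from exmore20.sh (the real script may differ in detail):

# Sketch of the upload loop only; variable names are assumptions, see upload.sh for the real script.
export LFC_HOST=$lfchost
lfc-mkdir -p /grid/atlas/users/$username/$griddir/evgen
i=1
while [ $i -le $nevgen ]; do
  file=$evgenhead.$i.pool.root
  lcg-cr -v --vo atlas -d dcache01.tier2.hep.manchester.ac.uk \
    -l lfn:/grid/atlas/users/$username/$griddir/evgen/$file \
    file:///afs/hep.man.ac.uk/d/atlas-fp420/BJetEnergyScale/Generation/$evgendir/$file
  i=$((i+1))
done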

Set Logical File Catalogue (LFC)

The catalogues currently used by Manchester folk are:

  • lfc0448.gridpp.rl.ac.uk
  • lfc-atlas-test.cern.ch
  • prod-lfc-atlas-local.cern.ch

The catalogue is set using the environment variable LFC_HOST:

export LFC_HOST=lfc0448.gridpp.rl.ac.uk

Create directories in LFC

Create the directory structure (specified by Andy) in the catalogue in which to save your files. Use your lxplus username (an arbitrary username may work, but this is untested); note that you cannot write to another user's area:

lfc-mkdir -p /grid/atlas/users/miyagawa/exhumebb/fsrmore/20GeV/evgen 
lfc-mkdir -p /grid/atlas/users/miyagawa/exhumebb/fsrmore/20GeV/simul
lfc-mkdir -p /grid/atlas/users/miyagawa/exhumebb/fsrmore/20GeV/digi
lfc-mkdir -p /grid/atlas/users/miyagawa/exhumebb/fsrmore/20GeV/log

If you are also doing reconstruction, these additional directories are needed:

lfc-mkdir /grid/atlas/users/miyagawa/exhumebb/fsrmore/20GeV/esd 
lfc-mkdir /grid/atlas/users/miyagawa/exhumebb/fsrmore/20GeV/aod
lfc-mkdir /grid/atlas/users/miyagawa/exhumebb/fsrmore/20GeV/ntuple

You can list the contents of any directory with the following command (note that wildcards do not work):

lfc-ls /grid/atlas/users/miyagawa/exhumebb/fsrmore/20GeV/evgen
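
Since wildcards do not work, you can instead pipe the output through standard tools; for example, to count how many evgen files are registered:

lfc-ls /grid/atlas/users/miyagawa/exhumebb/fsrmore/20GeV/evgen | wc -l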

Upload files into LFC

Finally, the files are copied to the Grid and entries are created in the LFC. The various options are:

  • --vo specifies the Virtual Organisation (VO);
  • -d specifies the destination storage element (here the Manchester dCache);
  • -l lfn: specifies the entry (logical file name) to create in the LFC;
  • file:// specifies the local file to be uploaded.

For example:
lcg-cr -v --vo atlas -d dcache01.tier2.hep.manchester.ac.uk \
  -l lfn:/grid/atlas/users/miyagawa/exhumebb/fsrmore/20GeV/evgen/exhume.bbbar.20Gev.fsrmore.5000.14.pool.root \
  file:///afs/hep.man.ac.uk/d/atlas-fp420/BJetEnergyScale/Generation/FSRMore/20GeV/exhume.bbbar.20Gev.fsrmore.5000.14.pool.root
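
If you want to check that a file has been copied and registered correctly, lcg-lr lists the physical replicas behind a logical file name:

lcg-lr --vo atlas lfn:/grid/atlas/users/miyagawa/exhumebb/fsrmore/20GeV/evgen/exhume.bbbar.20Gev.fsrmore.5000.14.pool.root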

Submitting simulation + digitisation jobs

Submitting the jobs takes a LONG time, so it is generally a good idea to refresh your AFS token before submitting.

The actual command to submit the jobs is simply:

submit.sh 0

The rest of this section is for explanation only.

The submission script

submit.sh takes one argument, in this case '0', which is expanded to use the input file sub0.txt. This file contains the numbers of the jdl files (jobxxx.jdl) to submit. For the 0th iteration of this example, all 2000 jdl files will be submitted. For future (ith) iterations, the numbers of the jdl files to be resubmitted are saved in subi.txt, and the jobs are resubmitted with the command:

submit.sh i

submit.sh loops over the numbers x in subi.txt and, for each number, issues the command:

edg-job-submit --config ../conf/ic2.conf --config-vo ../conf/atlas-ic2.conf -o jobIDi jobx.jdl
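
In other words, submit.sh amounts to something like the following loop (a sketch only, assuming subi.txt contains one job number per line; see the actual script for details):

# Sketch only; see submit.sh for the real script.
i=$1
for x in $(cat sub$i.txt); do
  edg-job-submit --config ../conf/ic2.conf --config-vo ../conf/atlas-ic2.conf \
    -o jobID$i job$x.jdl
done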

--config specifies the resource broker to use (the Grid middleware which decides where to send the job and takes care of logging), and --config-vo specifies the VO-specific configuration, i.e. the server which verifies your VO membership. Two pairs of configuration files are known to work, so you can switch between them; they are found in the directory MCProd/conf:

  • ic2.conf and atlas-ic2.conf
  • lcgrb02.conf and atlas-lcgrb02.conf

-o jobIDi saves the job identifiers (e.g., https://gfe01.hep.ph.ic.ac.uk:9000/zhv22aIhdcAMiG9AdSlx2w) in the file jobIDi. These identifiers are used to check the status of the jobs and retrieve logging information afterwards.

jobx.jdl is the file containing the job configuration.

The job configuration file

The jdl (job description language) file contains the job configuration for each Grid job. A typical jdl file contains:

Executable = "run_simul.sh";
Arguments = "GRID exhume.bbbar.20Gev.fsrmore.5000.1.pool.root exhume.bbbar.20Gev.fsrmore.5000.simul._00001.pool.root
  exhume.bbbar.20Gev.fsrmore.5000.digi._00001.pool.root 50 0 1 ATLAS-CSC-01-02-00
  exhume.bbbar.20Gev.fsrmore.5000.digi.log._00001.job.log.tgz dcache01.tier2.hep.manchester.ac.uk
  /grid/atlas/users/miyagawa/exhumebb/fsrmore/20GeV";
StdOutput = "job1-exhume.bbbar.20Gev.fsrmore.5000.simul.out";
StdError  = "job1-exhume.bbbar.20Gev.fsrmore.5000.simul.err";
InputSandbox  = {"run_simul.sh"};
OutputSandbox = {"job1-exhume.bbbar.20Gev.fsrmore.5000.simul.out","job1-exhume.bbbar.20Gev.fsrmore.5000.simul.err"};
Requirements = other.GlueCEUniqueID == "ce02.tier2.hep.manchester.ac.uk:2119/jobmanager-lcgpbs-atlas";
VirtualOrganisation = "atlas";

  • Executable is the program to run. In this example, we submit a script run_simul.sh.
  • Arguments are passed to the executable.
  • StdOutput specifies the file to which the standard output stream is redirected.
  • StdError specifies the file to which the standard error stream is redirected. It is sometimes convenient for this to be the same as StdOutput so that you can tell where in the output the error messages occurred.
  • InputSandbox specifies the files which should be sent with the job. In this example, the executable is sent. Other things that could be sent are libraries used by the executable. Note that the sandbox is limited to ~10 MB, which is why the event generation files (~20 MB each) cannot be sent via the sandbox.
  • OutputSandbox specifies the output files which should be retrieved on completion of the job. In this example, the log files are retrieved. Again, because the sandbox is limited in size, the digitisation files cannot be retrieved via the sandbox.
  • Requirements specifies any requirements of the job. This example specifies one of the computing elements (CE) at the Manchester Tier2. Other potential requirements include specifying that a site have a particular version of athena installed.
  • VirtualOrganisation specifies your VO.

The executable

I will not attempt to write a full explanation of run_simul.sh. Instead, I will outline the main commands.
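
The script receives its configuration through positional arguments (the Arguments line of the jdl file). Roughly, and with variable names that are assumptions based on how they appear in the commands below, the start of the script looks something like:

# Sketch of the argument handling only; the real run_simul.sh may differ.
# ($VER, $PATCH and $lfchost are presumably set elsewhere in the script.)
MODE=$1          # "GRID" in the jdl example
INPUTFILE=$2     # evgen file to simulate
SIMULFILE=$3     # output simulation (hits) file
DIGIFILE=$4      # output digitisation file
NUM_EVNT=$5      # number of events to process
START_EVNT=$6    # first event to process
RANDOMSEED=$7    # random number seed
GEOM=$8          # geometry tag, e.g. ATLAS-CSC-01-02-00
LOGFILE=$9       # tarball of log files to upload
dcacheloc=${10}  # storage element for the output
griddir=${11}    # LFC directory for the output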

Set up the athena environment

source $VO_ATLAS_SW_DIR/software/$VER/setup.sh 
export CMTPATH=$SITEROOT/AtlasProduction/$PATCH
source $CMTPATH/AtlasProductionRunTime/cmt/setup.sh

Download input (event generation) files from the LFC

export LFC_HOST=$lfchost 
lcg-cp --vo atlas lfn:$griddir/evgen/$INPUTFILE file://`pwd`/$INPUTFILE

lcg-cp copies the Grid file specified by lfn: to the local location file://.

Run the athena transform csc_simul_trf.py which does the simulation and digitisation

csc_simul_trf.py $INPUTFILE $SIMULFILE $DIGIFILE $NUM_EVNT $START_EVNT $RANDOMSEED $GEOM

Save output to the LFC

tar zcf $LOGFILE *.xml *.stdout *.stderr *.log 
lcg-cr -v --vo atlas -d $dcacheloc -l lfn:$griddir/log/$LOGFILE file://`pwd`/$LOGFILE
lcg-cr -v --vo atlas -d $dcacheloc -l lfn:$griddir/digi/$DIGIFILE file://`pwd`/$DIGIFILE
lcg-cr -v --vo atlas -d $dcacheloc -l lfn:$griddir/simul/$SIMULFILE file://`pwd`/$SIMULFILE

lcg-cr, which we saw in #Upload files into LFC, uploads the output (simulation, digitisation and log) files to the Grid and registers them in the LFC.

Monitoring progress of Grid jobs

To look at the progress of individual jobs:

edg-job-status -i jobIDi

where i matches the number used in subi.txt. The job will progress through several stages (some of the descriptions below are guesses):

  • Ready: The job has been prepared and is ready to be submitted to the resource broker.
  • Waiting: The job is waiting to be assigned by the resource broker to a computing element.
  • Scheduled: The job is scheduled to run on a worker node of the computing element.
  • Running: The job is running on the worker node.
  • Done (Success): The job completed "successfully" as defined by the Grid. The definition of success may not match yours.
  • Aborted: Something has gone wrong, so the job will need to be fixed (possibly) and resubmitted (definitely).

To have the status of all jobs listed at once:

getstatus.sh i

where i again matches subi.txt.
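
Once a job has reached Done, the files listed in its OutputSandbox (the .out and .err files) can be retrieved with edg-job-get-output, for example:

edg-job-get-output --dir ./sandbox -i jobIDi

The --dir option chooses the local directory in which the retrieved files are placed.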

As noted, even when the Grid claims that a job has completed successfully, it may in fact have been unsuccessful for our purposes because the output (digitisation) file was not saved. Also, sometimes the fact that a job has finished fails to be logged with the resource broker (possibly because the broker was down when the job finished); this results in the job claiming to be still running. It is therefore more reliable to check which output files have been saved to the LFC:

checkdigits.sh

This uses lfc-ls (see #Create directories in LFC) to list the digitisation files already registered in the LFC, checks which ones are still missing, and outputs the missing numbers. Once you are confident that all of the jobs have completed (successfully or not), you can resubmit the missing ones (where j is the number of the current iteration):

checkdigits.sh > subj.txt
submit.sh j
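
For reference, the logic inside checkdigits.sh is roughly the following (a sketch only; the variable names are assumptions borrowed from exmore20.sh):

# Sketch only; see checkdigits.sh for the real script.
registered=$(lfc-ls /grid/atlas/users/$username/$griddir/digi)
x=1
while [ $x -le $nevgen ]; do
  num=$(printf "%05d" $x)
  # print the job number if no matching digi file is registered
  echo "$registered" | grep -q "digi._${num}.pool.root" || echo $x
  x=$((x+1))
done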

There is also checkhits.sh, which checks which simulation files are missing. This is usually not used since it is the digitisation files which we are interested in.

Downloading digi files

Once you are satisfied that all (or enough) of your jobs have completed, you can download the digitisation files to the local datadisk. Again, this is one simple command:

getdigits.sh

This lists the digitisation files in the LFC using lfc-ls (see #Create directories in LFC), loops over the list to check which files have not yet been downloaded, and copies any that are still missing to the datadisk using lcg-cp (see #Download input (event generation) files from the LFC).
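
In sketch form (variable names are assumptions; see getdigits.sh for the real script):

# Sketch only; see getdigits.sh for the real script.
localdir=/afs/hep.man.ac.uk/d/atlas-fp420/BJetEnergyScale/digi/$digidir
for f in $(lfc-ls /grid/atlas/users/$username/$griddir/digi); do
  if [ ! -e $localdir/$f ]; then
    lcg-cp --vo atlas lfn:/grid/atlas/users/$username/$griddir/digi/$f file://$localdir/$f
  fi
done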

Once you have downloaded all digi files, you have finished the MC production.