Grid user crash course


Using the Grid

The information herein is summarized from the currently most comprehensive document for new Grid users, the gLite User Guide. That document is long, but it's worth knowing what it contains. A link to it can be found on this page: http://www.gridpp.ac.uk/deployment/users/

This is a crash course on using the grid. Hopefully, if you follow this course, your jobs won't crash. I assume from the start that you already have a configured User Interface (UI) machine at your disposal, although we have a tutorial on that too, should you need it.


Getting a grid certificate

Getting a grid certificate is a multi-step process, and is run by the UK e-science Certificate Authority, which is part of the National Grid Service. To start the process, you need to choose a web browser (e.g. Firefox) that you will have consistent access to (you need to use the same system for both requesting your certificate and retrieving it when it's ready). Don't use a temporary login anywhere.

Using that browser, visit https://ca.grid-support.ac.uk/ and select the 'Request a Certificate' option, then choose 'User Certificate', fill in your personal details (using your departmental email address) and select the appropriate Registration Authority (RA) for your site. Further instructions will then be emailed to you; once that has happened you should get a further email from the RA staff, and you'll then need to visit them in person with some photo ID.

Joining a VO

Your grid certificate identifies you to the grid as an individual user, but it's not enough on its own to allow you to run jobs; you also need to join a Virtual Organisation (VO). These are essentially just user groups, typically one per experiment, and individual grid sites can choose whether or not to support work by users of a particular VO. Most sites support the four LHC VOs; fewer support the smaller experiments. The sign-up procedures vary from VO to VO: UK ones typically require a manual approval step, and LHC ones require an active CERN account.

For anyone who is interested in using the grid but is not working on an experiment with an existing VO, some sites have a local VO that can be used to get started.

Running jobs

Lifecycle of a grid job

A grid job passes through quite a few systems between your writing it and a node finally running it. In outline the systems involved are:

  • The User Interface (UI); an interactive machine that you can log into directly to prepare your job before submitting it. Once it's ready you can send the job to:
  • A Workload Management System (WMS); one of a few central service machines (typically at RAL or CERN) that take the job and select a suitable destination cluster to run it on, depending on the job's requirements, available resources, and load on different sites. Once the WMS has selected a suitable cluster to run the job it sends it to:
  • A Computing Element (CE); which takes the job and submits it to that CE's local batch system, where the job may wait for a time before being dispatched to a:
  • Worker Node (WN); a fast system that calls back to the WMS, retrieves the job's input files and main script, and runs the job. When the job has completed, its output is sent back to the WMS, from where it can be retrieved by the user using:
  • The User Interface (UI) machine.
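
In practice each of these stages is driven by a handful of commands run from the UI. As a rough road map (each command is covered in detail in the worked example below; the VO name, delegation identifier and job URL are placeholders):

$ voms-proxy-init --voms <your-vo>           # create a short-lived proxy certificate
$ glite-wms-job-delegate-proxy -d <id>       # delegate the proxy to a WMS
$ glite-wms-job-submit -d <id> testjob.jdl   # the WMS matches the job to a suitable CE
$ glite-wms-job-status <job-url>             # follow the job through the CE, batch system and WN
$ glite-wms-job-output <job-url>             # retrieve the output sandbox back to the UI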

A minimal job

On a typical local batch system a user would create a simple script that runs their job, and submit that to the queuing system along with any particular resource requirements, for example:

$ qsub -q short -l cput=00:20:00,mem=3G ./myjob

On a grid the range of available systems to run on is much broader, and so describing the requirements can be more complicated. A grid job's requirements are split out into a separate file written in the Job Description Language (JDL); for example, a file called testjob.jdl might contain:

Executable = "testjob.sh";
StdOutput = "testjob.out";
StdError = "testjob.err";
InputSandbox = {"./testjob.sh"};
OutputSandbox = {"testjob.out","testjob.err"};
Requirements = other.GlueCEInfoHostName == "t2ce06.physics.ox.ac.uk";

This JDL file describes the files that are directly submitted to and retrieved from the grid (the 'sandbox' entries), and gives a single simple constraint that restricts the job to a particular computing element, and hence a particular cluster, in this case the Oxford cluster. The 'testjob.sh' script should be an executable that can run on the target Worker Node; in this case we'll use a minimal Bash script that will tell us when and where our job actually runs:

#!/bin/bash
date
id
/bin/hostname
sleep 30s

Before we can submit our job we must create a temporary 'proxy certificate' from our main grid certificate. This proxy is delegated to the WMS and sent along with the job to identify its owner, but unlike the main certificate it has a short lifetime (usually 12 hours from creation). You normally only need to do this once per session, not once per job. So a complete session submitting a job looks like:

Get a proxy:

$ voms-proxy-init --voms vo.southgrid.ac.uk
Enter GRID pass phrase:
Your identity: /C=UK/O=eScience/OU=Oxford/L=OeSC/CN=j bloggs
Creating temporary proxy ..................................... Done
Contacting voms.gridpp.ac.uk:15019 [/C=UK/O=eScience/OU=Manchester/L=HEP/CN=voms.gridpp.ac.uk/Email=hostmaster@hep.man.ac.uk] "vo.southgrid.ac.uk" Done
Creating proxy ............................................... Done
Your proxy is valid until Thu Jun 5 03:59:05 2008

This will select a VOMS server for you (there may be more than one serving your VO). Alternatively, follow this example if you wish to use a specific VOMS server:

$ ls -lrt /etc/vomses/atlas-*
-rw-r--r-- 1 root root 93 Jun  3 16:58 /etc/vomses/atlas-lcg-voms.cern.ch
-rw-r--r-- 1 root root 85 Jun  3 16:58 /etc/vomses/atlas-voms.cern.ch
-rw-r--r-- 1 root root 95 Jun  3 16:58 /etc/vomses/atlas-vo.racf.bnl.gov

$ voms-proxy-init --voms atlas --vomses /etc/vomses/atlas-vo.racf.bnl.gov 
Enter GRID pass phrase:
Your identity: /C=UK/O=eScience/OU=Liverpool/L=CSD/CN=stephen jones
Creating temporary proxy ...................................................... Done
Contacting  vo.racf.bnl.gov:15003 
...

Delegate a copy of it to the WMS:

$ glite-wms-job-delegate-proxy -d joebloggs
Connecting to the service
https://lcgwms02.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
================== glite-wms-job-delegate-proxy Success ==================
Your proxy has been successfully delegated to the WMProxy:
https://lcgwms02.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
with the delegation identifier: joebloggs

Submit the job:

$ glite-wms-job-submit -d joebloggs testjob.jdl
Connecting to the service
https://lcgwms02.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
====================== glite-wms-job-submit Success ======================
The job has been successfully submitted to the WMProxy
Your job identifier is:
https://lcglb01.gridpp.rl.ac.uk:9000/diKd5x0szH_6oWghTSWQLs

Check the job status:

$ glite-wms-job-status https://lcglb01.gridpp.rl.ac.uk:9000/diKd5x0szH_6oWghTSWQLs
*************************************************************
BOOKKEEPING INFORMATION:
Status info for the Job : https://lcglb01.gridpp.rl.ac.uk:9000/diKd5x0szH_6oWghTSWQLs
Current Status: Scheduled
Status Reason: Job successfully submitted to Globus
Destination: t2ce05.physics.ox.ac.uk:2119/jobmanager-lcgpbs-mediumfive
Submitted: Mon Oct 13 17:53:07 2008 BST
*************************************************************

This stage can be repeated until the job is done:

$ glite-wms-job-status https://lcglb01.gridpp.rl.ac.uk:9000/diKd5x0szH_6oWghTSWQLs
*************************************************************
BOOKKEEPING INFORMATION:
Status info for the Job : https://lcglb01.gridpp.rl.ac.uk:9000/diKd5x0szH_6oWghTSWQLs
Current Status: Done (Success)
Status Reason: Job terminated successfully
Destination: t2ce05.physics.ox.ac.uk:2119/jobmanager-lcgpbs-mediumfive
Submitted: Mon Oct 13 17:53:07 2008 BST
*************************************************************

Retrieve the job output:

$ glite-wms-job-output https://lcglb01.gridpp.rl.ac.uk:9000/diKd5x0szH_6oWghTSWQLs
Connecting to the service
https://lcgwms02.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
================================================================================
JOB GET OUTPUT OUTCOME
Output sandbox files for the job:
https://lcglb01.gridpp.rl.ac.uk:9000/diKd5x0szH_6oWghTSWQLs
have been successfully retrieved and stored in the directory:
/tmp/jobOutput/joebloggs_diKd5x0szH_6oWghTSWQLs
================================================================================

If we then look at the retrieved output files we can see the job's standard output and standard error, and by looking at the job output we can discover where it ran:

$ ls -l /tmp/jobOutput/joebloggs_diKd5x0szH_6oWghTSWQLs
-rw-r--r-- 1 joebloggs staff 0 Jun 4 16:22 testjob.err
-rw-r--r-- 1 joebloggs staff 720 Jun 4 16:23 testjob.out
$ cat /tmp/jobOutput/joebloggs_diKd5x0szH_6oWghTSWQLs/testjob.out
Wed Jun 4 16:03:18 BST 2008
uid=13001(stg001) gid=13000(southgrid) groups=13000(southgrid)
t2wn41.physics.ox.ac.uk

So, this job ran at just past four on a worker node in Oxford. Other useful job submission commands include:

  • glite-wms-job-list-match - This is a useful check before actually submitting a job. It asks the WMS for a list of CEs that could be sent the job. If no CEs are returned then the job's requirements are too tight.
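
For example, a minimal sketch of checking the testjob.jdl file above against the proxy delegated earlier (assuming the delegation identifier 'joebloggs'; the output, omitted here, is the list of matching CE identifiers):

$ glite-wms-job-list-match -d joebloggs testjob.jdl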

Data management

Data storage on the grid is based around Storage Element (SE) systems and Logical File Catalogues (LFCs). The former store the data; the latter keep track of where the data is located. There is typically one SE at each grid site, and one LFC per VO. Usually each site has a single SE with fast networking between it and the site's computational worker nodes. Jobs that access their data files using grid tools can run at one site while reading data stored at another, but doing so for large volumes of data is much slower than local access. Once a file is stored on an SE it can be replicated to other sites, allowing jobs to run with local access at multiple locations.

Finding your LFC

The grid information system advertises which LFC(s) support each VO, and you can query it for your VO using 'lcg-infosites':

$ lcg-infosites lfc --vo atlas
prod-lfc-atlas-central.cern.ch

Note: you can use lfc.gridpp.rl.ac.uk for the dteam VO in the UK.

Once you've got the LFC, you can set it in your environment to avoid specifying it each time:

export LFC_HOST=prod-lfc-atlas-central.cern.ch

LFCs have analogues of standard filesystem tools to query and manipulate them:

  • lfc-ls - show directory listings
  • lfc-mkdir - make new subdirectories
  • lfc-rm - remove LFC entries (note: this doesn't actually delete the files, it's only useful for tidying up old LFC entries)

LFCs have a typical directory structure that starts with /grid/ and has subdirectories for each VO:

$ lfc-ls /
grid
$ lfc-ls /grid
t2k.org
totalep
vo.southgrid.ac.uk
zeus

Unless you're using a location specified by your VO, you can create a subdirectory for your own files and set it as the default location for your LFC entries:

$ lfc-mkdir /grid/vo.southgrid.ac.uk/joebloggs
$ export LFC_HOME=/grid/vo.southgrid.ac.uk/joebloggs

Putting a local file on the grid

To upload a file from ordinary local storage to the grid, use the 'lcg-cr' command (cr = create replica):

$ lcg-cr -d t2se01.physics.ox.ac.uk --vo dteam -l testfile1 file://$(pwd)/inputfile

The parameters shown in this command are:

  • -d: Destination. This example specifies the Oxford Storage Element.
  • --vo: Your VO.
  • -l: Logical filename. An entry will be created in the LFC with this name, pointing to the newly created file.
  • input file: lcg-cr requires an absolute path to the local copy of the file to upload (e.g. /home/joebloggs/grid/inputfile); the use of the 'pwd' command here supplies the full path, assuming the file is in the current directory.

Checking it's there

$ lcg-ls -l lfn:testfile1
-rw-rw-r-- 1 19527 2688 52428800 lfn:/grid/dteam/joebloggs/testfile1
$ lcg-lr lfn:testfile1
srm://t2se01.physics.ox.ac.uk/dpm/physics.ox.ac.uk/home/dteam/generated/
2008-10-15/filee54dbfc5-aef4-47d2-8823-776540fb5cdd

Putting it somewhere else on the grid:

$ lcg-rep -d epgse1.ph.bham.ac.uk lfn:testfile1
$ lcg-lr lfn:testfile1
srm://epgse1.ph.bham.ac.uk/dpm/ph.bham.ac.uk/home/dteam/generated/
2008-10-15/file38f325e7-6358-4804-8c26-70515cdab4b6
srm://t2se01.physics.ox.ac.uk/dpm/physics.ox.ac.uk/home/dteam/generated/
2008-10-15/filee54dbfc5-aef4-47d2-8823-776540fb5cdd

Deleting one of those replicas:

$ lcg-del srm://t2se01.physics.ox.ac.uk/dpm/physics.ox.ac.uk/home/dteam/generated/2008-10-15/filee54dbfc5-aef4-47d2-8823-776540fb5cdd
$ lcg-lr lfn:testfile1
srm://epgse1.ph.bham.ac.uk/dpm/ph.bham.ac.uk/home/dteam/generated/
2008-10-15/file38f325e7-6358-4804-8c26-70515cdab4b6

Copying it back to the local system:

$ lcg-cp lfn:testfile1 ./outputfile

Finally, delete the remaining replica and its name entry in the LFC:

$ lcg-del -a lfn:testfile1
$ lcg-lr lfn:testfile1
[LFC] prod-lfc-shared-central.cern.ch: /grid/dteam/joebloggs/testfile1:
No such file or directory
lcg_lr: No such file or directory
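
Pulling the data management steps together, a complete round trip looks something like this minimal sketch (assuming the dteam VO, the SEs from the examples above, and a local file called inputfile in the current directory; the hostnames and the logical filename are simply the examples used on this page):

$ export LFC_HOST=prod-lfc-shared-central.cern.ch    # LFC used for dteam in the examples above
$ export LFC_HOME=/grid/dteam/joebloggs              # default LFC directory
$ lcg-cr -d t2se01.physics.ox.ac.uk --vo dteam -l testfile1 file://$(pwd)/inputfile
$ lcg-rep -d epgse1.ph.bham.ac.uk lfn:testfile1      # replicate to a second SE
$ lcg-lr lfn:testfile1                               # list all replicas
$ lcg-cp lfn:testfile1 ./outputfile                  # copy the file back locally
$ lcg-del -a lfn:testfile1                           # delete all replicas and the LFC entry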