Grid user crash course

== Using the Grid ==

The information herein is summarized from the currently most comprehensive document for new grid users, the gLite User Guide. That document is long, but it's worth knowing what it contains; a link to it can be found on this page: http://www.gridpp.ac.uk/deployment/users/

This is a crash course on using the grid. Hopefully, if you use this course, your jobs won't crash. I assume from the start that you already have a configured User Interface (UI) at your disposal, although we have a tutorial on that too, should you need it.

== Getting a grid certificate ==

Getting a grid certificate is a multi-step process, run by the UK e-Science Certificate Authority, which is part of the National Grid Service. To start the process you need to choose a web browser (e.g. Firefox) that you will have consistent access to: you need to use the same system for both requesting your certificate and retrieving it when it's ready, so don't use a temporary login anywhere.

Using that browser, visit https://ca.grid-support.ac.uk/ and select the 'Request a Certificate' option, then choose 'User Certificate', fill in your personal details (using your departmental email address) and select the appropriate Registration Authority (RA) for your site. Further instructions will then be emailed to you; once that has happened you should get a further email from the RA staff, and you'll then need to visit them in person, with some photo-id.
=== Converting your grid certificate ===

To convert the new certificate for use in grid jobs, use the openssl pkcs12 command to split the exported certificate into the '''certificate''' and '''key''' files you will need to generate proxies for grid work:

<pre>
$ cd ~/.globus
$ mv [your-cert-file] ./.
$ openssl pkcs12 -in [your-cert-file] -clcerts -nokeys -out usercert.pem
$ openssl pkcs12 -in [your-cert-file] -nocerts -out userkey.pem
</pre>

...where [your-cert-file] is the name and path of your exported certificate file, and usercert.pem and userkey.pem are the '''certificate''' and '''key''' files generated in the existing ~/.globus directory. ''In response to each command, you will be prompted for your grid password.''

You will then need to change permissions to protect the converted '''key''' file:

<pre>
$ chmod 600 userkey.pem
</pre>

''(Adapted from the instructions [https://www.racf.bnl.gov/docs/howto/grid/installcert here] - thanks!)''
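If you want to check that the conversion worked, you can inspect the new certificate file with openssl. This is just a quick sanity check, not part of the instructions above, and the subject and dates shown will of course be your own:

<pre>
# Show who the converted certificate identifies and when it expires
$ openssl x509 -in ~/.globus/usercert.pem -noout -subject -dates
</pre>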
 
== Joining a VO ==

Your grid certificate identifies you to the grid as an individual user, but it's not enough on its own to allow you to run jobs; you also need to join a Virtual Organisation (VO). These are essentially just user groups, typically one per experiment, and individual grid sites can choose to support (or not) work by users of a particular VO. Most sites support the four LHC VOs; fewer support the smaller experiments. The sign-up procedures vary from VO to VO: UK ones typically require a manual approval step, LHC ones require an active CERN account.

For anyone that's interested in using the grid, but is not working on an experiment with an existing VO, some sites have a local VO that can be used to get started.

* [[GridPP_approved_VOs|GridPP approved VOs]]

== Running jobs (this section is obsolete) ==

Please refer to the introduction to the [https://www.gridpp.ac.uk/wiki/Quick_Guide_to_Dirac GridPP DIRAC server] instead. The original gLite-based walkthrough is kept below for reference.
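For orientation, the DIRAC equivalent of the whole workflow below is only a handful of commands. This is a rough sketch, assuming the standard DIRAC client is installed and configured for your VO; see the linked guide for the authoritative instructions:

<pre>
# Rough DIRAC equivalent of the gLite workflow described below
$ dirac-proxy-init -g [your-dirac-group]
$ dirac-wms-job-submit testjob.jdl
$ dirac-wms-job-status [job-id]
$ dirac-wms-job-get-output [job-id]
</pre>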
=== Lifecycle of a grid job ===

A grid job passes through quite a few systems between your writing it and a node finally running it. In outline, the systems involved are (see the command summary after this list):

* The User Interface (UI): an interactive machine that you can log into directly to prepare your job before submitting it. Once it's ready you can send the job to:
* A Workload Management System (WMS): one of a few central service machines (typically at RAL or CERN) that take the job and select a suitable destination cluster to run it on, depending on the job's requirements, the available resources, and the load on different sites. Once the WMS has selected a suitable cluster to run the job it sends it to:
* A Computing Element (CE): which takes the job and submits it to that CE's local batch system, where the job may wait for a time before being submitted to a:
* Worker Node (WN): a fast system that calls back to the WMS, retrieves the job's input files and main script, and runs the job. When the job is completed the output is sent back to the WMS, from where it can be retrieved by the user, using:
* The User Interface (UI) machine again.
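As a quick orientation before the details, the lifecycle stages above map onto the gLite commands used in the rest of this section (each one is shown with its full output below):

<pre>
# On the UI: make a proxy, delegate it, then submit, monitor and retrieve the job
$ voms-proxy-init --voms vo.southgrid.ac.uk
$ glite-wms-job-delegate-proxy -d joebloggs
$ glite-wms-job-submit -d joebloggs testjob.jdl
$ glite-wms-job-status [job-id]
$ glite-wms-job-output [job-id]
</pre>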
=== A minimal job ===

On a typical local batch system a user would create a simple script that runs their job, and submit that to the queuing system along with any particular resource requirements, for example:

<pre>
$ qsub -q short -l cput=00:20:00,mem=3G ./myjob
</pre>

On a grid the range of available systems to run on is much broader, and so describing the requirements can be more complicated. A grid job's requirements are split out into a separate file written in the Job Description Language (JDL); e.g. testjob.jdl might contain:
<pre>
Executable = "testjob.sh";
StdOutput = "testjob.out";
StdError = "testjob.err";
InputSandbox = {"./testjob.sh"};
OutputSandbox = {"testjob.out","testjob.err"};
Requirements = other.GlueCEInfoHostName == "t2ce06.physics.ox.ac.uk";
</pre>

This JDL file describes the files that are directly submitted to and retrieved from the grid (the 'sandbox' entries), and gives a single simple constraint that restricts the job to a particular Computing Element, and hence a particular cluster, in this case the Oxford cluster.
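The Requirements expression can use any attribute published in the information system, not just the CE host name. As an illustrative sketch only, a looser constraint might simply ask for a queue long enough for the job; GlueCEPolicyMaxWallClockTime is a published GLUE attribute (in minutes), but check what your target sites actually publish before relying on it:

<pre>
// Hypothetical alternative: any CE whose queue allows at least 60 minutes of wall time
Requirements = other.GlueCEPolicyMaxWallClockTime >= 60;
</pre>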
 
The 'testjob.sh' script should be an executable that can run on the target Worker Node; in this case we'll use a minimal Bash script that will tell us when and where our job actually runs:

<pre>
#!/bin/bash
date
id
/bin/hostname
sleep 30s
</pre>
  
Before we can submit our job we must create a temporary 'proxy certificate' from our main grid certificate. This proxy is then delegated to the WMS and sent with the job to identify its ownership, but unlike the main certificate it has a short expiry date (usually 12 hours after it's created). You normally only need to do this once per session, not once per job. So a complete session submitting a job looks like this:
<b>Get a proxy:</b>
<pre>
$ voms-proxy-init --voms vo.southgrid.ac.uk
Enter GRID pass phrase:
Your identity: /C=UK/O=eScience/OU=Oxford/L=OeSC/CN=j bloggs
Creating temporary proxy ..................................... Done
Contacting voms.gridpp.ac.uk:15019
[/C=UK/O=eScience/OU=Manchester/L=HEP/CN=voms.gridpp.ac.uk/Email=hostmaster@hep.man.ac.uk] "vo.southgrid.ac.uk" Done
Creating proxy ............................................... Done
Your proxy is valid until Thu Jun 5 03:59:05 2008
</pre>
  
This will select a VOMS server to use (perhaps one of many). Alternatively, use an example like this if you wish to use a specific VOMS server:

<pre>
$ ls -lrt /etc/vomses/atlas-*
-rw-r--r-- 1 root root 93 Jun  3 16:58 /etc/vomses/atlas-lcg-voms.cern.ch
-rw-r--r-- 1 root root 85 Jun  3 16:58 /etc/vomses/atlas-voms.cern.ch
-rw-r--r-- 1 root root 95 Jun  3 16:58 /etc/vomses/atlas-vo.racf.bnl.gov

$ voms-proxy-init --voms atlas --vomses /etc/vomses/atlas-vo.racf.bnl.gov
Enter GRID pass phrase:
Your identity: /C=UK/O=eScience/OU=Liverpool/L=CSD/CN=stephen jones
Creating temporary proxy ...................................................... Done
Contacting vo.racf.bnl.gov:15003
...
</pre>
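Whichever VOMS server you use, you can inspect the resulting proxy at any time, and (where the VO allows it) request a longer lifetime when creating it. The flags below are the standard VOMS client options, shown as a sketch rather than as part of the original walkthrough:

<pre>
# Show the proxy's owner, VOMS attributes and remaining lifetime
$ voms-proxy-info --all

# Ask for a proxy valid for 24 hours instead of the default 12
$ voms-proxy-init --voms vo.southgrid.ac.uk --valid 24:00
</pre>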
  
<b>Delegate a copy of it to the WMS:</b>
<pre>
$ glite-wms-job-delegate-proxy -d joebloggs
Connecting to the service
https://lcgwms02.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server

================== glite-wms-job-delegate-proxy Success ==================
Your proxy has been successfully delegated to the WMProxy:
https://lcgwms02.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
with the delegation identifier: joebloggs
</pre>
<b>Submit the job:</b>
<pre>
$ glite-wms-job-submit -d joebloggs testjob.jdl
Connecting to the service
https://lcgwms02.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server

====================== glite-wms-job-submit Success ======================
The job has been successfully submitted to the WMProxy
Your job identifier is:
https://lcglb01.gridpp.rl.ac.uk:9000/diKd5x0szH_6oWghTSWQLs
</pre>
<b>Check the job status:</b>
<pre>
$ glite-wms-job-status https://lcglb01.gridpp.rl.ac.uk:9000/diKd5x0szH_6oWghTSWQLs

*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://lcglb01.gridpp.rl.ac.uk:9000/diKd5x0szH_6oWghTSWQLs
Current Status: Scheduled
Status Reason: Job successfully submitted to Globus
Destination: t2ce05.physics.ox.ac.uk:2119/jobmanager-lcgpbs-mediumfive
Submitted: Mon Oct 13 17:53:07 2008 BST
*************************************************************
</pre>
  
<b>This stage can be repeated until the job is done:</b>
<pre>
$ glite-wms-job-status https://lcglb01.gridpp.rl.ac.uk:9000/diKd5x0szH_6oWghTSWQLs

*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://lcglb01.gridpp.rl.ac.uk:9000/rmSxurTX8W6eXBqoYcl_8Q
Current Status: Done (Success)
Status Reason: Job terminated successfully
Destination: t2ce05.physics.ox.ac.uk:2119/jobmanager-lcgpbs-mediumfive
Submitted: Mon Oct 13 17:53:07 2008 BST
*************************************************************
</pre>
<b>Retrieve the job output:</b>
<pre>
$ glite-wms-job-output https://lcglb01.gridpp.rl.ac.uk:9000/diKd5x0szH_6oWghTSWQLs
Connecting to the service
https://lcgwms02.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server

================================================================================
JOB GET OUTPUT OUTCOME

Output sandbox files for the job:
https://lcglb01.gridpp.rl.ac.uk:9000/diKd5x0szH_6oWghTSWQLs
have been successfully retrieved and stored in the directory:
/tmp/jobOutput/joebloggs_diKd5x0szH_6oWghTSWQLs
================================================================================
</pre>
 
If we then look at the retrieved output files we can see the job's standard output and standard error, and by looking at the job output discover where it ran:

<pre>
$ ls -l /tmp/jobOutput/joebloggs_diKd5x0szH_6oWghTSWQLs
-rw-r--r-- 1 joebloggs staff 0 Jun 4 16:22 testJob.err
-rw-r--r-- 1 joebloggs staff 720 Jun 4 16:23 testJob.out

$ cat /tmp/jobOutput/joebloggs_diKd5x0szH_6oWghTSWQLs/testJob.out
Wed Jun 4 16:03:18 BST 2008
uid=13001(stg001) gid=13000(southgrid) groups=13000(southgrid)
t2wn41.physics.ox.ac.uk
</pre>
 
So, this job ran at just past four on a worker node in Oxford.

Other job submission commands:

* glite-wms-job-list-match - a useful check before actually submitting a job: it asks the WMS for a list of CEs that could be sent the job (see the sketch below). If no CEs are returned then the job's constraints are too tight.
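For example, using the delegation identifier and the JDL file from above, the matching CEs can be listed before submission. This is a sketch of the invocation only; the output is simply a list of CE identifiers:

<pre>
$ glite-wms-job-list-match -d joebloggs testjob.jdl
</pre>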
 
== Data management (this section is obsolete) ==

Data storage on the grid is now based around Storage Element (SE) systems and the DIRAC File Catalogue (DFC): the former store data, the latter keeps track of the locations of the data. Please see the introduction to [https://www.gridpp.ac.uk/wiki/Quick_Guide_to_Dirac GridPP DIRAC] for basic data management commands. The older LFC-based instructions are kept below for reference.

In the LFC-based setup, data storage is likewise based around Storage Element (SE) systems, with Logical File Catalogues (LFC) keeping track of the locations of the data. There is typically one SE at each grid site, and one LFC per VO. Usually each site will have a single SE with fast networking between it and the site's computational worker nodes. While jobs that access their data files using grid access tools can run at one site while accessing data stored at another, doing so for large volumes of data will be much slower than using local access. Once a file is stored on an SE it can be replicated to other sites, allowing jobs to run with local access at multiple locations.
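With DIRAC, the corresponding basic operations are handled by the dirac-dms-* tools. The lines below are only an illustrative sketch using the standard DIRAC client command names; consult the linked guide for exact usage and for the storage element names available to your VO:

<pre>
# Upload a local file to an SE and register it in the catalogue (arguments: LFN, local file, SE name)
$ dirac-dms-add-file [LFN] [local-file] [SE-name]

# Copy it back to local storage
$ dirac-dms-get-file [LFN]
</pre>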
=== Finding your LFC ===

The grid information system advertises which LFC(s) support each VO, and you can query it for your VO using 'lcg-infosites':

<pre>
$ lcg-infosites lfc --vo atlas
prod-lfc-atlas-central.cern.ch
</pre>
Note: you can use lfc.gridpp.rl.ac.uk for the UK dteam VO.

Once you've got the LFC, you can set it in your environment to avoid specifying it each time:

<pre>
export LFC_HOST=prod-lfc-atlas-central.cern.ch
</pre>

LFCs have analogues of the standard filesystem tools to query and manipulate them:
* lfc-ls - show directory listings
* lfc-mkdir - make new subdirectories
* lfc-rm - remove LFC entries (note: this doesn't actually delete the files, it's only useful for tidying up old LFC entries)
LFCs have a typical directory structure that starts with /grid/ and has subdirectories for each VO:

<pre>
$ lfc-ls /
grid
$ lfc-ls /grid
t2k.org
totalep
vo.southgrid.ac.uk
zeus
</pre>

Unless you're using a particular VO-specified location, you can create a subdirectory for your own files and set it to be the default location for your LFC entries:

<pre>
$ lfc-mkdir /grid/vo.southgrid.ac.uk/joebloggs
$ export LFC_HOME=/grid/vo.southgrid.ac.uk/joebloggs
</pre>
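Before uploading anything you will also need to choose a destination Storage Element. lcg-infosites can list the SEs that support your VO; this is a sketch, and the option order may vary between lcg-infosites versions:

<pre>
$ lcg-infosites se --vo dteam
</pre>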
 
=== Putting a local file on the grid ===

To upload a file from ordinary local storage to the grid, use 'lcg-cr' (cr = create replica):

<pre>
$ lcg-cr -d t2se01.physics.ox.ac.uk --vo dteam -l testfile1 file://$(pwd)/inputfile
</pre>

The parameters shown in this command are:
* -d: destination; this example specifies the Oxford Storage Element.
* --vo: your VO.
* -l: logical filename; an entry will be created in the LFC with this name, pointing to the newly created file.
* input file: lcg-cr requires an absolute path to the local copy of the file to upload (e.g. /home/joebloggs/grid/inputfile); the use of $(pwd) here supplies the full path, assuming the file is in the current directory. A variant using a fully qualified logical filename is sketched below.
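If you prefer not to rely on LFC_HOME, you can spell out the full logical filename explicitly. This is a sketch of the same upload with an absolute LFN path, reusing the SE and paths from the example above:

<pre>
$ lcg-cr -d t2se01.physics.ox.ac.uk --vo dteam \
    -l lfn:/grid/dteam/joebloggs/testfile1 file:///home/joebloggs/grid/inputfile
</pre>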
 
=== Checking it's there ===

<pre>
$ lcg-ls -l lfn:testfile1
-rw-rw-r-- 1 19527 2688 52428800 lfn:/grid/dteam/joebloggs/testfile1
$ lcg-lr lfn:testfile1
srm://t2se01.physics.ox.ac.uk/dpm/physics.ox.ac.uk/home/dteam/generated/2008-10-15/filee54dbfc5-aef4-47d2-8823-776540fb5cdd
</pre>
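You can also look up the file's GUID, a site-independent identifier that stays the same however many replicas exist. lcg-lg is the lookup tool in the same lcg-utils family (shown as a sketch; it prints a guid: URI for the entry):

<pre>
$ lcg-lg lfn:testfile1
</pre>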
Putting it somewhere else on the grid, by replicating it to a second SE and then listing both replicas:

<pre>
$ lcg-rep -d heplnx204.pp.rl.ac.uk lfn:testfile1
$ lcg-lr lfn:testfile1
srm://epgse1.ph.bham.ac.uk/dpm/ph.bham.ac.uk/home/dteam/generated/2008-10-15/file38f325e7-6358-4804-8c26-70515cdab4b6
srm://t2se01.physics.ox.ac.uk/dpm/physics.ox.ac.uk/home/dteam/generated/2008-10-15/filee54dbfc5-aef4-47d2-8823-776540fb5cdd
</pre>
Deleting one of those replicas:

<pre>
$ lcg-del srm://t2se01.physics.ox.ac.uk/dpm/physics.ox.ac.uk/home/dteam/generated/2008-10-15/filee54dbfc5-aef4-47d2-8823-776540fb5cdd
$ lcg-lr lfn:testfile1
srm://epgse1.ph.bham.ac.uk/dpm/ph.bham.ac.uk/home/dteam/generated/2008-10-15/file38f325e7-6358-4804-8c26-70515cdab4b6
</pre>
Copying it back to the local system:

<pre>
$ lcg-cp lfn:testfile1 ./outputfile
</pre>
Finally, deleting the last replica and the name entry in the LFC, after which the logical filename no longer resolves:

<pre>
$ lcg-del -a lfn:testfile1
$ lcg-lr lfn:testfile1
[LFC] prod-lfc-shared-central.cern.ch: /grid/dteam/joebloggs/testfile1: No such file or directory
lcg_lr: No such file or directory
</pre>