Plan NGS Under EDGRB
Contents
- 1 Introduction
- 2 Get Authorised
- 3 Submit A Job With No Matchmaking
- 4 Configure UI for NGS VO
- 5 Check GIIS/GRIS endpoint for RAL NGS node
- 6 Add RAL NGS Node to GOCDB
- 7 Add RAL NGS Node to T1 Top BDII
- 8 Problems With Published Information
- 9 Submit a Job With Matchmaking
- 10 Multiple CPU Jobs and Routing Queues
- 11 Check A Second Site
- 12 Enable all NGS Users to Submit to Two Sites
- 13 Review
Introduction
The aim is to allow job submission to NGS sites via the EDG RB. The steps below came out of a meeting held to work out how to get this going. We will start by submitting jobs through the RAL T1 RB to the [RAL NGS] cluster.
Get Authorised
- We need to have Kevin authorised everywhere.
- Kevin send Steve his DN.
- Add Kevin's DN locally to lcgrb01.gridpp.rl.ac.uk.
- Kevin [request an account] on a RAL user interface.
Completed
Submit A Job With No Matchmaking
- Kevin log into lcgui01.gridpp.rl.ac.uk
- Kevin create a JDL file helloworld.jdl
    Executable = "/bin/hostname";
    #Arguments = "1m";
    StdOutput = "hello.out";
    StdError = "hello.err";
    #InputSandbox = {"/usr/lib/mozilla-1.0.2/mozilla-bin"} ;
    OutputSandbox = {"hello.out", "hello.err"};
- Kevin submit the job
    $ edg-job-submit --vo dteam -r grid-data.rl.ac.uk/jobmanager-pbs-cpu1 \
        helloworld.jdl
    https://lcgrb01.gridpp.rl.ac.uk:9000/ldNsk7qKq-Dfkwtw_MTxYg
- Kevin check the status and fetch the output.
    $ edg-job-status https://lcgrb01.gridpp.rl.ac.uk:9000/ldNsk7qKq-Dfkwtw_MTxYg
    $ edg-job-get-output https://lcgrb01.gridpp.rl.ac.uk:9000/ldNsk7qKq-Dfkwtw_MTxYg
Completed
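The create, submit, status and fetch steps above can be strung together in a small script. This is only a sketch: the file name helloworld.jobid and the helper functions write_jdl and extract_job_id are made up for illustration, and it assumes the submit output contains the job identifier as an https URL, as in the example above. The grid commands are skipped if the EDG UI tools are not installed.

```shell
#!/bin/sh
# Sketch only. JDL_FILE, JOBID_FILE, write_jdl and extract_job_id are
# illustrative names, not part of the EDG tools.
JDL_FILE=helloworld.jdl
JOBID_FILE=helloworld.jobid

# Write the hello-world JDL, one attribute per line.
write_jdl() {
    cat > "$1" <<'EOF'
Executable = "/bin/hostname";
StdOutput = "hello.out";
StdError = "hello.err";
OutputSandbox = {"hello.out", "hello.err"};
EOF
}

# Read command output on stdin and print the first https:// URL in it
# (assumed here to be the job identifier).
extract_job_id() {
    grep -o 'https://[^ ]*' | head -n 1
}

write_jdl "$JDL_FILE"

# Only contact the RB when the EDG UI commands are available.
if command -v edg-job-submit >/dev/null 2>&1; then
    edg-job-submit --vo dteam -r grid-data.rl.ac.uk/jobmanager-pbs-cpu1 \
        "$JDL_FILE" | extract_job_id > "$JOBID_FILE"
    job_id=$(cat "$JOBID_FILE")
    edg-job-status "$job_id"
    # Fetching only makes sense once the status is Done.
    edg-job-get-output "$job_id"
fi
```

Keeping the job identifier in a file saves pasting the long URL into each later command.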
Configure UI for NGS VO
This is not strictly needed, but it is easier if Steve creates the default files for the NGS VO on lcgui01.gridpp.rl.ac.uk, since that makes the command lines a lot shorter.
Completed Steve traylen 13:11, 31 Jan 2006 (GMT)
Check GIIS/GRIS endpoint for RAL NGS node
The contact string given for the RAL site was
ldap://ngsinfo.grid-support.ac.uk:2135/Mds-Host-hn=grid-data.rl.ac.uk,mds-vo-name=ngsinfo,o=grid
When looking at this it became apparent that the NGS BDII is looking at the CE GRISes and not at site GIISes. The consequence is that it is impossible to LDAP query just one site's information. This is a problem since it stops us adding one site at a time. This is being looked into now and progress has already been made; just some changes on the BDII are needed now, I expect. This needs to be fixed before the site can be added to the GOCDB.
The changes have now been made and we have
    ldapsearch -x -H ldap://ngsinfo.grid-support.ac.uk:2135 \
        -b 'mds-vo-name=ral,Mds-vo-name=ngsinfo,o=grid'
displaying just information about RAL.
Completed Steve traylen 14:57, 31 Jan 2006 (GMT)
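For checking other sites the same way, the search base can be built from the site name. A minimal sketch, assuming each site sits under the ngsinfo tree exactly as RAL does; site_base and list_site_dns are hypothetical helper names, and list_site_dns needs the OpenLDAP ldapsearch client and network access to the NGS BDII.

```shell
# Sketch: query the NGS BDII for one site's subtree.
NGS_BDII=ldap://ngsinfo.grid-support.ac.uk:2135

# Build the search base for a single site under the ngsinfo tree.
site_base() {
    printf 'mds-vo-name=%s,Mds-vo-name=ngsinfo,o=grid\n' "$1"
}

# List only the DNs published under one site.
# Not called here, since it needs the live BDII.
list_site_dns() {
    ldapsearch -x -LLL -H "$NGS_BDII" -b "$(site_base "$1")" dn
}

site_base ral   # prints mds-vo-name=ral,Mds-vo-name=ngsinfo,o=grid
```

Running list_site_dns ral should then return entries for RAL only, if the per-site GIIS registration is working.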
Add RAL NGS Node to GOCDB
Once we can ldap query just one site in the NGS BDII we can add this site to the GOCDB as an uncertified site. From this we will get gstat to sanity check the information being published. This will be a big help before we move onto the next steps.
There is now an entry in the GOCDB but it is yet to show in the gstat test page. Hopefully it should just turn up...
The RAL NGS node now has its own gstat page. This has shown up some problems with the information system, some of which need fixing.
Completed Steve traylen 17:15, 31 Jan 2006 (GMT)
Add RAL NGS Node to T1 Top BDII
Fix the RAL Tier1 BDII so that it includes the RAL NGS CE.
    $ ldapsearch -x -H ldap://lcgbdii02.gridpp.rl.ac.uk:2170 \
        -b 'Mds-vo-name=RAL,Mds-vo-name=ngsinfo,Mds-vo-name=local,o=Grid'
now shows results as it should. This is the BDII the T1 RB is configured to use.
Completed Steve traylen 15:46, 1 Feb 2006 (GMT)
This has been done in a dirty way for the moment and will be lost with the BDII upgrade due on the 6th February, though it is easy to add back again. It will also change once RAL becomes RAL-NGS as it should be, see below. How to make the addition in a non-dirty way has been added to the discussion section at the end.
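Since the entry will be lost at the upgrade, a quick check that the CE is still visible in the top BDII may be handy. The sketch below assumes the CE is published as a GlueCE object with a GlueCEUniqueID attribute (Glue schema naming) and uses the CE identifier that appears in the matchmaking output later on this page; ce_ids and has_ce are hypothetical helper names.

```shell
# Sketch: check whether a CE shows up in LDIF output from the top BDII.
TOP_BDII=ldap://lcgbdii02.gridpp.rl.ac.uk:2170
BASE='Mds-vo-name=RAL,Mds-vo-name=ngsinfo,Mds-vo-name=local,o=Grid'

# Read LDIF on stdin and print each GlueCEUniqueID value once.
ce_ids() {
    sed -n 's/^GlueCEUniqueID: *//p' | sort -u
}

# Succeed if the given CE id appears in the LDIF on stdin.
has_ce() {
    ce_ids | grep -Fqx "$1"
}

# Against the live BDII (not run here):
#   ldapsearch -x -LLL -H "$TOP_BDII" -b "$BASE" \
#       '(objectClass=GlueCE)' GlueCEUniqueID \
#     | has_ce grid-data.rl.ac.uk:2119/jobmanager-pbs-router && echo present
```

Very long attribute values are line-wrapped in LDIF output, which this simple sed match does not handle; the CE identifier here is short enough not to wrap.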
Problems With Published Information
These are the problems identified with the information being published which need to be corrected. Some of these are shown on the RAL NGS gstat page. We will concentrate on the essential ones first and look at the cosmetic ones later.
SubClusters
There are no subclusters being published.
SiteName
Tradition states that the GOCDB entry and the Mds-vo-name of a site should be the same. Basically the site name should be unique and consistent in as many places as possible.
The search string should become
    ldapsearch -x -H ldap://ngsinfo.grid-support.ac.uk:2135 \
        -b mds-vo-name=RAL-MDS,Mds-vo-name=ngsinfo,o=grid
Currently it is just mds-vo-name=ral. The change makes sense considering the existing Tier1 and Tier2 sites at RAL. It requires changes to
- Site GIIS name on NGS RAL CE.
- Registration of existing CE GRIS to the new CE GIIS name.
- Change of the NGS BDII config file to pick up the new site parameter.
Submit a Job With Matchmaking
Try the above JDL file and try to match it against resources
$ edg-job-list-match --vo ngs helloworld.jdl
It works!!!!
    Selected Virtual Organisation name (from --vo option): ngs
    Connecting to host lcgrb01.gridpp.rl.ac.uk, port 7772
    ***************************************************************************
    COMPUTING ELEMENT IDs LIST
    The following CE(s) matching your job requirements have been found:
    *CEId*
    grid-data.rl.ac.uk:2119/jobmanager-pbs-router
    ***************************************************************************
Try submitting the job.
$ edg-job-submit --vo ngs helloworld.jdl
It works !!!!
    *************************************************************
    BOOKKEEPING INFORMATION:
    Status info for the Job : https://lcgrb01.gridpp.rl.ac.uk:9000/mUXSJNtIBiOfIyMG9BstLg
    Current Status: Done (Success)
    Exit code: 0
    Status Reason: Job terminated successfully
    Destination: grid-data.rl.ac.uk:2119/jobmanager-pbs-router
    reached on: Wed Feb 1 16:11:18 2006
    *************************************************************
I am surprised it works with no subclusters. With no subcluster present, if
you try matching on anything that would normally be in a subcluster, like a
memory requirement, then it fails.
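To reproduce the failing case, a memory requirement can be added to the hello-world JDL and run through list-match. This is a sketch under assumptions: the file name hello-mem.jdl and the 512 MB threshold are made up, and GlueHostMainMemoryRAMSize is used as the usual Glue schema attribute that lives in a subcluster. With no subclusters published, the match would be expected to come back empty.

```shell
# Sketch: hello-world JDL with a memory requirement added.
# hello-mem.jdl and the 512 threshold are illustrative.
cat > hello-mem.jdl <<'EOF'
Executable = "/bin/hostname";
StdOutput = "hello.out";
StdError = "hello.err";
OutputSandbox = {"hello.out", "hello.err"};
Requirements = other.GlueHostMainMemoryRAMSize >= 512;
EOF

# Only contact the RB when the EDG UI commands are available.
if command -v edg-job-list-match >/dev/null 2>&1; then
    edg-job-list-match --vo ngs hello-mem.jdl
fi
```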
Multiple CPU Jobs and Routing Queues
This can now be looked into. The important extra lines for the JDL are
    JobType = "MPICH";
    NodeNumber = 10;
This is causing problems. It appears that when the RB is given an MPICH job it submits a Globus RSL with 'jobtype=single' containing an MPI launch, rather than sending the job as 'jobtype=mpi'. There is no way to submit a jobtype=mpi job with the RB. This rather puts MPI submissions on hold for the time being.
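The difference between the two job types comes down to one RSL attribute. The sketch below just builds the two strings so they can be compared or handed to a GRAM client by hand; rsl is a hypothetical helper, and the executable and count are placeholders. The exact RSL the RB generates is not shown in these notes, so the first string only illustrates the jobtype value, not the RB's real output.

```shell
# Sketch: build a GRAM RSL string from executable, CPU count and jobtype.
rsl() {
    printf '&(executable=%s)(count=%s)(jobtype=%s)\n' "$1" "$2" "$3"
}

# jobtype the RB appears to use for MPICH jobs, per the notes above
rsl /bin/hostname 10 single
# jobtype the notes say would be needed instead
rsl /bin/hostname 10 mpi
```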
Check A Second Site
All of the above should now be tried with a second site within NGS. Choose the site where it is easiest to get the above working.
Enable all NGS Users to Submit to Two Sites
- Steve enable the NGS VO on the RAL Tier1 RB. This is easy but will take a few days.
- Kevin install the EDG edg-job-* commands somewhere that NGS users have access to.
- Check a second fresh user can do something.
Review
At this point a review will be needed. Things that need discussing:
- Will NGS sites ever be equivalent in EGEE to say Durham?
- Can only be done if NGS sites are certified by UK/I ROC.
- Requires monitoring from central CICs and so these users must be authorised. Recently the ops VO has been set up by the CIC to massively reduce the number of users needed, compared with the dteam VO.