Plan NGS Under EDGRB

Introduction

The aim is to allow job submission to NGS sites via the EDG RB. The steps below came out of a meeting held to work out how to get this going. We will start by submitting jobs through the RAL T1 RB to the [RAL NGS] cluster.

Get Authorised

Completed

Submit A Job With No Matchmaking

  • Kevin log into lcgui01.gridpp.rl.ac.uk
  • Kevin create a JDL file helloworld.jdl
 Executable = "/bin/hostname";
 #Arguments = "1m";
 StdOutput = "hello.out";
 StdError = "hello.err";
 #InputSandbox = {"/usr/lib/mozilla-1.0.2/mozilla-bin"} ;
 OutputSandbox = {"hello.out",  "hello.err"};
  • Kevin submit the job
 $ edg-job-submit --vo dteam -r grid-data.rl.ac.uk/jobmanager-pbs-cpu1 \
        helloworld.jdl
 https://lcgrb01.gridpp.rl.ac.uk:9000/ldNsk7qKq-Dfkwtw_MTxYg
  • Kevin check the status and fetch the output.
 $ edg-job-status https://lcgrb01.gridpp.rl.ac.uk:9000/ldNsk7qKq-Dfkwtw_MTxYg
 $ edg-job-get-output https://lcgrb01.gridpp.rl.ac.uk:9000/ldNsk7qKq-Dfkwtw_MTxYg
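
The submit, status and fetch steps above can be chained in a small script. This is only a sketch: the helper names extract_job_id and job_finished, the RUN_EDG_DEMO switch and the 30 second polling interval are assumptions made for illustration, and the status names matched are the usual final EDG job states.

```shell
#!/bin/sh
# extract_job_id: print the first https://host:port/id URL found on stdin.
extract_job_id() {
    grep -o 'https://[^[:space:]]*' | head -n 1
}

# job_finished: succeed when the status text on stdin reports a final state.
job_finished() {
    grep -Eq 'Current Status:[[:space:]]*(Done|Aborted|Cleared|Cancelled)'
}

# Only contact the RB when explicitly asked (RUN_EDG_DEMO=1, a hypothetical
# switch) and when the EDG UI tools are installed.
if [ "${RUN_EDG_DEMO:-0}" = 1 ] && command -v edg-job-submit >/dev/null 2>&1; then
    job_id=$(edg-job-submit --vo dteam \
        -r grid-data.rl.ac.uk/jobmanager-pbs-cpu1 helloworld.jdl | extract_job_id)
    if [ -n "$job_id" ]; then
        # Poll until the job reaches a final state (interval is an assumption).
        until edg-job-status "$job_id" | job_finished; do
            sleep 30
        done
        edg-job-get-output "$job_id"
    else
        echo "no job identifier found in edg-job-submit output" >&2
    fi
fi
```

Run it on the UI as RUN_EDG_DEMO=1 sh submit-hello.sh (the script name is arbitrary); without the switch it only defines the two helpers.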

Completed

Configure UI for NGS VO

Not strictly needed, but it is easier if Steve creates the default files for the NGS VO on lcgui01.gridpp.rl.ac.uk. It just makes the command lines a lot shorter.

Completed Steve traylen 13:11, 31 Jan 2006 (GMT)

Check GIIS/GRIS endpoint for RAL NGS node

The contact string given for the RAL site was

 ldap://ngsinfo.grid-support.ac.uk:2135/Mds-Host-hn=grid-data.rl.ac.uk,mds-vo-name=ngsinfo,o=grid

When looking at this it became apparent that the NGS BDII is looking at the CE GRISes and not at the site GIISes. The consequence is that it is impossible to ldap query just one site's information. This is a problem since it stops us adding one site at a time. This is being looked into now and progress has already been made; only some changes on the BDII should still be needed. This needs to be fixed before the site can be added to the GOCDB.

The changes have now been made and we have

 ldapsearch -x -H ldap://ngsinfo.grid-support.ac.uk:2135 \
   -b 'mds-vo-name=ral,Mds-vo-name=ngsinfo,o=grid'

displaying just information about RAL.

Completed Steve traylen 14:57, 31 Jan 2006 (GMT)

Add RAL NGS Node to GOCDB

Once we can ldap query just one site in the NGS BDII we can add this site to the GOCDB as an uncertified site. From this we will get gstat to sanity check the information being published. This will be a big help before we move onto the next steps.

Now there is an entry in the GOCDB, but it has yet to show up on the gstat test page. Hopefully it should just turn up...

The RAL NGS node now has its own gstat page. This has shown some problems with the information system, some of which need fixing.

Completed Steve traylen 17:15, 31 Jan 2006 (GMT)

Add RAL NGS Node to T1 Top BDII

Fix the RAL Tier1 BDII so that it includes the RAL NGS CE.

   $ ldapsearch -x -H ldap://lcgbdii02.gridpp.rl.ac.uk:2170 \
             -b 'Mds-vo-name=RAL,Mds-vo-name=ngsinfo,Mds-vo-name=local,o=Grid' 

now shows results as it should. This is the BDII the T1 RB is configured to use.

Completed Steve traylen 15:46, 1 Feb 2006 (GMT)

This has been done in a dirty way for the moment and will be lost with the BDII upgrade due on the 6th February, though it is easy to add back again. It will also change once RAL becomes RAL-NGS, as it should be; see below. How to make its addition non-dirty has been added to the discussion section at the end.

Problems With Published Information

These are the problems identified with the information being published which need to be corrected. Some of these are shown on the RAL NGS gstat page. We will concentrate on the essential ones first and look at the cosmetic ones later.

SubClusters

There are no subclusters being published.
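
One way to confirm this is to ask the NGS BDII for GlueSubCluster objects under the RAL branch and count what comes back. This is a sketch that assumes the standard Glue schema names (objectClass GlueSubCluster, attribute GlueSubClusterUniqueID); count_entries and the RUN_LDAP_DEMO switch are hypothetical names used only here.

```shell
#!/bin/sh
# count_entries: count LDIF entries (lines starting with "dn:") on stdin.
# grep -c exits non-zero when the count is 0, so mask that with "|| true".
count_entries() {
    grep -c '^dn:' || true
}

# Only query the BDII when explicitly asked (RUN_LDAP_DEMO=1, a hypothetical
# switch) and when ldapsearch is installed.
if [ "${RUN_LDAP_DEMO:-0}" = 1 ] && command -v ldapsearch >/dev/null 2>&1; then
    n=$(ldapsearch -x -LLL -H ldap://ngsinfo.grid-support.ac.uk:2135 \
        -b 'mds-vo-name=ral,Mds-vo-name=ngsinfo,o=grid' \
        '(objectClass=GlueSubCluster)' GlueSubClusterUniqueID | count_entries)
    echo "GlueSubCluster entries published: $n"
fi
```

A count of 0 matches the problem described here; once subclusters are published the count should be at least 1.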

SiteName

Tradition states that the GOCDB entry and the Mds-vo-name of the site should be the same. Basically the site name should be unique and consistent in as many places as possible.

The search string should become

  ldapsearch -x -H ldap://ngsinfo.grid-support.ac.uk:2135 \
       -b mds-vo-name=RAL-MDS,Mds-vo-name=ngsinfo,o=grid

Currently it is just mds-vo-name=ral. The change makes sense considering the existing Tier1 and Tier2 sites at RAL. It requires changes to

  • Site GIIS name on NGS RAL CE.
  • Registration of existing CE GRIS to the new CE GIIS name.
  • Change of the NGS BDII config file to pick up the new site parameter.

Submit a Job With Matchmaking

Take the above JDL file and try to match it against resources

 $ edg-job-list-match --vo ngs helloworld.jdl

It works!!!!

 Selected Virtual Organisation name (from --vo option): ngs
 Connecting to host lcgrb01.gridpp.rl.ac.uk, port 7772
 
 ***************************************************************************
                        COMPUTING ELEMENT IDs LIST
  The following CE(s) matching your job requirements have been found:
  
                   *CEId*
  grid-data.rl.ac.uk:2119/jobmanager-pbs-router
 ***************************************************************************

Try submitting the job.

 $ edg-job-submit --vo ngs helloworld.jdl

It works !!!!

 *************************************************************
 BOOKKEEPING INFORMATION:
 
 Status info for the Job : https://lcgrb01.gridpp.rl.ac.uk:9000/mUXSJNtIBiOfIyMG9BstLg
 Current Status:     Done (Success)
 Exit code:          0
 Status Reason:      Job terminated successfully
 Destination:        grid-data.rl.ac.uk:2119/jobmanager-pbs-router
 reached on:         Wed Feb  1 16:11:18 2006
 *************************************************************

I am surprised it works with no subclusters. With no subcluster present, if you try matching on anything that would normally be in a subcluster, such as a memory requirement, then it fails.
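
As an illustration of a match that depends on subcluster information, the sketch below writes a variant of helloworld.jdl with a memory requirement and runs the matchmaking against it. The file name hello-mem.jdl, the 512 MB threshold and the RUN_EDG_DEMO switch are assumptions chosen for the example; GlueHostMainMemoryRAMSize is the Glue attribute normally published in the subcluster.

```shell
#!/bin/sh
# Write a variant of helloworld.jdl that adds a memory requirement.
# 512 (MB) is an arbitrary illustrative threshold.
cat > hello-mem.jdl <<'EOF'
Executable = "/bin/hostname";
StdOutput = "hello.out";
StdError = "hello.err";
OutputSandbox = {"hello.out", "hello.err"};
Requirements = other.GlueHostMainMemoryRAMSize >= 512;
EOF

# Only contact the RB when explicitly asked (RUN_EDG_DEMO=1, a hypothetical
# switch) and when the EDG UI tools are installed.
if [ "${RUN_EDG_DEMO:-0}" = 1 ] && command -v edg-job-list-match >/dev/null 2>&1; then
    edg-job-list-match --vo ngs hello-mem.jdl
fi
```

With no subclusters published, this match would be expected to return no CEs even though the plain helloworld.jdl matches.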

Multiple CPU Jobs and Routing Queues

This can now be looked into. The important extra lines for the JDL are

 JobType = "MPICH";
 NodeNumber = 10;

This is causing problems. It appears that when the RB is given an MPICH job it submits a Globus RSL 'jobtype=single' containing an MPI launch, rather than sending the job as 'jobtype=mpi'. There is no way to submit a jobtype=mpi with the RB. This rather puts MPI submissions on hold for the time being.
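
To make the difference concrete, the sketch below builds the two RSL forms side by side. It is illustrative only: build_rsl is a hypothetical helper, /bin/hostname stands in for a real MPI program, and the exact RSL the RB generates (including how it wraps the MPI launch) is not reproduced here.

```shell
#!/bin/sh
# build_rsl JOBTYPE COUNT EXECUTABLE: print a GRAM RSL string.
build_rsl() {
    printf '&(executable=%s)(jobtype=%s)(count=%s)\n' "$3" "$1" "$2"
}

# Roughly the shape of what the RB appears to send: a single job.
build_rsl single 1 /bin/hostname

# What an MPI-aware jobmanager would expect instead: jobtype=mpi, count=10.
build_rsl mpi 10 /bin/hostname
```

Bypassing the RB, the second string could in principle be handed straight to the gatekeeper with globusrun -r and the CE contact string, but whether the routing queue accepts it has not been checked here.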

Check A Second Site

All of the above should now be tried with a second site within NGS. Choose the site where it is easiest to get the above working.

Enable all NGS Users to Submit to Two Sites

  • Steve enable the NGS VO on the RAL Tier1 RB, this is easy but will take a few days.
  • Kevin install the EDG edg-job-* commands somewhere that NGS users have access to.
  • Check a second fresh user can do something.

Review

At this point a review will be needed. Things that need discussing:

  • Will NGS sites ever be equivalent in EGEE to, say, Durham?
    • Can only be done if NGS sites are certified by UK/I ROC.
    • Requires monitoring from central CICs and so these users must be authorised. Recently the ops VO has been set up by the CIC to massively reduce the number of users needed compared with the dteam VO.