n west (APC)
Last modified: Thu Nov 29 13:07:44 GMT 2007

MINOS input to UB Meeting on 04 Dec 2007

Prologue

This UB meeting represents a watershed for MINOS. In the past, problems with the GRID infrastructure have not impacted us greatly and we have continued to use qsub for all our production work. We have invested a lot of effort in trying to prepare, but still have serious concerns that the migration to the GRID, although inevitable, will badly affect us in the short term. Consequently I am reiterating a plea, which I have made before, that someone within GridPP be given the job of helping small VOs like MINOS as a primary responsibility. Ideally this would be a permanent arrangement but, if that is not possible, it should at least cover the next 3 to 6 months while we transition. I expand on this request in the section Request for Small VO Support.

Update on experiment requirements for 2008

Alex Sousa should have returned (or shortly will return) a spreadsheet requesting:-
    * CPU: 125 kSpecInt2k
    * Disk: NFS 6.9TB
    * Disk: dCache 2.0TB (migrating to Castor)
    * Tape: 10 TB (dCache migrating to Castor)

Progress on Castor as Grid storage solution

We are only just starting to get our Castor allocation, as our disk servers were given to LHCb some months ago, so we have nothing to report.

Progress on testing Grid User Interfaces on other computing clusters beside Tier 1

On the very narrow question of whether we can submit jobs from other UIs, we have done a series of test runs at both Oxford and RAL PPD and found no problems. On the broader question of whether we are happy to give up the RAL UI, we do have concerns and request retention of the UI, or an equivalent, at RAL as part of our Special cases for non-Grid access.

Progress on Grid submission instead of qsub

We have yet to run any full-length production jobs on the GRID: despite having a long-lived MyProxy server running, all our jobs fail as soon as the short-term proxy expires! As the grid500M queue typically has a day's worth of jobs, often our jobs don't even start to run. Derek Ross is investigating, but this means we have no progress to report.
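
To make the timing problem concrete, here is a minimal sketch of how the queue wait and the job length compare with the life of the short-term proxy. It assumes that proxy renewal is not happening at all, which is what our failures look like. The function name and the figures in the usage example are purely illustrative; only the "day's worth of jobs" in the queue comes from the text above.

```python
from datetime import timedelta


def classify_job(proxy_lifetime, queue_wait, run_time):
    """Classify a job by when the short-term proxy expires.

    Assumes the proxy is never renewed (the failure mode described above).
    All three arguments are datetime.timedelta values.
    """
    if queue_wait >= proxy_lifetime:
        # The proxy has gone before the job leaves the queue.
        return "proxy expires before the job starts"
    if queue_wait + run_time > proxy_lifetime:
        # The job starts but is still running when the proxy expires.
        return "proxy expires while the job is running"
    return "job completes within the proxy lifetime"


if __name__ == "__main__":
    # Illustrative values only: a 12 hour proxy and an 8 hour job are
    # assumptions; the 24 hour wait reflects "a day's worth of jobs".
    print(classify_job(timedelta(hours=12), timedelta(hours=24), timedelta(hours=8)))
```

With these numbers the job never starts, which matches what we often see; with a short queue the same job would start and then fail part way through.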

Also, the grid500M queue appears very heavily used, based on EstRespTime. The grid1000M and grid2000M queues appear to have much better times but we cannot use them. Some of our analysis jobs now require 1GB of memory, so need grid1000M, but beyond that our main effort, MC, only needs grid500M but is starved of CPU. Can anything be done to even this up?

On a related point, we don't know how to debug when jobs exceed some limit and are aborted, returning no output at all. What do other experiments do? Is there any way to access the output of jobs running on the GRID, similar to qcat, that could help us understand what is going on? I have spoken to Stephen Burke and it looks like we have to wait for gLite/WMS for any improvement. I have just heard that Catalin has finished setting this up, so there may be a partial solution soon.

Special cases for non-Grid access

The Summary

This section is rather long and there will not be time at the meeting to discuss it in detail. So, to summarise:-

  1. Request access for VO management, a need which I think has been recognised and accepted.

  2. A UI for group-wide batch submission, both because production management roles rotate between universities, so centralising on RAL is logical, and because most of our data is on NFS disk, making management from any other UI very clumsy.

  3. A short-term request that existing users (<= 20) retain access to RAL until we no longer have most of our data on NFS disk. Continuing to use the UI there for private batch submission would be nice but not essential.

Matt Hodges suggested that it might be possible to set up a "VO box" to be shared by small VOs like ours. MINOS would be very interested in such a service. Indeed, if it were at RAL and it had access to our NFS disks then we could probably withdraw all of the requests in this section.

The Detail

PREAMBLE: Our Data Access Strategy

Before I get to the requests I think it would help if I were briefly to explain our data access strategy. At the moment our data is distributed over a heterogeneous mix of devices:-
1)  FNAL enstore
2)  Local NFS disk
3)  RAL dCache
4)  RAL Castor (currently not available)

To further complicate things, some of our datasets are defined in terms of SAM queries on the FNAL data store although the data themselves are often on local disk or an SE. To isolate production from details about physical location I have, over the years, developed DCM (Data Cache Management), which presents all the data in the same way, with catalogues that map user requests from file name to file locations. A data-driven back-end then selects the appropriate commands (currently cp, wget, dccp and rfio, with lcg-utils soon to be added). To obtain DCM catalogues of the SEs at RAL I requested, and was given, permission to perform nightly "sympathetic" scans of our entire data set.

In a sense DCM is our equivalent of LFC and lcg-utils but in the short to medium term, with our substantial NFS disk allocation and our reliance on FNAL for data and meta-data, LFC cannot be a complete solution for us.
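
For readers unfamiliar with the idea, the following is a minimal sketch of a catalogue lookup feeding a data-driven back-end. It is not the DCM code: the catalogue layout, the function name, the file name, the paths and the exact argument order of each command are all assumptions made for illustration. Only the tool names are taken from the text above (rfcp being the RFIO copy command).

```python
from dataclasses import dataclass

# Hypothetical table driving the back-end: one command template per storage
# type. The argument layouts are illustrative assumptions, not DCM's own.
COPY_COMMANDS = {
    "nfs": ["cp", "{src}", "{dst}"],
    "http": ["wget", "-O", "{dst}", "{src}"],
    "dcache": ["dccp", "{src}", "{dst}"],
    "castor": ["rfcp", "{src}", "{dst}"],
}


@dataclass(frozen=True)
class Location:
    storage: str  # key into COPY_COMMANDS
    path: str     # physical location on that storage


# Hypothetical catalogue: file name -> every known physical copy.
CATALOGUE = {
    "example_run_0001.root": [
        Location("dcache", "/pnfs/example/minos/example_run_0001.root"),
        Location("nfs", "/example/nfs/minos/example_run_0001.root"),
    ],
}


def build_copy_command(file_name, dest, catalogue=CATALOGUE,
                       preferred=("nfs", "dcache", "castor", "http")):
    """Return the argument list that would fetch file_name into dest.

    The caller names only the file; the catalogue supplies the location and
    the command table supplies the tool, tried in order of preference.
    """
    locations = catalogue.get(file_name)
    if not locations:
        raise KeyError(f"{file_name} is not in the catalogue")
    by_storage = {loc.storage: loc for loc in locations}
    for storage in preferred:
        loc = by_storage.get(storage)
        if loc is not None and storage in COPY_COMMANDS:
            template = COPY_COMMANDS[storage]
            return [part.format(src=loc.path, dst=dest) for part in template]
    raise LookupError(f"no usable copy of {file_name}")


if __name__ == "__main__":
    print(build_copy_command("example_run_0001.root", "/tmp/out.root"))
```

The point of the table-driven design is that supporting a new access method, such as lcg-utils, would mean adding one entry to the command table rather than changing any production script.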

Right, now on to the requests for resources.

Request 1: VO Support

Now I believe that the case has already been accepted for interactive access to maintain VOs but I will formalise it by listing the types of activity for which we still need interactive access:-

  1. Database Management
    We run a database distribution system and update our databases nightly on the MySQL server sql.gridpp. Mostly this runs as a cron job, although I do occasionally need to log in to investigate problems.

  2. Catalogue Generation
    As explained in the preamble, I have a nightly cron job that forms catalogues of our SEs. Again, I may need interactive access to fix problems.

  3. General trouble shooting
    It is hard to enumerate exactly why I would need access but, given that most of our local storage is NFS disk, interactive access is the only sensible way to manage it.

I think it wise to allow one other person access so that we don't have a single point of failure, and would request to have permanently:-
1)  2 accounts for VO managers
2)   ~ 50GB disk space
3)  < 1 cpu
4)  Interactive and cron login

Request 2: UI for collaboration wide Monte Carlo

The UK group perform a very valuable service to the collaboration as a whole, providing approximately 50% of all the Monte Carlo. It makes sense to concentrate all of this effort at a single UI. We would request that this UI be at RAL for the following reasons:-

  1. Currently most of the data is on NFS disk so sorting out problems without interactive access would be very hard. Although this does not require a UI it makes sense to consolidate operations of job submission and data management at a single place.

  2. Even when our NFS disk use has declined and our local data is in SEs, we would like to retain our UI for collaboration-wide work at RAL as this simplifies the rotation of our Production Manager. As I write this, Mike Kordosky (UCL) is stepping down from the role and Marta Tavera (Sussex) is taking over. At Sussex there isn't a UI and apparently there is little interest in hosting one. Further, Alex Sousa (Oxford) is taking over Mike's role of Physicist representative. Naturally the two roles are closely coupled and he will need access to the UI used for production work. As both roles rotate, it would make our lives far simpler if we could remain at RAL rather than pick some UI at a university.
So our requests, again building in redundancy, are:-
1)  2 accounts for Production Managers
2)   ~ 50GB disk space
3)  < 1 cpu

For both VO management and Production Managers we would actually prefer single accounts, each with two SSH keys, or possibly even one account with four SSH keys, as then we won't run into file permission problems and the like. I have already proposed this in a slightly extended form and got back the very clear message that this was considered a security risk. As there are good management reasons for wanting this, I really would like to understand why two accounts each holding one SSH key is considered safer than one account holding two. If anything, I would have thought that the latter was marginally better.

Request 3: UI for private analysis

MINOS fully accept that in the long term there is no case to be made for requesting the RAL UI for private work carried out by UK members. At Oxford we are currently running tests and, though we have some problems, they are not related to the physical location of the UI. However, in the short term, while our data remains on NFS disk, data management for private work is far simpler if interactive access is permitted. Here our request is:-
1) Existing MINOS UK account holders be permitted to retain them until our data has moved into Castor.
In total this is <= 20 accounts.

As a convenience it would be simpler for us if we could continue to use the UI at RAL for these users, but that is not critical.

Request for Small VO Support

As a small VO, MINOS has, not too surprisingly, not been top of the service list. Here is a list of problems/requests and response times in the past 6 months. Now I am not claiming that all of these are show stoppers; they aren't. Nor am I claiming that people aren't doing their jobs; I know they are, and when they can spare the time I have had lots of very helpful email exchanges. I am sure that it is simply that, with the LHC coming on-stream and with so much going on, it's just too easy to overlook a small VO. Until now that hasn't really mattered, but from now on it will.

If there were a single contact in GridPP for small VOs (i.e. those who have never had GridPP-funded posts) who could give general advice and help sort out problems as and when they arise, I believe it would be of great benefit.

Migration of data from ADS Tape Store

I believe that about a year ago an offer was made to help people migrate data from the old ADS Tape Store. MINOS have about 4TB (1TB MINOS + 3TB Soudan2) that they would like to migrate, either to dCache or, given that this service will close soon, to Castor. What support is there for this?