GridPP PMB Minutes 363 (19.10.09)
=================================

Present: David Britton (Chair), Sarah Pearce, Andrew Sansum, Tony Doyle, Jeremy Coles, Steve Lloyd, Tony Cass, Robin Middleton, Pete Clarke, John Gordon, Roger Jones.

Apologies: Dave Colling, Tony Cass, David Kelsey, Glenn Patrick, Neil Geddes

1. Status of CASTOR Data
==========================

Data written between the failure of the first RAID array on Sep 24th and the failure of the second RAID array on Oct 4th had been lost. The root cause was not obvious, although it was clear that the database had been restored to the wrong date and that files had then been wiped by the synchronization process. Investigations were trying to establish whether this was an error, an oversight, or a more fundamental problem within ORACLE or the databases. Initial reviews of the recovery process had not shown up any obvious mistakes, though it is clear that the restore date should have been checked before completion. Indeed, at the time it had been noted that there were surprisingly few problems, which is now explained by the earlier restore date.

TD asked whether there was adequate internal documentation. AS noted that it was more a case of inadequate breakpoints and checkpoints, although considerable time had been spent trying to do things carefully. CERN had been consulted during the process, for example.

DB noted that this was an extremely serious situation; the immediate response was to set up a separate track of disaster management (first meeting Thursday); the secondary response was to initiate a full review of the Tier-1 with a focus (again) on CASTOR/DB issues, for December. We are on thin ice at the moment, with sub-optimal hardware replacing the suspect EMC kit and a restore process that is flawed or not understood. It is essential that we ensure the integrity of our restore process with documentation, breakpoints, cross-checks, clarity of decision-making responsibilities, and regular verifications.

2. Status of CASTOR Hardware
==============================

Investigations were continuing on the problems observed with the EMC hardware that previously underlay the CASTOR, and other, databases. Suspicions were focusing on electrical supply issues because changing the supply route removed the errors. However, measurements indicate that all the supply is within mains specifications. A definitive test was being established. The next disaster management meeting on this thread is a week away; at that point decisions are needed on long-term hardware for CASTOR.

3. VOMS Transition Plan
======================

The VOMS service transitions from GridPP to NGS over the next 6 months. SP forwarded some information from Manchester about this, which raised questions about memory leaks in the latest VOMS release and the machines on which they run at Manchester, and about the route for users to submit tickets and the documentation of this process.

ACTION 363.1 SP to follow up on VOMS transition.

ACTION 363.2 JC to verify help process for VOMS.

4. Dissemination in EGI/CUE/ROSCOE
===================================

SP noted 1.5 FTE in Amsterdam for dissemination in EGI and the CUE SSC bid for 4.75 FTE for dissemination in the UK, led by QMUL (complementing Training at Edinburgh). It was unclear to DB and SP how dissemination within ROSCOE would work; email from Cal had not really clarified. DB and SP felt there was no funded effort within ROSCOE but the UK could contribute via CUE if some of those posts were funded. For now, a watching brief would be maintained.

5. Weekly Notes
===============

DB noted that AHM abstract 73 had been recommended by both reviews for a talk but had only been accepted as a poster, despite a last-minute promotion of a number of posters to talks to fill the available slots. DB asked whether people felt they accrued any benefit from posters, given the amount of work involved.
SP noted that Neasan could create the poster from the information compiled (the abstract was paper-like in length!).

ACTION 363.4: DB to contact AHM to request the appropriate feedback on abstract 73.

Tier-2 allocations: SL provided an update on the process of determining the Tier-2 hardware allocations. Some input numbers may still be missing, such as future hardware costs and revised experiment requirements, but since this was a cash-limited exercise, it would not change the final outcome too much. SL would circulate to Deployment Board; DB would look at input numbers.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS provided the following report:

Fabric
======

1) Lot 2 of the disk servers has failed acceptance. We are working with the supplier to identify the cause. It's looking increasingly unlikely (30% likely) that we will have a new solution on the ground by early November, which is our latest delivery date in order to re-certify and achieve deployment by Christmas. (Will email the PMB separately with more details.)

2) New procurements have started.
- Disk ITT has closed and evaluation will commence this week. Delivery target, December and April. We needed to issue a clarification and a request to retender. This has led to a 6-week delay in the tender. Delivery now expected end of January.
- CPU PQQ has been evaluated. The invitation to tender has been issued on schedule. Delivery target February.

3) We have ordered 9 T10KB drives. We aim to move CMS to T10KB by December.

4) Procurement is underway for an additional 4*1Gb/s second OPN link to CERN as resilient backup.

Staffing
========

Nothing to report - will drop from the report next week (all being well).

Service
=======

1) SAM availability for the OPS VO was 84%.
Weekly production report is at: https://www.gridpp.ac.uk/wiki/RAL_Tier1_Experiments_Liaison_Meeting_Operations_Reports

2) CASTOR

a) Following the restoration of the CASTOR service (see http://www.gridpp.rl.ac.uk/blog/2009/10/13/summary-of-recent-tier-1-outage/) we received reports of missing files. By Wednesday 14th October it became apparent that files created between 24th September and 4th October were lost. Investigations of the cause are ongoing, but it appears that an old version of the database was either incorrectly restored or became live. Initial internal review of our restore activities has shown nothing wrong with our process and we are working with Oracle to understand how this could have happened. The correct version of the database has subsequently been restored (using a different technique) and file names of missing files have been retrieved; however, the data itself cannot be recovered.

b) We continue to have problems with the original CASTOR RAID arrays, and CASTOR (also LFC/FTS) is currently running on alternative hardware. We are carrying out a number of systematic changes in order to isolate the cause of the fault, currently believed to be environmental in nature. We are also working with the hardware supplier.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ was recovering from downtime.

SI-3 CMS weekly review & plans
-------------------------------

DC was absent.

SI-4 LHCb weekly review & plans
--------------------------------

GP was absent.

SI-5 Production Manager's Report
---------------------------------

JC reported as follows:

1) At the GDB last week (http://indico.cern.ch/conferenceDisplay.py?confId=45480) it was confirmed that SCAS is still the authorisation service that sites should deploy. ATLAS has indicated that a move to multi-user pilot jobs will be required soon. Direction on this is expected from the MB soon. To be ready, each GridPP Tier-2 is now setting up an instance for testing.
In parallel, one site will also look at ARGUS (the longer-term solution for an authorisation service). The assumption is that the PMB will support an MB decision that sites should now enable experiments to use multi-user pilots.

2) Last week we saw several problems with the camont n-gram work, all of which seem to be resolved. Firstly, there was an http request sent to a subscription service, and this pointed to an error in the way the robots.txt file was parsed. This has led to a temporary banning of the user DN and VO at the submitting site (site policy while the situation is followed up). One external (private company) website administrator was curious about the probing of its site and asked for clarification; apparently the spider header (which describes why the spider is being used) was not populated. Finally, camont tried to improve the efficiency of jobs by performing DNS lookups but quickly realised that this led to concern at a few sites due to the DNS loads, and they removed the extra lookup.

3) There is increasing concern about the lack of an adequate disk pool drain function in DPM. This means that it can take weeks/months to clear a disk server for maintenance purposes (e.g. to change partition sizes). This is just one area that is leading to questions about the future support for DPM. GridPP now has more than 15 sites using DPM and yet support is not clearly defined. Should this concern be escalated to the MB?

4) There is a request that all WMS nodes be upgraded to gLite 3.2 by the end of October. Those at IC and RAL are now upgraded. One at Glasgow is showing problems and so the other is remaining on 3.1 until stability has been achieved for the upgraded one. gLite 3.2 is needed for the switch of the CREAM-CE to production.

For reference:

A) The EGEE PPS is now moving to what is being called "staged rollout". Details of the release procedure are here: https://twiki.cern.ch/twiki/bin/view/EGEE/ReleaseProcedure.

B) There is an SA1 Coordination Meeting tomorrow.
For the current topics for this group see http://indico.cern.ch/conferenceDisplay.py?confId=71024.

C) Currently Glasgow and RAL Tier-1 have CREAM CEs running. At the GDB last week, moving the CE tag from "special" to "production" was discussed. ALICE is keen for this move and it would help further the testing for the other experiments. Within GridPP we now plan to set up CREAM so that there is at least one instance for each Tier-2.

D) At the Technical Management Board (TMB) last week discussion started about the timeline for gLite4: https://twiki.cern.ch/twiki/bin/view/EGEE/GLite4Planning. It will be available on "SL5/x86_64, Debian 5 and other platforms".

E) Not all sites patched their worker nodes by the deadline last week. As agreed by the EGEE PMB, sites that remained unpatched have been suspended.

SI-6 LCG Management Board Report
---------------------------------

There was no MB Report this week.

SI-7 Dissemination Report
--------------------------

SP had covered this under item 4.

REVIEW OF ACTIONS
=================

No time for review this week.

ACTIONS AS AT 12.10.09
======================

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly set up environments. This was still in progress - DB noted that the next Quarterly Reports will help and possibly render the action redundant. SP asked that this remain open until the next Quarterly Reports.

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring.

358.1 SP to work with the working group on the following issues in relation to GridPP/NGS convergence:
1. identify Institutes
2. identify manpower
3. decide who is bidding for what
A draft transition plan would be made available by the end of the year; GridPP4 requirements would also be considered. SP was waiting on the Working Group to reply to her.
358.2 GP will talk to LHCb and see if they can progress the issue of CASTOR 2.1.8, and come back to us. We would require a strong plea from LHCb that they want this by December. DB would contact Raja Nandakumar. JG would follow this up at CERN.

359.4 JC to follow up dTeam actions from the DB, as follows:
---------------------------
05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda).
---------------------------

359.5 Graeme Stewart, Lee Barnby (experiment reps) each to contact Neasan O'Neill, advising where, specifically, the GridPP website should point to for each of their experiments, in terms of user support information. (DC & RN had already done this.) SP to follow up. ONGOING.

359.6 SP to ensure that Neasan O'Neill updates the GridPP website accordingly (once experiment reps have provided info as to where the GridPP website should point to for each of their experiments, in terms of user support information).

361.3 JC and AS to check Tier-1 and Tier-2 gstat2 results (in relation to SL5 having been discussed at the GDB).

The next PMB meeting would take place on Monday 19th October at 12:55 pm. JG put in his apologies.