GridPP PMB Minutes 362 (12.10.09)
=================================

Present: John Gordon (Chair), Sarah Pearce, Andrew Sansum, Dave Colling, Tony Doyle, Jeremy Coles, Steve Lloyd, Tony Cass, Robin Middleton, Pete Clarke (Suzanne Scott, Minutes)

Apologies: David Britton, Roger Jones, David Kelsey, Glenn Patrick, Neil Geddes

In attendance: Gareth Smith

1. Tier-1 Situation
====================

Gareth Smith provided an update:

a) They were currently running through the disaster management system, with progressive escalations and team discussions.

b) All services were currently up except the 3D databases, including the LHCb LFC which was being worked on today. There were synchronisation issues.

c) To recap: there had been multiple failures in the disk sub-systems supporting the CASTOR, LFC and 3D databases. The databases had been restored onto alternative hardware, which meant that three different types of hardware were now supporting three different databases. None were still running on the kit that had the problems. CASTOR was restored at 5pm on Friday, but there were ongoing concerns about missing files.

d) They were currently trying to understand the common cause of the multiple failures. This could be environmental. AS noted that it could also be electrical - they did not know yet. There were correlations with the electrical distribution.

TD asked whether monitoring information was available. AS advised that it was: they had a system that gave logs and harmonic analysis. However, they needed better equipment, and this was currently being used on Diamond. JG advised that they had only recovered because they still had the old hardware, which was redeployed.

TD asked if there was a reason for slower speeds. JG said we had not been told about this. TD said the observation had been circulated on the UK ATLAS Operations list. AS noted that it wasn't obvious that performance was poorer. JG noted that if it became apparent that this was the case, then we would publicise it.
Gareth Smith noted that they were evaluating the current situation. It could take a day to see whether a single change had made any difference, so eliminating possible causes could be a long procedure: possibly 1-2 weeks running on alternative hardware. Gareth Smith asked if this was an acceptable level of risk. TD asked what the timescale was for better understanding the electricity supply. AS reported 1-2 weeks in the first instance to find correlations. JG asked what options were available to us if, after a couple of weeks, they still didn't know. TD suggested a move to the ATLAS building. Gareth Smith indicated that the state of power was cleaner in R89 than it was in the ATLAS centre. TD asked if there was anything to do with respect to users and lower performance over the next few weeks. JG advised that the redundancy was reduced.

Gareth Smith left the meeting at this point.

The following URL provides all the information about the outage:
http://www.gridpp.rl.ac.uk/blog/2009/10/13/summary-of-recent-tier-1-outage/

2. UK Pledges
==============

JG reported that at the last MB meeting, Ian Bird had published the latest information on experiment requirements and received pledges. DB checked this, and the ATLAS total CPU requirement had gone down by 10% - there was a dispute between ATLAS and the Scrutiny Group, and ATLAS were advised to use the lower figure. This resulted in the UK pledging 14% of ATLAS requirements. The question was how to explain this to STFC. Tony Medland could reduce the pledge at the RRB, or before. JG noted that this was within our procurement fluctuations.

3. Accounting Changes
======================

SL reported that the meeting with JC and the T2 Co-ordinators had taken place; they had exchanged numbers and discussed them. There had been agreement on action, and some figures were changed. Further iterations had now converged. He was awaiting final responses.
The outcome was a correction to the accounting numbers that came out of APEL - these new numbers should be taken as correct and will be used to evaluate hardware. UCL and Cambridge were still outstanding, but this would not affect the figures overall (they have no log files). JG observed that the October accounts should tell us the increase and decrease and what they have delivered. SL advised that the ratio of resources going to the experiments was also a factor in future hardware funding.

4. All Hands Meeting
=====================

SP reported that she had sent round an email asking people to advise her of All Hands papers. Thus far she had noted papers or posters from DC, JC, Sam Skipsey and Stuart Purdie. She had received nothing further from anyone else. SP noted that the booth was there - it would be next to the NGS booth, and the partition between them could be removed, with an NGI space in the middle.

JG asked about attendance. RM noted that he didn't usually receive notification of UK attendance in advance. SP advised that All Hands was £150 and IEEE £410, and the combined fee was £540. Students got in free. TD reported that David Forrest had submitted a paper on Neutrino Factory, and it had been accepted. RM advised that people needed to be organising, convening, or presenting a poster in order to attend. PC commented that it would be good to have a presence in relation to the e-Science reviewers. SP noted that after 15 October the price would go up.

It was agreed that RM would check previous guidelines and wording in relation to attendance, and send round an email notifying people of the attendance guidelines for AHM and IEEE.

5. AOB
=======

None.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS provided the following abbreviated report:

1) The Tier-1 was down for most of last week after almost simultaneous failures on 4 separate RAID arrays supporting LFC, FTS, 3D and CASTOR.
An environmental problem is suspected. The LFC/FTS service was restored in under 24 hours on alternative hardware, but CASTOR was down from Sunday to Friday until it too was restored on alternative hardware. The restoration of the 3D service to alternative hardware is ongoing. Investigations of the cause of the instability are ongoing and are likely to continue for several weeks. Planning is underway to consider our longer term strategy to move back to a more resilient configuration.

2) The disk tender has been delayed by 6 weeks and is now expected to deliver at the end of January. Problems in the original tender evaluation led to a requirement for further clarification.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ was absent.

SI-3 CMS weekly review & plans
-------------------------------

DC reported that the October exercise was ongoing and had been useful so far. Availability of sites passing the SAM tests had come down across the Tier-2s, but the UK was doing quite well.

SI-4 LHCb weekly review & plans
--------------------------------

GP was absent.

SI-5 Production Manager's Report
---------------------------------

JC reported as follows:

1) Most recent operational change requests have come from ATLAS. A new ATLASHOTDISK space token has been introduced to help with the hot file issues mentioned over recent months. Most sites are reacting quickly to this request. Access to conditions data at Tier-2s has suddenly come up as requiring attention - the request is for Tier-2s to deploy and enable a squid cache. So far only a few sites have enabled this cache (it is being tracked here: https://twiki.cern.ch/twiki/bin/view/Atlas/T2SquidDeployment ). A frontier server will be placed at the Tier-1.

2) GridPP sites have been asked to check and verify data displayed in the new gstat2 interface: http://gstat-prod.cern.ch/gstat/.
Feedback so far suggests reasonable accuracy for the storage figures; for the CPU, some corrections are still required for the logical CPU numbers (to a first approximation this represents the number of job slots).

3) Another GridPP site has been impacted by the long running security incident. It was noted that this related to the earlier extended SSH root problem that had affected HEP and HPC sites.

4) EGEE has requested further volunteer sites to help test the new authorisation framework ARGUS. While this is important, GridPP sites are currently focussing on the SL5 migration and preparations for data taking. This would be discussed at the GDB.

SI-6 LCG Management Board Report
---------------------------------

JG reported that the main issues discussed had related to pledges, and quarterly reports from ALICE. ALICE reported that they could not yet access SL5 at RAL via CREAM; however, JG had ascertained that this was incorrect. The experiments (except for CMS to date) have profiled their quarterly requirement for disk in 2010.

SI-7 Dissemination Report
--------------------------

SP noted nothing major to report. Neasan O'Neill was at an EGEE Concertation Meeting this week; he had completed an article on gqsub, and was in the middle of doing an article on the R89 move.

REVIEW OF ACTIONS
=================

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly set-up environments. This was still in progress - DB noted that the next Quarterly Reports will help and possibly render the action redundant. SP asked that this remain open until the next Quarterly Reports. ONGOING.

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring. ONGOING.

358.1 SP to work with the working group on the following issues in relation to GridPP/NGS convergence: 1. identify Institutes 2.
identify manpower 3. decide who is bidding for what - a draft transition plan would be made available by the end of the year; GridPP4 requirements would also be considered. ONGOING.

358.2 GP will talk to LHCb and see if they can progress the issue of CASTOR 2.1.8, and come back to us. We would require a strong plea from LHCb that they want this by December. DB would contact Raja Nandakumar. ONGOING.

359.4 JC to follow up dTeam actions from the DB, as follows:
---------------------------
05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda). ONGOING.
---------------------------

359.5 GS, RN, DC, LB (experiment reps) each to contact Neasan O'Neill, advising where, specifically, the GridPP website should point to for each of their experiments, in terms of user support information. SP to follow-up.

359.6 SP to ensure that Neasan O'Neill updates the GridPP website accordingly (once experiment reps have provided info as to where the GridPP website should point to for each of their experiments, in terms of user support information). ONGOING.

361.1 JC to speak to Winnie Lacesso (regarding the kernel updates) about removing Bristol's CE and disabling the site by Wednesday 14th October. DB to write formally if she felt this was required. Done, item closed.

361.2 DB to contact NG re EGI Global Tasks and inform him that there were no additional tasks being bid for by GridPP. Done, item closed.

361.3 JC and AS to check Tier-1 and Tier-2 gstat2 results (in relation to SL5 having been discussed at the GDB) - ONGOING. JC to devolve any action to the dTeam - Done, half-item closed.

ACTIONS AS AT 12.10.09
======================

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly set-up environments.
This was still in progress - DB noted that the next Quarterly Reports will help and possibly render the action redundant. SP asked that this remain open until the next Quarterly Reports.

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring.

358.1 SP to work with the working group on the following issues in relation to GridPP/NGS convergence: 1. identify Institutes 2. identify manpower 3. decide who is bidding for what - a draft transition plan would be made available by the end of the year; GridPP4 requirements would also be considered. SP was waiting on the Working Group to reply to her.

358.2 GP will talk to LHCb and see if they can progress the issue of CASTOR 2.1.8, and come back to us. We would require a strong plea from LHCb that they want this by December. DB would contact Raja Nandakumar. JG would follow this up at CERN.

359.4 JC to follow up dTeam actions from the DB, as follows:
---------------------------
05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda).
---------------------------

359.5 Graeme Stewart, Lee Barnby (experiment reps) each to contact Neasan O'Neill, advising where, specifically, the GridPP website should point to for each of their experiments, in terms of user support information. (DC & RN had already done this). SP to follow-up. ONGOING.

359.6 SP to ensure that Neasan O'Neill updates the GridPP website accordingly (once experiment reps have provided info as to where the GridPP website should point to for each of their experiments, in terms of user support information).

361.3 JC and AS to check Tier-1 and Tier-2 gstat2 results (in relation to SL5 having been discussed at the GDB).

The next PMB meeting would take place on Monday 19th October at 12:55 pm. JG put in his apologies.