GridPP PMB Minutes 364 (26.10.09)
=================================

Present: David Britton (Chair), Sarah Pearce, Andrew Sansum, Dave Colling, Tony Doyle, Jeremy Coles, Pete Clarke, Steve Lloyd, (Suzanne Scott, Minutes)

Apologies: Roger Jones, David Kelsey, Tony Cass, Robin Middleton, John Gordon, Glenn Patrick, Neil Geddes

1. Tier-2 Hardware Allocations
===============================

DB reported that SL had circulated a spreadsheet. SL reported that over the last 3-4 weeks he and the Technical Co-ordinators had agreed the conversion of the accounting numbers from APEL into HEPSPEC. SL advised that he had followed the procedure through and anomalies were found. Sites with multiple experiments were getting more, and it was decided to normalise the experiment-institute matrix - the previous 1s and 0s for the institutes were replaced by fractions and normalised. The matrix didn't map directly. The finalised version had been sent to DB and the experiments following this exercise. Discussion had taken place but no concrete proposals had been forthcoming. There was a question as to whether CMS wanted their Tier-3 institutes included, but probably not. ATLAS were discussing disk resources at sites rather than spreading them around - sites had been allocated as Monte Carlo or analysis.

DB had provided new costings and new global requirements. SL advised that the numbers would change slightly but would not make a huge difference. DB asked where this would go next - to the CB? SL advised that we needed to finalise the figures thus far - if the PMB agreed the approach then he would update the requirement numbers and prices, get the bottom line, then send this to the CB. DB noted that a replacement table was needed for the Appendix to the MoU. AS confirmed he would send info shortly on achievable price for disk.

There followed a discussion on the figures and columns etc within SL's tables.
It was agreed to proceed to finalising a suitable table, get it approved by the PMB, and then send it on to the CB. Experiment reps had to confirm that they didn't want any changes.

2. Tier-1 Review
=================

DB advised that for the last two years we have reviewed the Tier-1 in lieu of the Tier-1 Board. Another review was now required. Previously this had been raised and agreed in principle with RAL, but due to recent events this review has been brought forward to December 14th. The membership for the review would probably comprise Jamie Spears; hopefully TC from CERN could attend; and hopefully Maria Girone (for database & 3D input). DB would ask her. TC had suggested Michael Ernst and Bert Panzer (as last year) who would have a high-level view which would complement the technical review. TD commented that Michael Ernst would be very useful. DB would investigate whether he could come.

In relation to GridPP, DB noted that it was open to those in the PMB who wished to go. PC and GP had both confirmed. SP could join remotely. SL, DC & TD confirmed they could attend. AS advised that he had already contacted the RAL Team Leaders and they should be able to attend; he had also booked a meeting room. DB also reminded the PMB that there was a PMB F2F happening at Imperial College on 10th December.

AS asked what the Tier-1 Review should cover. DB noted it was important not to 're-hash' the current disaster management plan - this should be covered within 'scene-setting' only. Main issues should be:

1. protection put in place to ensure safety of custodial data;
2. CASTOR/database communications/interface need to be evaluated, going forward, to ensure they are currently, and will be, fit-for-purpose in the run-up to data-taking;
3. examination of any other high-level areas like resilience and service issues, plus failover plans

It was important that the day did not become un-focussed - this was a forward look to key issues in the run-up to data-taking.
DB noted the priorities as:

1. not losing data;
2. continuity of service

However, the exact format was still to be discussed. The presentations should not be retrospective.

3. GridPP4 Preparations
========================

DB advised that he had forwarded an email from Malcolm Booy - the PPRP will consider the GridPP4 proposal on June 30th/July 1st next year; submission of the proposal would be on 28th April. DB advised that there were OC meetings beforehand; the PMB F2F meeting at Imperial was intended to kick-off GridPP4. DB noted that he would speak individually with PMB members before the 10th December meeting, in relation to some areas of the project.

AS reported that he had already compiled a list of issues for GridPP4 and the work that was needed. DB would try and find some time for them both to meet. AS would check his diary, but considered that it might be preferable if he came to Glasgow for the meeting.

DB noted that at the end of the Imperial College meeting, the shape of the GridPP4 proposal must be decided, as well as agreeing who would work on what. Outlines would be required before the middle of January. The OC was on 4th February, at which time we were required to give a detailed outline of the proposal plus a financial estimate. There would be another PMB F2F at RHUL on Tuesday 13th April 2010, and the GridPP4 proposal would have to be signed-off at that meeting. A separate PMB F2F would be required before this, date TBC, depending on whether the 4th Feb OC date can be changed.

Timeline summary was as follows:

- Nov-2009: Informal discussions between DB and various members of the PMB intended to identify issues and constraints.
- Dec 10th 2009: F2F PMB meeting at IC. Goal here is to agree scope of proposal and assign work to people.
- Jan 15th 2010: First outline (headings plus point-form content). Documents to OC needed by the following week.
- Feb 4th 2010: Oversight Committee - have asked STFC to move to March but suspect this will not happen.
[If the OC meeting goes ahead, then we may need a F2F soon thereafter; if the OC meeting is moved to March, then we may need a F2F 2-3 weeks prior to that meeting.]
- Feb 26th: Full draft of proposal.
- Mar 19th: Second draft.
- Apr 13th 2010: Final draft and F2F meeting at RHUL.
- Apr 28th 2010: Submission deadline
- Jun 30th/Jul 1st: PPRP meeting (presumably a presentation).

4. Status of Tier-1 Issues
===========================

a) EMC hardware
----------------

AS reported that there had been a disaster management meeting today - they were still hunting for the underlying cause of the destabilised RAID arrays. Experiments with the electrical connections had been carried out. It was looking like it might be a power supply issue. Some RF diagnostic filtering had been done on the racks. They were narrowing down the field of search but it could be another month for further search and testing of issues. An audit had been carried out of the existing temporary solution (use of the old hardware for CASTOR & 3D etc), in order to identify areas of concern. They were dealing with maintenance issues and considering reverting to the old configuration; this won't happen yet, but it was intended to be completed prior to first injection.

b) CASTOR data loss
--------------------

AS had circulated a preliminary report last week explaining the understanding of the cause of data loss. In brief, this was as follows:

A decision had been made to roll-back the database to a previous point in time when it was stable - to 3rd October at 15:20 hrs, and a restore was carried out. However, underlying hardware problems arose. Twenty minutes following the initial restore, the databases were re-started and the prod array didn't come up, but the mirror array did. Staff didn't realise this, and the mirror array was running from 23rd September. The restore was therefore unknowingly carried out to 23rd September and the intervening data was wiped.
Discussions with ORACLE were currently taking place in relation to the array configurations but no conclusions had been reached as yet. It was understood however, that the Tier-1 could not continue to run with mirrored systems. It would be preferable to run with a single instance copy until further recommendations were forthcoming. The database team were looking at recovery processes in conjunction with ORACLE. They were also talking to CERN about the CERN configuration.

c) Current Disk Procurement
----------------------------

AS reported that Lot 2 of disk servers had failed acceptance. They were working with the supplier to identify the cause. Good progress had been made in the last week. Multiple problems in the hardware had probably been identified by the manufacturers (who have been working hard to assist). Three RAID controllers were currently under test (so far results were promising). It seemed increasingly likely that the underlying cause will be identified within 1-2 weeks. It remains to be seen if replacement hardware can be in place in time to commence acceptance testing by mid-November. Likelihood of a deployable solution by end of December remains at 30%.

2) New procurements have started.

- Disk ITT has closed and evaluation will commence this week. Delivery target, December and April. A clarification had been issued and a request to re-tender. This has led to 6 weeks delay in the tender. Delivery now expected end of January.
- CPU PQQ had been evaluated. The invitation to tender had been issued on schedule. Delivery target February.

d) Aircon
----------

AS reported that there had been 3 cooling incidents in August.

1. The first related to the building management system being rebooted, which took out the aircon.
2. No cause has as yet been found for the 2nd failure.
3. The sensor was measuring over-pressure in the chiller system and shut-down the aircon, and it couldn't be re-started.
AS advised that there was a tender happening now in order to resolve issue (1), and monitoring was now in place. Re (3), no explanation was available for the cause, however the pressure limits had been raised on the sensor, which will now raise a callout to the management system rather than shutting the aircon down. The tender was out for additional chillers in order to increase resilience in the pump setup. The equipment had to arrive this FY in Q1. They understood the requirement to minimise risk - there will be a machine ops meeting this week to discuss.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS reported as follows:

1) We have received 9 T10KB tape drives and will begin testing the new hardware shortly (in contention with work on the faulty CASTOR RAID arrays).

2) Procurement is underway for an additional 4*1Gb/s second OPN link to CERN as resilient backup.

3) Work continues to understand the underlying cause of the CASTOR/FC+FTS RAID array hardware problems. The quality of the electrical supply is being investigated. The data is complex and analysis is proving difficult; progress may recently have been made, but more elapsed time is required to gain confidence that we can eliminate the fault.

Service:

1) SAM availability for the OPS VO was 100%. Weekly production report is at:
https://www.gridpp.ac.uk/wiki/RAL_Tier1_Experiments_Liaison_Meeting_Operations_Reports

2) CASTOR
a) This week, the SRM on all four instances will be upgraded to 2.8.2 to address a known bug and required changes in functionality.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ was absent.

SI-3 CMS weekly review & plans
-------------------------------

DC reported that they were in the middle of the Post Mortem relating to the October exercise. The exercise had received a variable response from 'it was a disaster' to 'it was a good experience' - depending on where the analysis had been run.
SI-4 LHCb weekly review & plans
--------------------------------

GP was absent.

SI-5 Production Manager's Report
---------------------------------

JC reported as follows:

1) There was a gLite BDII update last week that subsequently needed to be rolled back (possibly a complication introduced due to porting the 3.2 SL5 changes to 3.1 SL4). This was one of the first instances where a release has been rolled back, and it met with a couple of issues at some sites – for example the documentation for the required changes was not clear to all. One site opted not to roll back due to the difficulty they encountered in actually upgrading! We are looking at the implications.

2) Sheffield and then Lancaster were both affected last week by LHCb job access to their shared areas. Both cases are thought to be due to the client limitations of NFS which become manifest when a large number of jobs start simultaneously. This is an area to be kept under review – solutions would involve somehow configuring the batch systems to stagger job starts, or possibly adjusting the site NFS setups.

3) UK Tier-2s have all been offline for ATLAS during the last week while the RAL CASTOR cleanup has been ongoing.

4) The SL5 migration status was summarised last week in the deployment team as follows:

- ScotGrid: Glasgow - Yes, ECDF/Durham - No {moving 26th Oct}
- NorthGrid: Manchester - Yes, Sheffield/Liverpool - No {moving by the end of Oct}
- SouthGrid: RALPP/Oxford - Yes, Rest - No {status to be confirmed}
- LondonGrid: Imperial - No {Test Cluster Ready}, Brunel - No {testing}, QMUL - {delayed with SL5 Issue}, RHUL - delayed, UCLCentral {delayed to New Year}

A wiki tracking page is being created to track this more closely and also the DPM/dCache upgrade progress to the baseline specified by WLCG/EGEE.

SI-6 LCG Management Board Report
---------------------------------

There was no LCG last week; there is one tomorrow.
SI-7 Dissemination Report
--------------------------

Neasan O'Neill had attended e-challenges at Istanbul last week, which comprised a different audience to the usual one. The meeting had been worthwhile. NO was currently working on the 'CUE' proposal.

REVIEW OF ACTIONS
=================

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly setup environments. This was still in progress - DB noted that the next Quarterly Reports will help and possibly render the action redundant. SP asked that this remain open until the next Quarterly Reports. ONGOING.

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring. ONGOING.

358.1 SP to work with the working group on the following issues in relation to GridPP/NGS convergence: 1. identify Institutes 2. identify manpower 3. decide who is bidding for what - a draft transition plan would be made available by the end of the year; GridPP4 requirements would also be considered. SP was waiting on the Working Group to reply to her. ONGOING.

358.2 GP will talk to LHCb and see if they can progress the issue of CASTOR 2.1.8, and come back to us. We would require a strong plea from LHCb that they want this by December. DB would contact Raja Nandakumar. JG would follow this up at CERN. Done, item closed.

359.4 JC to follow up dTeam actions from the DB, as follows:
---------------------------
05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda).
---------------------------
ONGOING.

359.5 Graeme Stewart, Lee Barnby (experiment reps) each to contact Neasan O'Neill, advising where, specifically, the GridPP website should point to for each of their experiments, in terms of user support information. (DC & RN had already done this). SP to follow-up. ONGOING.
359.6 SP to ensure that Neasan O'Neill updates the GridPP website accordingly (once experiment reps have provided info as to where the GridPP website should point to for each of their experiments, in terms of user support information). ONGOING.

361.3 JC and AS to check Tier-1 and Tier-2 gstat2 results (in relation to SL5 having been discussed at the GDB). ONGOING.

ACTIONS AS AT 26.10.09
======================

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly setup environments. This was still in progress - DB noted that the next Quarterly Reports will help and possibly render the action redundant. SP asked that this remain open until the next Quarterly Reports.

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring.

358.1 SP to work with the working group on the following issues in relation to GridPP/NGS convergence: 1. identify Institutes 2. identify manpower 3. decide who is bidding for what - a draft transition plan would be made available by the end of the year; GridPP4 requirements would also be considered. SP was waiting on the Working Group to reply to her. SP reported there had been an email exchange and she had sent suggestions on how to move forward. JC had met with Andy Richards at RAL. One of the issues was uncertainty in relation to funding - SP needed more detail re resources and options for the future, also for EGEE-funded manpower at present and what we have signed up to in NGI. She was awaiting a response. DB noted that NG was trying to understand the proposals and the funding fractions. In defining GridPP4 we needed to define these posts and responsibilities. JC noted that JG had suggested going through the EGI proposal document for info.
359.4 JC to follow up dTeam actions from the DB, as follows:
---------------------------
05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda).
---------------------------
JC would check the situation with Graeme Stewart (who was currently on annual leave).

359.5 Lee Barnby (experiment rep) still to contact Neasan O'Neill, advising where, specifically, the GridPP website should point to for his experiment, in terms of user support information. (Graeme Stewart, DC & RN had already done this). SP to follow-up. ONGOING.

359.6 SP to ensure that Neasan O'Neill updates the GridPP website accordingly (once experiment reps have provided info as to where the GridPP website should point to for each of their experiments, in terms of user support information).

361.3 JC and AS to check Tier-1 and Tier-2 gstat2 results (in relation to SL5 having been discussed at the GDB). JC reported that he had checked this, but information from AS was still awaited. JC noted that there were issues with the results.

AOB
===

SP reminded the PMB that she required the Quarterly Reports immediately. 'Naming & Shaming' would happen next week.

The next PMB would take place on Monday 2nd November at 12:55 pm.