GridPP PMB Minutes 366 (09.11.09)
=================================

Present: David Britton (Chair), Sarah Pearce, Andrew Sansum, Tony Doyle, Jeremy Coles, Pete Clarke, Steve Lloyd, Roger Jones, Robin Middleton, John Gordon, David Kelsey, Tony Cass, Glenn Patrick, Dave Colling, Neil Geddes (Suzanne Scott, Minutes)

Apologies: None

1. Status of Tier-1 Issues
===========================

a) EMC hardware
----------------

AS reported that testing had been carried out in the machine room. There were no reliable sockets in the UPS room: the hardware did not run reliably on them. Other tests had also been carried out, and the EMCs could not be run reliably on the UPS supply. AS advised that the electrical team were in dialogue with the UPS manufacturers, and they were trying to escalate a dialogue with the supplier. AS noted that he had a meeting with the procurement team tomorrow.

b) CASTOR data loss
--------------------

AS reported that the disaster management track was now closed; the cause of the problem had been understood and a range of mitigation strategies was in place. The longer, ongoing track would look at data retention generally.

c) Current Disk Procurement
----------------------------

AS reported that the supplier was currently testing a range of issues. They had been given a deadline to get the hardware working, failing which the equipment would be returned as unusable. Supplier engineers were onsite and a plan to resolve the situation was expected.

2. Issue about the next Hardware Procurement
=============================================

AS reported that the procurement was ongoing and a few solutions had been presented by potential suppliers. The selection of supplier was under discussion, and the PMB were asked to consider the alternatives. Agreement was reached in relation to a potential supplier.

3. GridPP4
===========

DB reported that a briefing note had been received from Tony Medland regarding the scope of GridPP4.
A proposal was expected, of largely the same scope as last time; the document had been helpful and well thought out. It was noted that funding would not be greater than GridPP3 levels - it was the annual amount that counted. The document made planning possible now, and DB asked if the PMB had any issues to raise.

NG suggested that we should be able to allow for inflation. TD advised that this had been interpreted as 1.8% in terms of the PPE Rolling Grant. DB noted that worrying about small percentages for inflation was premature; there would be much bigger uncertainties. TD noted that the biggest factor was how Estates & Indirects were incurred, i.e. computing support did not incur Estates & Indirects, whereas research did - that was the distinction. NG suggested that we should put in what we did last time. SL noted this should include fEC.

DB advised that we needed to present all posts in an Appendix listing, which should include the post itself, the cost, and why the post was needed. The JeS forms could go to SP, and this time round it needed to be clear that we could get these right from the start. JG advised that a deadline should be set for opening them. PC suggested that the cases on the JeS forms should comprise a one-line label only, which should point to the full information within the GridPP4 case, detailed in an Appendix - this would avoid duplication of documents. TD advised that the PI and Co-PI should be shown as unfunded effort. DB advised that a common policy was required, possibly with guidelines issued - this should be discussed at the F2F meeting. DB noted that a CB meeting was required in the 2nd week of December. PC noted that at Edinburgh, at College level, there was a belief that all academics bring in 60% of their time - and there was strong pressure that academics justify themselves.

In relation to timeline, DB noted that the April 14/15 meeting clashed with GridPP24.
We could not move the GridPP24 date, and the PPRP would require most of the PMB to attend the PPRP presentation. RJ asked where this was taking place. TD noted we could perhaps influence the location. DB would check and see if it might be possible for the PPRP meeting to be held in London.

ACTION 366.1 DB to check and see if it might be possible for the 14/15 April PPRP meeting to be held in London.

DB noted that the final submission of the GridPP4 proposal was March 4th. The review by the OC on 4th February could be helpful, but response time would be short, therefore the period ahead was challenging. Having a draft available by 28th January was challenging. Timeline was as follows:

April 14/15: PPRP presentation - clashes exactly with GridPP24 - to which we financially committed at the start of this week.
March 4th: Final submission.
February 4th: Review of first draft by OC.
January 28th: Draft-1 proposal submission.
January 14th overnight, meeting on 15th: early start F2F - to review Draft-0.
December 10th: F2F at Imperial - full outline of proposal and final assignment of all sections to authors.
November 9th: First weekly PMB meeting to kick off proposal and initiate immediate actions where possible.

DB asked whether a F2F in early January would be possible. NG noted that this would depend on the Agenda, and whether the proposal could be worked on during the day. JG noted this would be possible if people stayed overnight beforehand - papers could be worked on first thing in the morning and people could be split into teams p.m. DB proposed 14-15 January for the PMB meeting. It was agreed to meet at Glasgow; the meeting would take place on 15th, with travelling expected on 14th, and an early start for the meeting was to be assumed.

It was noted that the F2F at Imperial on 10th December would be critical for assigning roles & tasks in the intervening period. A Financial Model would be required then - significant changes in costs would need to be checked.
SP noted that she assumed posts would be costed. TD noted that SL had information re costings but that we would not receive final information until the JeS forms were submitted. DB advised that hardware costs were required; he had circulated an email with proposed actions, as follows:

ACTION 366.2 RJ to provide ATLAS HW requirements for 2011-15

ACTION 366.3 DC to provide CMS HW requirements for 2011-15

ACTION 366.4 GP to provide LHCb HW requirements for 2011-15

It was understood that the experiment numbers needed to be consistent - collaboration was required between the experiment reps. Issues relating to 'other' experiments should go via the User Board - GP could check this. SL asked whether, to estimate the fraction, the guideline in relation to non-LHC hardware would be ~10%. And what about upgrades? TD asked whether these needed to be factored in to LHC experiment requests. DB noted that the upgrades were separate requests from the running experiment requirements.

ACTION 366.5 SL/DB to estimate what fraction of STFC funding goes to non-LHC groups. What about the theory side?

ACTION 366.6 GP to invite input from Other Experiments.

There was a discussion re tape issues. Further actions were as follows:

ACTION 366.7 DB (in consultation with AS) to provide HW-cost estimates for 2011 - 2015.

ACTION 366.8 AS to confirm that the Tier-1 proposes to use Tape-based storage in the period 2011 - 2015.

ACTION 366.9 RJ to confirm that ATLAS supports the use of Tape storage in the period 2011-2015.

ACTION 366.10 DC to confirm that CMS supports the use of Tape storage in the period 2011-2015.

ACTION 366.11 GP to confirm that LHCb supports the use of Tape storage in the period 2011-2015.

ACTION 366.12 SP to liaise with AS to establish non-capacity costs.

ACTION 366.13 SP to request and collect first cost estimates of posts for GridPP4. FEC and non-FEC posts need to be costed.
The Tier-1 posts should be costed as accurately as possible as soon as possible since there is a large lever arm here.

ACTION 366.14 DK to provide first estimate of average RAL post cost on the basis of the current distribution of posts/grades. Clearly this will need refinement once we understand the final mix better.

It was discussed that everyone involved should open a JeS form with one RA in order to consider costings. It was noted that project management and the risk schedule needed to be revised. NG asked if a skeleton document was being produced. DB confirmed that he had started with the GridPP3 proposal and would continue to revise this - he would provide a skeleton document to work from.

DB noted in addition that we needed a plan to cover dissemination/outreach/KE & EI. We also needed to consider how GridPP linked with EGI/NGI. A model for the Tier-2s was required. GP asked about experiment posts and priorities: was there some European funding? See item 6 of the GridPP4 brief - boundaries needed to be clearly defined. TD noted that SL had an integrated sum for small Tier-3 farms, and this would be useful for DB to know - the model would be constrained by that. SL noted there was a common category. NG advised that we could justify generic Tier-4 posts in item 6 of the GridPP4 brief. DB advised that he would document these issues for the next PMB. We would need clear information from the experiments.

DB asked if there was anything else to include in relation to GridPP4. DC asked whether, regarding the nature of the Tier-2, it would be possible to have a smaller meeting, sooner. TD advised that we needed CB input - was the existing Regional structure supportable? DC noted we could only take options to the CB.

DB summarised the meeting, saying that we have to get this project started this week - there were over a dozen actions generated and issues would be raised again at next week's PMB.
There was no time left this week to consider either Standing Items or actions outstanding. Any reports sent by email are noted below:

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS provided the following report:

Fabric:

1) Lot 2 of disk servers has failed acceptance. We have escalated with the supplier and have set a deadline of 3 weeks by which time we require a working configuration. The equipment is very unlikely to be production ready in time for January.

2) New procurements have started.
- Disk ITT has closed and evaluation is complete. Expect to inform suppliers of outcome this week.
- CPU tender is closed and evaluation is underway. Indication is that evaluation will be straightforward.

3) We have received 9 T10KB tape drives and will begin testing the new hardware shortly (in contention with work on the faulty CASTOR RAID arrays).

4) Procurement is underway for an additional 4*1Gb/s second OPN link to CERN as resilient backup. Order has not yet been placed (Shared service centre preparing a single company tender case).

5) Work continues in order to understand the underlying cause of the CASTOR/FC+FTS RAID array hardware problems. Considerable testing took place last week but we have not yet identified the underlying cause.

6) On Tuesday 10:00 we lost inbound transfers from our Tier-2 sites. This was traced (on Wednesday) to a rule change introduced by the network team onto our lightpath router. This highlighted the need for better monitoring of the quality of Tier-2 transfers. We would have detected this fault sooner on the CERN link.

7) There were problems with recently written tapes having corrupted double eofs. The consequence of this is that tape media corrupted in this manner is unwritable and causes drives to offline and migration to halt. Sun are investigating. The problem tapes are now marked 'read only' and there is no indication that any data has been damaged.
Problem may have commenced with the recent microcode update in October. Since changes introduced by Sun engineers last week, recently written tapes don't show the problem. Sun are still on-site investigating.

Service:

1) SAM availability for the OPS VO was 100%.

Weekly production report is at:
https://www.gridpp.ac.uk/wiki/RAL_Tier1_Experiments_Liaison_Meeting_Operations_Reports

SI-5 Production Manager's Report
--------------------------------

JC provided the following report:

1) Another kernel vulnerability has been identified (a NULL pointer dereference vulnerability, CVE-2009-3547, was found in the Linux kernel) and this required sites to patch urgently. Following feedback and recommendations after the last such announcement, we have now created a closed list for discussion. The security pages on the GridPP website have been updated this week with revised incident response material (it was in a draft area for several weeks waiting for community comment - of which there was none!).

2) We have been unable to progress with deployment of glexec as requested by the WLCG MB as the SL5 WN with glexec has not been released to production.

3) The CREAM CEs at RAL and Glasgow were moved to "production" last week (from the previous status of "special"). No issues have been reported and ATLAS condorg testing against Glasgow appears to be going well.

4) The October league table from EGEE is available: https://edms.cern.ch/document/963325/ . UKI achieved 91% and 90% for reliability and availability respectively (this compares with 91% and 89% for September). Figures are noticeably down compared to recent performance for RAL-Tier-1 (various downtime associated with database issues), QMUL (hardware and software issues; NFS server down; unscheduled Lustre Filesystem upgrade) and Birmingham (site BDII problems).
ACTIONS AS AT 09.11.09
======================

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly set up environments. This was still in progress - DB noted that the next Quarterly Reports will help and possibly render the action redundant. SP asked that this remain open until the next Quarterly Reports.

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring.

358.1 SP to work with the working group on the following issues in relation to GridPP/NGS convergence:
1. identify Institutes
2. identify manpower
3. decide who is bidding for what
A draft transition plan would be made available by the end of the year; GridPP4 requirements would also be considered. SP was waiting on the Working Group to reply to her. SP reported there had been an email exchange and she had sent suggestions on how to move forward. JC had met with Andy Richards at RAL. One of the issues was uncertainty in relation to funding - SP needed more detail re resources and options for the future, also for EGEE-funded manpower at present and what we have signed up to in NGI. She was awaiting a response. DB noted that NG was trying to understand the proposals and the funding fractions. In defining GridPP4 we needed to define these posts and responsibilities. JC noted that JG had suggested going through the EGI proposal document for info.

359.4 JC to follow up dTeam actions from the DB, as follows:
---------------------------
05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda).
---------------------------
JC would check the situation with Graeme Stewart (who was currently on annual leave).
359.5 Lee Barnby (experiment rep) still to contact Neasan O'Neill, advising where, specifically, the GridPP website should point to for his experiment, in terms of user support information. (Graeme Stewart, DC & RN had already done this). SP to follow-up. ONGOING.

359.6 SP to ensure that Neasan O'Neill updates the GridPP website accordingly (once experiment reps have provided info as to where the GridPP website should point to for each of their experiments, in terms of user support information).

361.3 JC and AS to check Tier-1 and Tier-2 gstat2 results (in relation to SL5 having been discussed at the GDB). JC reported that he had checked this, but information from AS was still awaited. JC noted that there were issues with the results.

366.1 DB to check and see if it might be possible for the 14/15 April PPRP meeting to be held in London.

366.2 RJ to provide ATLAS HW requirements for 2011-15

366.3 DC to provide CMS HW requirements for 2011-15

366.4 GP to provide LHCb HW requirements for 2011-15

366.5 SL/DB to estimate what fraction of STFC funding goes to non-LHC groups. What about the theory side?

366.6 GP to invite input from Other Experiments.

366.7 DB (in consultation with AS) to provide HW-cost estimates for 2011 - 2015.

366.8 AS to confirm that the Tier-1 proposes to use Tape-based storage in the period 2011 - 2015.

366.9 RJ to confirm that ATLAS supports the use of Tape storage in the period 2011-2015.

366.10 DC to confirm that CMS supports the use of Tape storage in the period 2011-2015.

366.11 GP to confirm that LHCb supports the use of Tape storage in the period 2011-2015.

366.12 SP to liaise with AS to establish non-capacity costs.

366.13 SP to request and collect first cost estimates of posts for GridPP4. FEC and non-FEC posts need to be costed. The Tier-1 posts should be costed as accurately as possible as soon as possible since there is a large lever arm here.
366.14 DK to provide first estimate of average RAL post cost on the basis of the current distribution of posts/grades. Clearly this will need refinement once we understand the final mix better.

The next PMB will take place on Monday 16th November at 12:55 pm.