GridPP PMB Minutes 360 (28.09.09)
=================================

Present: David Britton (Chair), Sarah Pearce, John Gordon, Andrew Sansum, Tony Cass, Dave Colling, Tony Doyle, Steve Lloyd, Jeremy Coles, Robin Middleton, David Kelsey, Pete Clarke

Apologies: Roger Jones, Glenn Patrick, Neil Geddes

1. Security Patching
======================

There has been increased pressure from EGEE for sites to patch a kernel vulnerability first noted on 13th August. A list containing the patching status of each GridPP site was circulated to the PMB earlier today (along with a history of the vulnerability and reaction from Dave Kelsey). A number of issues were discussed at length; however, it was agreed that the UK was not in bad shape. There were some concerns about false positives (some sites seen by EGEE as not patched), and about whether we are responding adequately at all levels to such security issues. AS was concerned about the lack of information coming through the operations side to indicate that this had become serious enough for the EGEE PMB to issue such a strong statement. It was agreed that:

ACTION: RM to draft response to EGEE PMB on security patch status and circulate.

ACTION: JC to follow up on individual sites via the dTeam to ensure the Sep 30th deadline for the security patch was met where feasible, and bring the issue back to the PMB next week.

ACTION: JC to let sites know how to check their security status themselves.

Finally, it was agreed that DK should have a general action to raise any significant on-going security issues to the PMB so that we might discuss them a little earlier in the process in future.

2. Hardware
=============

DB noted that UK hardware pledges had been drafted and sent to Tony Medland on Friday. An acknowledgement but no comments had been received. The plan was to send them on to WLCG after this meeting.
DB noted that the Tier-2 pledges remained the same as last year on the basis that we had easily met the 2009 numbers and already had sufficient resources to meet the April 2010 levels, except for disk. The latter was being installed at various places so, overall, it seemed easiest to leave things unchanged. Beyond 2010, we would need to re-calculate the Tier-2 shares from the revised global requirements and cross-check that the funding would allow these targets to be met. On the Tier-1 side, DB, AS and SP had iterated the highly constrained details and had agreed a consistent plan that met the various constraints.

ACTION: DB to send 2009/10 pledges to WLCG.

3. SSCs
========

DB and DC reported on the discussions about the ROSCOE Specialized Support Centre (SSC) that took place at EGEE09 in Barcelona last week. Things continue to evolve (though perhaps not progress) rapidly. The feeling is that ROSCOE can expect 3M Euro (10 FTE) and each partner was asked about their minimum threshold. For the UK we agree that this is currently 1 FTE (0.5 FTE x 2, to be matched) to work on ATLAS and LHCb-centric GANGA. There was a discussion on whether to lengthen ROSCOE to 4 years to match the assumed change to EGI (due to FP7 funding issues), but the decision was NO. Things finally have to converge this week with a final draft by Thursday, after which there will only be minor changes.

SP raised the CUE SSC that deals with dissemination, training, etc. There is QMUL involvement here with Neasan leading NA2 (dissemination); involvement from Imperial in NA3 (realtime monitor); and Edinburgh lead two activities. SP asked whether any UK sites wanted to be listed as Key Infrastructure sites. DC remarked that Imperial had volunteered: there was no effort for this but it was a lightweight responsibility to do with enabling resources for training for specific events.
DB noted that Glasgow could probably also volunteer if it was viewed as helpful, but we would need to understand the implications of being an unfunded partner.

ACTION: SP to report on the implications of being an unfunded partner in CUE and whether this would require quarterly reporting or other bureaucratic nightmares.

4. Sponsorship
===============

Viglen had been in touch to ask about sponsorship with Boston for Ambleside. This was discussed in the context of RHUL vs Ambleside and whether anyone had any conflict of interest. TD noted that getting sponsorship for Ambleside had proved impossible last time, so we should take advantage of the offer.

ACTION: DB to check with RJ.

5. Weekly Notes
================

DB congratulated Stuart Purdie for winning the best poster award at EGEE09.
http://gridtalk-project.blogspot.com/2009/09/best-poster-and-best-demo-competition.html
DB suggested a news item that would also serve to publicise this work to other sites. TD remarked that it was a light-weight installation that may well help people move from qSub to the Grid.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS reported as follows:

Fabric
======

1) Cooling. We are still tracking the cooling issues experienced in August in the Disaster Management system, but have had no further problems and are waiting until investigations have reached a conclusion.

2) Water leak. We are still tracking this problem in the Disaster Management system, but a recurrence is unlikely owing to temporary measures in place. Meetings have recently taken place to assess what corrective work needs to be carried out.

3) Lot 2 of the disk servers has failed acceptance. We are working with the supplier to identify the cause. Multiple avenues are being followed. We estimate 50% likely to be available by Christmas (next assessment Thursday). Separate update to PMB with more info.

4) New procurements have started.
- Disk ITT has closed and evaluation will commence this week. Delivery targets: December and April.
- CPU PQQ has closed and is being evaluated. Delivery target: February.

5) We are planning an upgrade of CMS to T10KB drives and are in the process of obtaining a quotation for new drives.

6) The UPS system test was completed successfully.

7) The link between the Tier-1 data servers and the WAN has been upgraded to 20Gb/s (from 10Gb/s), matching the combined SJ5+OPN bandwidth. This resolves a bottleneck identified in STEP09.

8) Procurement is underway for an additional 4*1Gb/s second OPN link to CERN as resilient backup.

Staffing
========

1) We expect the second experiment support post to start on October 5th.

Service
=======

1) SAM availability for the OPS VO was 100%. Weekly production report is at:
https://www.gridpp.ac.uk/wiki/RAL_Tier1_Experiments_Liaison_Meeting_Operations_Reports

2) CASTOR

a) Once again we had a problem with one of the CASTOR RAID arrays (despite engineers apparently correcting the cause of the previous incident). Patches applied to ORACLE ensured the service continued to operate, preventing a recurrence of the previous incident where all instances went down.

b) 15th September. CASTOR (all instances) was taken down for a nameserver upgrade (to 2.1.8). When restarted, disk-to-disk copies were failing. This led to an extended outage until 17th September 10:00 (approx). Further investigation suggests the problem was not related to misconfigured LSF scripts, but was related to the way LSF was restarted during the scheduled downtime.
http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20090915

c) The CASTOR information provider (CIP) will be upgraded on Tuesday 29th Sept.

3) ATLAS 3D service has been migrated to new hardware. LHCb 3D service migration has yet to be completed.

4) Problems were encountered with the PBS server hosting the SL5 batch system. This was traced to incompatibility between client and server versions.
We suffered a 50% reduction in SL5 capacity for 1 day until an upgrade to the PBS server resolved the problem.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ was absent.

SI-3 CMS weekly review & plans
-------------------------------

DC gave a brief verbal report - a few minor issues.

SI-4 LHCb weekly review & plans
--------------------------------

GP was absent.

SI-5 Production Manager's Report
---------------------------------

1) There has been increased pressure from EGEE for sites to patch a kernel vulnerability first noted on 13th August. A list containing the patching status of each GridPP site was circulated to the PMB earlier today (along with a history of the vulnerability and reaction from Dave Kelsey).

2) We (HEPSPEC06 accounting sub-group: SL, JC & Tier-2 coordinators) have not yet converged on a way to recalibrate site accounting data, but plan to hold a meeting at the end of the week. Most sites have now benchmarked and we are hoping that all will have figures by the end of the week.

3) ATLAS hammercloud tests have continued over the last few weeks with sites trying various changes between each run. Some useful observations (and a summary of the process) so far were presented to a WLCG audience at EGEE09 last week: http://tinyurl.com/ybhf5rk. Results from the tests last week (possibly also in Roger's report today!):

Glasgow: 393M, 98.4% success, 12.6Hz
QMUL: 244M, 90.2%, 8.1Hz
Liverpool: 176M, 97.9%, 7.1Hz
RALPP: 105M, 94%, 7.5Hz

At the sites which did less work there are some good results, like Birmingham, but many poor ones: RHUL, Manchester, Oxford and Sheffield are running failure rates > 10% (Manchester 70% failures!). Apparently an ATLAS-wide challenge will run 21st-23rd October.
4) EGEE09 took place last week (http://indico.cern.ch/conferenceTimeTable.py?confId=55893) and the most relevant set of talks for deployment and operations came from the SA1 session (http://indico.cern.ch/conferenceDisplay.py?confId=67238). The quick summary:

- The PPS/roll-out area looks set to change again soon. We need to increase our participation (http://indico.cern.ch/conferenceDisplay.py?confId=67238).
- Many talks looked at the state of regionalization of tools and these are generally missing their EGEE3 milestone dates (regional portal, GOCDB, Nagios integration...).
- Signing up to the SLA will become part of joining the grid rather than something sites/T2s do separately.
- There has been some rationalisation of downtime announcement periods (the rules by which downtime is classed as scheduled/unscheduled).
- TPM will now remain as it is until the end of the year, when the EGI roles are clearer.
- Some quality metrics are being collected on the performance of the regional operations teams (UKI appears to perform much the same as the other ROCs).

5) The August WLCG availability/reliability report shows these figures from ops tests (http://tinyurl.com/ybks74p):

[reliability & availability]
LondonGrid - 79% & 74%
NorthGrid - 98% & 97%
ScotGrid - 95% & 93%
SouthGrid - 98% & 95%

London figures were dominated by failures at RHUL, which had downtime caused by power failures and central IT maintenance. While these would lead to a reduction in the figures, the 0% for both suggests a testing problem; this is being followed up.

SI-6 LCG Management Board Report
---------------------------------

No meeting last week.

SI-7 Dissemination Report
--------------------------

EGEE stand went well; Neasan was preparing a news item.

REVIEW OF ACTIONS
=================

350.5 JC to check and verify that the contact list on the GOCDB is up-to-date - to be done by September. DONE - NGS sites are not but AR has been advised.
354.1 JC to get more info on e-NMR status and report back; JC to also raise the issue of GridPP support for them at dTeam. DONE - JC talked to them at EGEE09 and was happy with what they were doing.

359.2 In the context of hardware pledges and figures, JG to email Tony Medland and give him a heads-up that figures were coming. DONE - at least, DB had done this without seeing the action.

359.3 SL to convene an Accounting & Benchmarking sub-group, comprising SL, JC, and the four Tier-2 Co-ordinators, to meet on 2nd October to discuss the figures and follow the action plan as outlined below: DONE

The meeting closed at 2.55 pm. The next PMB would be held on Monday 5th October 2009 at 12:55 pm (Apologies in advance from SL).