GridPP PMB Minutes 354 (27.07.09)
=================================

Present: David Britton (Chair), Robin Middleton, Dave Colling, Pete Clarke, Jeremy Coles, Roger Jones, Steve Lloyd, Sarah Pearce, John Gordon

Apologies: Tony Doyle, Andrew Sansum, Glenn Patrick, David Kelsey, Tony Cass, Neil Geddes

1. GridPP4
===========

DB reported that he had contacted Tony Medland re GridPP4. TM had replied that GridPP4 would have to go through the PPRP, although he agreed that the scientific case was already established. STFC would give some thought as to how best to use the PPRP in this situation. It was noted that the GridPP4 proposal timeline should be shorter than that for GridPP3.

2. Support for e-NMR
=====================

JC reported that he had received a request from Dave Wallom re EBI, who are part of the e-NMR project; some countries are involved with this VO in EGEE. There had been a request for support of this VO, which NGS felt would be better done by GridPP because it was the EGEE infrastructure that was required. Oxford were happy to do this; could this VO be added to the GridPP list of supported VOs? DB asked about the project partner: was money changing hands? RM advised that they had the status of a sub-contractor. DB asked whether they wanted free resources from us, or were they British scientists in the UK wanting to further science? JC advised that he would need to check with Dave Wallom. DB suggested that it was a good idea for GridPP to provide support, but he wanted to clarify the VO's UK involvement. JC advised that from the VO card it looked like they were structural biology at international European level, EC funded, with around 50 users.

ACTION 354.1 JC to get more info on e-NMR status and report back; JC also to raise this issue of GridPP support for them at dTeam.

3. EGEE Releases
=================

JC reported that the Manchester VOMS was out of date because of the handover to NGS, and NGS being fixed on the 3.0 version, which was out-of-date due to issues on the worker nodes. In addition, there were issues in the UK of gLite updates not being installed. There was no real reason for this, other than caution, and sites had been requested to update. Only RHUL could not upgrade gLite on their cluster - other sites are all upgrading ok.

DB noted that the production updates should be done routinely at sites unless told otherwise. The instruction should be to do updates by default within a reasonable time. JC noted however that this depended on the releases - the critical ones were more important. RJ advised that a UK policy was required; sites couldn't track every release, especially unimportant ones. JC noted that tracking would be needed if a policy was introduced. DB suggested that this be done along with the Quarterly Reporting. JC agreed that was possible, but it depended on the level of release, and more could be done to track these. SL advised that it could be done automatically. DB noted that the dTeam weekly meetings and quarterly reporting should catch any critical releases. JG noted that the WLCG daily operations meeting should also track these. DB noted that a policy was required re updating within a certain timeline, then escalation if necessary. SP advised that this issue had originally been a metric. JC noted he could make progress with this at dTeam.

ACTION 354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring.

JC noted that in relation to storage, Jens Jensen was taking this forward, and he would make recommendations to the PMB in due course.

4. R-GMA/APEL
==============

JG noted there had been an incident on 20th of June, and the APEL central repository had stopped receiving data from sites.
This was due to the archiver hardware being replaced and not configured optimally, so although it all tested fine, everything gradually ground to a halt after a couple of weeks. It took a few days to fix, and a longer period to catch up, but it is now all working correctly. R-GMA was not the cause of the problem; it was in the internal processing by APEL. Replacing R-GMA by ActiveMQ would not have prevented this problem.

With regard to the replacement of R-GMA for APEL, internal testing of a replacement client was ongoing, with wider tests planned for September, followed by a certification period prior to production release around the end of the year. JG reported that ActiveMQ was the same message protocol being used by the Operations Automation Team within EGEE for monitoring. DB asked JG if he would provide a brief summary to Ian Bird at the MB. JG agreed.

ACTION 354.3 JG to provide a brief summary of the R-GMA situation to Ian at the MB.

RM noted that regarding long-term support for R-GMA, GridPP funding runs to the end of this financial year, leaving things a bit tight for a handover.

5. Week's Notes
================

- Hardware: DB asked if there were any known hardware outcomes from the LHCC. No info had been forthcoming. DB asked if the numbers that ATLAS had presented had now been accepted. RJ thought so, yes. DB noted that previously the LHCb numbers had not been presented in the same way as those for ATLAS and CMS - an integrated requirement for the next period had been used, rather than peak requirement. It was understood that LHCb had been asked to re-do their figures, therefore the numbers had gone up. DB noted that he was unsure of the numbers given; they seemed incorrect and were currently being examined. DB advised that we needed to do a CPU purchase early rather than late - CPU purchase and some disk for FY09, with a 2nd tranche of disk for FY10.

- OPN backup link: DB reported that the OPN backup-link case had been submitted to STFC in June.
DB had followed this up but no response had been forthcoming as yet.

6. Breakout of Action 351.3
============================

DB advised that breakout actions were required for the OC documents, and outlined requirements on individuals as follows, with timeline:

1) Project Status: Ed: DB (16 to 20 pages)
   a) Introduction (1-page) DB
   b) LCG Status (<1-page) TC
   c) EGEE/EGI Status (1-2 pages) RM with input from NG
   d) Tier-1 Status (2-4 pages) AS
   e) UK Deployment (2-4 pages) JC with input from SL
   f) TD report (1-page) TD
   g) User reports:
      i) ATLAS (~2-page including figures) RJ
      ii) CMS (~2-page including figures) RJ
      iii) LHCb (~2-page including figures) RN/GP
   h) User Board report (1-page) GP
   j) EI, KT and dissemination summary (1-2 pages) SP.
2) ProjectMap Report (SP) including risk register.
3) Resource Report (SP)
4) Case for OPN back-up link (PC - done)
5) CASTOR progress report (AS - requested explicitly by OC - a few pages).
6) Something on EGI/NGI/NGS and future scenarios (point 35 and Action from last OC). Probably JG is the best person.

Timeline:

Aug 10th: First drafts of Project Status sections to DB. **NEXT PMB - August 10th**
Aug 17th: First complete draft of all other papers
Sep 1st: Final version of papers available
Sep 7th: F2F at Cambridge and OC papers submitted.
Sep 15th: OC at MRC in London

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS was absent.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ reported that a hammercloud test was going on at present; it was likely to be relatively quiet during August due to annual leave. Monte Carlo production will be organised - the UK continues to be tested. Some sites are not performing well - Leicester was having difficulty due to a Storm upgrade; RHUL wasn't looking good re SL's tests, but seemed ok on the dashboard; ECDF was looking bad, but it was recognised that they do not provide resources at present.
SL noted they were installing nightly kits at two sites just now to test for bugs.

SI-3 CMS weekly review & plans
-------------------------------

DC reported that things were quiet on the computing front; they had started taking data - there was a short discussion on automatic data movement from Tier-0 to Tier-1. The Tier-2s looked good last week at >80% availability. There had been a power outage at Imperial.

SI-4 LHCb weekly review & plans
--------------------------------

GP was absent.

SI-5 Production Manager's Report
---------------------------------

JC reported on some topics from deployment and operations:

1) We have a request coming via David Wallom as follows:

"We have been asked by members of the e-NMR VO (physically from EBI) if they can be supported on the NGS. They are already an EGEE blessed VO (enmr.eu) and as such we considered the best way to get this off the ground would be to approach our local (Oxford T2) to get them included on their list as a trial. Beyond that it would be great if they could become members of the supported list and hence added into other UK GridPP sites. In the longer term we will be using this as a way of making sure there is alignment between the NGS and GridPP in terms of VO and application installation and management support, as well as lower level differences such as data staging etc."

The Oxford T2 site has been approached directly and will be supporting this VO. Before adding it to the "GridPP supported VOs" list I need to check with the PMB. EBI is the European Bioinformatics Institute and a sub-contractor on the e-NMR (http://www.enmr.eu/eNMR-partners) project.

2) VOMS at Glasgow and Manchester has now been upgraded to gLite 3.1. There are still other sites/services running on gLite 3.0, but only RHUL is unable to upgrade all of its capacity to the latest (required) GFAL and lcg_utils versions since they are still running RHEL3 (which remains a supported platform for them).
The storage group is looking at the question of SE update (recommended) timelines. WLCG baselines were kept up-to-date here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions.

3) Last week most of the UK ATLAS T2 sites got involved in additional testing of their site bottlenecks by way of extended hammercloud submissions. The idea was to vary the number of allowed concurrent jobs running on the site(s) and see how the efficiency and loads were handled. Unfortunately the panda server itself had problems after a day of running, and although the exercise was useful, there is a plan to run a similar set of tests this week (from Tuesday).

4) At the UKI meeting last week there was a request for all sites to check that their accounting is up-to-date in APEL for the last quarter. APEL had problems at the end of June and many gaps still remain even though sites have (mostly) tried to republish. The accounting data is required for the T2 reports and also the hardware allocation calculations.

5) All sites are now engaged with the HEPSPEC2006 benchmarking and many have undertaken the DB requested comparison between batch log data and APEL data (to ensure consistency of scaling). The 2006 figures are being published in a wiki page: http://www.gridpp.ac.uk/wiki/HEPSPEC06.

SI-6 LCG Management Board Report
---------------------------------

It was noted that there was not much to report - a meta-summary of STEP'09 had been circulated, mainly covering issues that were already known.

SI-7 Dissemination Report
--------------------------

SP reported that Neasan O'Neill was away this week; there were a few news items pending.

REVIEW OF ACTIONS
=================

346.3 PC & AS to speak to Robin Tasker regarding receipt of high-level or ticket information from JANET on the service. AS had sent a reminder - no reply as yet. ONGOING.
348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job aborts caused by incorrectly set up environments. This was still in progress. ONGOING.

350.5 JC to check and verify that the contact list on the GOCDB is up-to-date - to be done by September. ONGOING.

351.3 ALL: drafts of all reports for the OC to be sent to DB by 10th August (project status papers) and 17th August (all other papers). ONGOING.

353.1 DB to contact Jamie & express UK interest in PP-only SSC - to re-badge effort until 2011, and not involving transfer to CERN - we would want personnel to be UK-based. This would help continue the experiment support posts in GridPP4. DB reported that he had discussed this with Jamie and it was useful to have GridPP or RAL as a partner; DB was keeping a watching brief. DONE, item closed.

ACTIONS AS AT 27.07.09
======================

346.3 PC & AS to speak to Robin Tasker regarding receipt of high-level or ticket information from JANET on the service. AS had sent a reminder - no reply as yet.

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job aborts caused by incorrectly set up environments. This was still in progress - DB noted that the next Quarterly Reports will help and possibly render the action redundant. SP asked that this remain open until the next Quarterly Reports.

350.5 JC to check and verify that the contact list on the GOCDB is up-to-date - to be done by September.

351.3 ALL: drafts of all reports for the OC to be sent to DB by 10th August (project status papers) and 17th August (all other papers).

354.1 JC to get more info on e-NMR status and report back; JC to also raise this issue of GridPP support for them at dTeam.

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring.
354.3 JG to provide a brief summary of the R-GMA situation to Ian at the MB.

354.4 DB to co-ordinate 16-20 page Project Status report for the OC and ensure it is submitted on time.

354.5 DB to write a 1-page Introduction for the OC Project Status Report.

354.6 TC to write a 1-page report on LCG Status for OC Project Status Report.

354.7 RM, with input from NG, to write a 1-2 page report on EGEE/EGI Status for the OC Project Status Report.

354.8 AS to write a 2-4 page report on Tier-1 Status, for the OC Project Status Report.

354.9 JC, with input from SL, to write a 2-4 page report on UK Deployment, for the OC Project Status Report.

354.10 TD to write a 1-page Technical Director's report, for the OC Project Status Report.

354.11 RJ to provide a 2-page User Report, to include relevant figures, on behalf of ATLAS, for the OC Project Status Report.

354.12 DC to provide a 2-page User Report, to include relevant figures, on behalf of CMS for the OC Project Status Report.

354.13 GP (& RN) to provide a 2-page User Report, to include relevant figures, on behalf of LHCb for the OC Project Status Report.

354.14 GP to provide a 1-page User Board Report, for the OC Project Status Report.

354.15 SP to provide a 1-2 page summary report on EI/KT & Dissemination, for the OC Project Status Report.

354.16 SP to provide the ProjectMap Report, to include the Risk Register, for the OC Project Status Report.

354.17 SP to provide the Resource Report, for the OC Project Status Report.

354.18 The Case for OPN back-up link, already completed by PC, to be included in the OC Project Status Report - DB to do.

354.19 AS to provide a specific progress report on CASTOR, a few pages long, as requested explicitly by the OC, for the OC Project Status Report.

354.20 JG to provide a report on EGI/NGI/NGS and future scenarios (point 35 and Action from last OC meeting), for the OC Project Status Report.

TIMELINE FOR REPORTS:

Aug 10th: First DRAFTS of Project Status sections to DB.
**NEXT PMB - August 10th**
Aug 17th: First COMPLETE draft of all other papers
Sep 1st: FINAL version of papers available
Sep 7th: F2F at Cambridge and OC papers SUBMITTED
Sep 15th: OC meeting at MRC in London

DB noted that he was out of contact from Wed 29th until Friday 31st.

The next PMB would take place on Monday, August 10th at 12:55 pm. Agreed further dates were as follows:

Aug 10th - PMB
Aug 17th - cancelled
Aug 24th - PMB
Aug 31st - cancelled (UK Bank Holiday)
Sep 2nd - special PMB to address issues re OC meeting
Sep 7th - F2F at Cambridge