GridPP PMB Minutes 359 (21.09.09)
=================================

Present: Sarah Pearce (Chair), John Gordon, Andrew Sansum, Tony Cass, Dave Colling, Tony Doyle, Steve Lloyd

Apologies: Jeremy Coles, Robin Middleton, David Kelsey, David Britton, Roger Jones, Pete Clarke, Glenn Patrick, Neil Geddes

1. Report from the OC Meeting
==============================

DB had circulated a summary email. The meeting had gone well and the committee appeared helpful. TD reported that the OC had highlighted some challenges ahead; performance in STEP had been exemplary; the OPN document needed to be of a better standard, although the link itself was approved; experiment reps would attend the next meeting; a doodle link has been sent round by the OC re the next 2 meetings in Jan & June 2010.

TD reported that several issues had been raised for the future re sustainability, funding, and staffing status. JG would circulate the EGEE document on cloud computing for further discussion. TC noted that this had also been discussed at CHEP and would provide the relevant urls.

ACTION 359.1 In the context of GridPP sustainability as highlighted by the OC, JG to circulate the EGEE document on cloud computing for further discussion. TC to provide the relevant urls from the CHEP talks.

TD noted that all of the info could be collated and provided as a review document. There was a discussion re wLCG, funding, EGI and SSCs. The funding situation might be clearer by Jan 2010. Regarding staff, would sysadmins be re-classed as technicians or RAs in relation to fEC? TD advised that we should be collating papers produced during this period in order for anyone to be classified as a 'Researcher' in GridPP4. This fact should be highlighted to staff and PIs. SP noted that she would need to discuss the GridPP4 proposal with DB.

2. Hardware Pledges
====================

It was noted that these need to be finalised by the end of this week. SP confirmed she had received email information.
AS advised that he needed to decide how much tape we need to deliver to experiments - AS was awaiting a reply from DB re targets. SL asked why we couldn't work from the table and fractions. AS would check. JG suggested that we notify Tony Medland that information is coming. AS advised he could deal with the figures tomorrow. JG to email TM in the first instance, and give him a heads-up that figures were coming.

ACTION 359.2 In the context of hardware pledges and figures, JG to email Tony Medland and give him a heads-up that figures were coming.

SL advised that there were two approaches: the first was that we could take the global requirements and multiply by the UK fraction, and say that we will provide this, then price the hardware accordingly. TD noted that there was, in addition, the 20% we hold back. SL noted that the second approach could be to work out how much kit we can buy at DB's figure, then pledge 80% of that. Which scenario was better? SL noted that he had carried out the first approach - we have plenty of CPU to meet requirements as they stand, but we could not meet the April 2010 disk requirement. The figures for disk requirement were higher than previously pledged. SL advised that we were 600TB short at present. SL noted that we could confirm the old numbers and say nothing had changed. TD agreed. However, longer term projections were required. SP suggested that we try and have the numbers by Wed a.m., including advice from DB.

3. Actions from the Deployment Board meeting
=============================================

SL reported that he had looked through the Minutes and there was nothing obvious re a PMB action.
It was agreed however to add two actions to the PMB listing:

ACTION 359.3 SL to convene an Accounting & Benchmarking sub-group, comprising SL, JC, and the four x Tier-2 Co-ordinators, to meet on 2nd October to discuss the figures and follow the action plan as outlined below:

---------------------------
This action related to Action 5.3 from the Deployment Board:

05.03 The Accounting & Benchmarking sub-group, comprising SL, JC, and the four x Tier-2 Co-ordinators, would meet on 2nd October to discuss the figures and follow the action plan as outlined above (see points 1-7 reproduced below, with the note following).
---------------------------
DC suggested that setting up a sub-group to handle this specific issue would be useful. DB agreed, proposing the four tier-2 co-ordinators plus a couple of senior people (including SL) to moderate - there should be a proposal to go to the Deployment Board. DC proposed the following course of action:

1. specint them all
2. calculate what we can
3. adjust the ones we can't
4. compare the adjustment with those who haven't done it properly
5. if within 10% then ok
6. set-up a sub-group comprising JC, SL and the four Tier-2 Co-ordinators
7. agree timescale

Figure was £400k this financial year, from STFC. One month only could be allowed for convergence, as time was short - proposed date was Friday 16 October. SL advised that it should not advantage sites who can't do it. Decisions should be referred to the PMB. This was all agreed.
---------------------------

ACTION 359.4 JC to follow up dTeam actions from the DB, as follows:

---------------------------
05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda).

05.04 dTeam to publicise the 1st October as the changeover date to HEPSPEC06.
---------------------------

4. PPAP Roadmap
================

TD asked that the extract from the PPAP Roadmap, prominent on p3, be reproduced here:

---------------------------
Exploitation of the facilities listed in Table 1 relies upon access to large-scale computing resources. The UK has been instrumental in developing the Grid computing infrastructure which underpins global Particle Physics, and through the GridPP project is a major contributor in particular to the worldwide LHC Computing Grid. GridPP enables UK researchers to take the lead in LHC physics analyses and its continued support is absolutely vital for the field.
---------------------------

This was a useful paragraph, and people should be informed that it was incorporated.

5. Week's Notes
================

SP highlighted the SSC discussion due to take place this Friday at EGEE'09.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS reported as follows:

Fabric:

1) Cooling. We are still tracking the cooling issues experienced in August in the Disaster Management system, but have had no further problems and are waiting until investigations have reached a conclusion.

2) Water leak. We are still tracking this problem in the Disaster Management system, but a recurrence is unlikely owing to temporary measures in place.

3) Lot 1 of disk servers are available for deployment.

4) Lot 2 of disk servers have failed acceptance. We are working with the supplier to identify the cause. Multiple avenues are being followed. We estimate 50% likely to be available by Christmas, however news received today suggests we may revise our estimate upwards shortly. Various tests are expected to complete this week and next week.

6) New procurements have started.
   - Disk ITT has closed and evaluation will commence this week. Delivery target, December and April.
   - CPU PQQ has closed and is being evaluated. Delivery target February.

7) We are planning an upgrade of CMS to T10KB drives and have a target date of December for this, assuming financial approval for the tape drives is agreed this week.

8) The UPS system will be tested on Tuesday 22nd 08:00-10:00. There is some risk to the Tier-1 from power failure, particularly when the switch back to the mains supply occurs.

Staffing:

1) We expect the second experiment support post to start on October 5th.

Service:

1) SAM availability for the OPS VO was 71%. Downtime mainly due to problems restarting CASTOR following the nameserver upgrade to 2.1.8. Weekly lost time report is at:
   https://www.gridpp.ac.uk/wiki/RAL_Tier1_Experiments_Liaison_Meeting_Operations_Reports

2) CASTOR - Two major downtimes:

   * 10th September. One of the two CASTOR database RAID arrays went offline (cause was subsequently traced to an error in the cabling to its UPS supply). ORACLE should have carried on using the second array but in fact went down, bringing all CASTOR instances down. Service was down for about 5 hours. The ORACLE problem was traced to a known bug and ORACLE have subsequently provided a patch which will be applied today at 12:00.

   * 15th September. CASTOR (all instances) taken down for nameserver upgrade (to 2.1.8). When restarted, disk-to-disk copies were failing. This led to an extended outage until 17th September 10:00 (approx). Cause was eventually found to be misconfigured LSF startup scripts (almost certainly unrelated to the upgrade). We do not know how the misconfiguration occurred nor why it only manifested during the scheduled upgrade. Draft post mortem containing the incident details is at:
     http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20090915

   - All SRMs now upgraded to 2.8
   - CASTOR nameserver upgrade to 2.1.8

3) The FTS and LFC have been moved to a new resilient ORACLE RAC and the ATLAS LFC has been separated from the general LFC.

4) ATLAS 3D service has been migrated to new hardware. LHCB 3D service migration has yet to be completed.

5) SL5 migration has been completed. Approximately 75% of capacity has been migrated. The remainder continues to run SL4.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ was absent.

SI-3 CMS weekly review & plans
-------------------------------

DC was absent.

SI-4 LHCb weekly review & plans
--------------------------------

GP was absent.

SI-5 Production Manager's Report
---------------------------------

JC was absent.

SI-6 LCG Management Board Report
---------------------------------

JG noted nothing to report. He would check today and circulate info.

SI-7 Dissemination Report
--------------------------

SP reported all were presently at EGEE'09.

REVIEW OF ACTIONS
=================

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly setup environments. This was still in progress - DB noted that the next Quarterly Reports will help and possibly render the action redundant. SP asked that this remain open until the next Quarterly Reports. ONGOING.

350.5 JC to check and verify that the contact list on the GOCDB is up-to-date - to be done by September. ONGOING.

354.1 JC to get more info on e-NMR status and report-back; JC to also raise this issue of GridPP support for them at dTeam. ONGOING.

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring. ONGOING.

354.4 DB to co-ordinate 16-20 page Project Status report for the OC and ensure it is submitted on time. DONE, ITEM CLOSED.

354.5 DB to write a 1-page Introduction for the OC Project Status Report. DONE, ITEM CLOSED.

354.6 TC to write a 1-page report on LCG Status for OC Project Status Report. DONE, ITEM CLOSED.

354.7 RM, with input from NG, to write a 1-2 page report on EGEE/EGI Status for the OC Project Status Report. DONE, ITEM CLOSED.

354.9 JC, with input from SL, to write a 2-4 page report on UK Deployment, for the OC Project Status Report. DONE, ITEM CLOSED.

354.10 TD to write a 1-page Technical Director's report, for the OC Project Status Report. DONE, ITEM CLOSED.

354.11 RJ to provide a 2-page User Report, to include relevant figures, on behalf of ATLAS, for the OC Project Status Report. DONE, ITEM CLOSED.

354.12 DC to provide a 2-page User Report, to include relevant figures, on behalf of CMS for the OC Project Status Report. DONE, ITEM CLOSED.

354.13 GP (& RN) to provide a 2-page User Report, to include relevant figures, on behalf of LHCb for the OC Project Status Report. DONE, ITEM CLOSED.

354.14 GP to provide a 1-page User Board Report, for the OC Project Status Report. DONE, ITEM CLOSED.

354.15 SP to provide a 1-2 page summary report on EI/KT & Dissemination, for the OC Project Status Report. DONE, ITEM CLOSED.

354.16 SP to provide the ProjectMap Report, to include the Risk Register, for the OC Project Status Report. DONE, ITEM CLOSED.

354.17 SP to provide the Resource Report, for the OC Project Status Report. DONE, ITEM CLOSED.

354.20 JG to provide a report on EGI/NGI/NGS and future scenarios (point 35 and Action from last OC meeting), for the OC Project Status Report. DONE, ITEM CLOSED.

355.4 JG to do a draft Agenda for the e-science review visit. JG reported that this was now done but not circulated. He will circulate today and ask for feedback. ONGOING.

356.1 JG to deal with EGI issues for EGI section of the OC document. DONE, ITEM CLOSED.

356.2 In the context of the e-Science Review document, re the STEP'09 note and draft distribution rates - was it possible to put these numbers into perspective?
RJ to provide DB with targets/rates context for STEP'09 and draft distribution rates; RJ to provide appropriate wording on figures meeting the requirements for Tier-1 running, eg: 'these figures exceed the requirements of the Tier-1 for initial running' - or something similar; RJ to provide DB with info on Tier-2 numbers, ie: how many Tier-2s were there. ONGOING.

356.3 In the context of a discussion on HEPSPEC06 benchmarking, there were issues of having enough data, and the different way used to calculate hours, also the comparison between HEPSPEC numbers compared with prior SPECINT values. DB to discuss the issue of HEPSPEC06 benchmarking with SL and JC offline, and raise an appropriate action following discussion. ONGOING.

356.4 A new individual document on the case for the OPN back-up link to be prepared for the OC by DB and PC, addressing all issues required. DONE, ITEM CLOSED.

357.1 ALL: to read the Project Status report and send amendments to DB by Friday. DONE, ITEM CLOSED.

358.1 SP to work with the working group on the following issues in relation to GridPP/NGS convergence: 1. identify Institutes 2. identify manpower 3. decide who is bidding for what - a draft transition plan would be made available by the end of the year; GridPP4 requirements would also be considered. ONGOING.

358.2 GP will talk to LHCb and see if they can progress the issue of CASTOR 2.1.8, and come back to us. We would require a strong plea from LHCb that they want this by December. ONGOING.

358.3 AS will review 2.1.8 statements in the CASTOR paper in light of the LHCb situation, and make them less definitive. DONE, ITEM CLOSED.

358.4 ALL: PMB members to think on the issue of Tier-2 accounting over the next day or so, and decide by the end of the week on a concrete proposal to let the experiments decide on the 0s and 1s. DONE, ITEM CLOSED.

ACTIONS AS AT 21.09.09
======================

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly setup environments. This was still in progress - DB noted that the next Quarterly Reports will help and possibly render the action redundant. SP asked that this remain open until the next Quarterly Reports.

350.5 JC to check and verify that the contact list on the GOCDB is up-to-date - to be done by September.

354.1 JC to get more info on e-NMR status and report-back; JC to also raise this issue of GridPP support for them at dTeam.

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring.

355.4 JG to do a draft Agenda for the e-science review visit.

356.2 In the context of the e-Science Review document, re the STEP'09 note and draft distribution rates - was it possible to put these numbers into perspective? RJ to provide DB with targets/rates context for STEP'09 and draft distribution rates; RJ to provide appropriate wording on figures meeting the requirements for Tier-1 running, eg: 'these figures exceed the requirements of the Tier-1 for initial running' - or something similar; RJ to provide DB with info on Tier-2 numbers, ie: how many Tier-2s were there.

356.3 In the context of a discussion on HEPSPEC06 benchmarking, there were issues of having enough data, and the different way used to calculate hours, also the comparison between HEPSPEC numbers compared with prior SPECINT values. DB to discuss the issue of HEPSPEC06 benchmarking with SL and JC offline, and raise an appropriate action following discussion.

358.1 SP to work with the working group on the following issues in relation to GridPP/NGS convergence: 1. identify Institutes 2. identify manpower 3. decide who is bidding for what - a draft transition plan would be made available by the end of the year; GridPP4 requirements would also be considered.

358.2 GP will talk to LHCb and see if they can progress the issue of CASTOR 2.1.8, and come back to us. We would require a strong plea from LHCb that they want this by December.

359.1 In the context of GridPP sustainability as highlighted by the OC, JG to circulate the EGEE document on cloud computing for further discussion. TC to provide the relevant urls from the CHEP talks.

359.2 In the context of hardware pledges and figures, JG to email Tony Medland and give him a heads-up that figures were coming.

359.3 SL to convene an Accounting & Benchmarking sub-group, comprising SL, JC, and the four x Tier-2 Co-ordinators, to meet on 2nd October to discuss the figures and follow the action plan as outlined below:

---------------------------
From the Deployment Board:

05.03 The Accounting & Benchmarking sub-group, comprising SL, JC, and the four x Tier-2 Co-ordinators, would meet on 2nd October to discuss the figures and follow the action plan as outlined above (see points 1-7 reproduced below, with the note following).
---------------------------
DC suggested that setting up a sub-group to handle this specific issue would be useful. DB agreed, proposing the four tier-2 co-ordinators plus a couple of senior people (including SL) to moderate - there should be a proposal to go to the Deployment Board. DC proposed the following course of action:

1. specint them all
2. calculate what we can
3. adjust the ones we can't
4. compare the adjustment with those who haven't done it properly
5. if within 10% then ok
6. set-up a sub-group comprising JC, SL and the four Tier-2 Co-ordinators
7. agree timescale

Figure was £400k this financial year, from STFC. One month only could be allowed for convergence, as time was short - proposed date was Friday 16 October. SL advised that it should not advantage sites who can't do it.
Decisions should be referred to the PMB. This was all agreed.
---------------------------

359.4 JC to follow up dTeam actions from the DB, as follows:

---------------------------
05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda).

05.04 dTeam to publicise the 1st October as the changeover date to HEPSPEC06.
---------------------------

359.5 GS, RN, DC, LB (experiment reps) each to contact Neasan O'Neill, advising where, specifically, the GridPP website should point to for each of their experiments, in terms of user support information.

359.6 SP to ensure that Neasan O'Neill updates the GridPP website accordingly (once experiment reps have provided info as to where the GridPP website should point to for each of their experiments, in terms of user support information).

The meeting closed at 2.05 pm.

The next PMB would be held on Monday 28 September 2009 at 12:55 pm.