GridPP PMB Minutes 355 (10.08.09)
=================================

Present: David Britton (Chair), Dave Colling, Jeremy Coles, Roger Jones, Sarah Pearce, John Gordon, Tony Doyle, Andrew Sansum, Glenn Patrick, David Kelsey, Andy Richards

Apologies: Pete Clarke, Robin Middleton, Tony Cass, Steve Lloyd, Neil Geddes

1. NGI Bidding Document
========================

DB noted that there were a number of EGI global tasks that could be bid for by NGIs. JG advised that this related to global tasks like accounting, the GOCDB, and security co-ordination. JG was going to bid for these. Re the circulated document, DB advised that Section 5 had a table of EGI global tasks, ~20-30 of them. It was understood that RAL would be bidding for OE1, OE2, OE15. DB asked whether any other tasks were going to be bid for by other institutions. JG observed that it wasn't obvious that CERN would bid at all. Global tasks should be 100% funded. DB advised that OE11 co-ordination ought to be CERN. JG noted that he would talk to CERN and see what they were doing. DB asked about OE8 - what was happening about that? JG noted that he would need to ensure that David Fergusson was working on that. SP advised that she understood dissemination was assumed to be done from Amsterdam. JG advised that dissemination, co-ordination, and training were three distinct parts - JG would speak to David Fergusson. It was noted that the deadline for bidding was August 24th.

ACTION 355.1 JG to speak to CERN and find out what they intended to bid for; JG would speak to David Fergusson and check the situation re bidding for training etc.

DB asked Andy Richards if there were any more inputs required. AR noted that the table on p10 needed to be completed; the tasks started on p3. DB advised that we could look at the integrated effort to cover the tasks but we had to base this on realistic effort - table 2 asked for resources, so it should scale. JG noted that they wanted inputs mainly.
DB advised that we needed to put numbers in as best we could - this needed to be circulated for discussion as the breakdown was difficult. AR advised that people needed to have a guess at this; it was the same for NGS. JG noted that a lot of the GridPP effort that already existed was WLCG, in relation to international tasks. TD noted the areas 'user technical support' and 'international VOs' - weren't CERN going to do that? JG advised that they were bidding for an SSC to support the LHC, but they may not bid to do global EGI things.

ACTION 355.2 ALL - to provide inputs or metrics to JG & AR (looking at pp9-10 specifically, and sending comments to JG & AR) - p4 onwards provides the relevant info.

2. e-Science Review
====================

DB reported that a letter of invitation to GridPP had been circulated regarding the 9th December meeting. GridPP had been invited to give input - we can give it in our own form, and it is intended to be a brief document. SP noted that the draft document she circulated had not been progressed any further. DB advised that we need to look at the terms of reference and give them coherent answers they can use for their pro-forma. Some issues were harder to address - we needed to tackle the ones that we could. There were about 10 bullet points in total - current material needed to be re-organised to fit and the rest of the document needed to be addressed. SP agreed that she would re-organise it - was there anything missing? DB noted that it looked ok. SP agreed to re-organise it then give it to DB to finish off. It was noted that this was required by Wednesday and DB would circulate. It was agreed to try and get this done by the end of the week. TD noted that the coverage was good but the structure could be mapped, and the pro-forma also needed to be done. TD advised that none of the current LHC computing would be possible without this, an obvious point. SP to re-organise the document into headings based on the Terms of Reference.
DB to take over the document on Tuesday and issue a version for comment on Tuesday night.

ACTION 355.3 SP to re-organise the e-Science review document into headings based on the Terms of Reference and forward to DB for further inputs, by Tuesday.

JG asked about feedback from the visit. DB asked about CMS - was Bristol being invited? JG noted that there was info on who had been allocated where. DC advised that he needed to discuss whether he or Dave Newbold was dealing with it - they need to talk about it - there was confusion over PIs and those invited. JG noted that an assessment was being handled separately at RAL or there might be one overall. RJ suggested that we need to co-ordinate but that feedback may be required individually relating to project money spent. JG advised that he could change the programme as required. DB suggested that particle physics be kept together. JG noted that it would all be plenary, which meant seeing more reviewers - one presentation overall on EI was possible. DB agreed that it would be preferable for EI to be treated as a common thread across all projects. JG would do a draft Agenda.

ACTION 355.4 JG to do a draft Agenda for the e-science review visit.

3. Risk Register
=================

SP had circulated the updated version of the Risk Register. SP noted that we need to look at the high level risks and whether any other medium level ones need to be increased.

R1 - this seemed less critical than six months ago; it was agreed to reduce it to 3, 3 = 9 overall.
R5 - was this a bit better than last time? DB noted yes, it should be reduced; CASTOR was more stable; JG noted that it also addressed resilience issues.
It was agreed to change this to 2, 4 = 8 overall.
R7 - the move to the new building was now past - DB asked whether it was likely to fail to meet the pledge due to disk issues? AS noted they couldn't meet the pledge at present - R7 was therefore a 4, 2 = 8 overall. However, it was later decided to delete this risk, and incorporate issues raised with the disk as part of R9 (procurement issues).
R9 - AS advised that this related to a failure to deliver resources on time; this could be a small issue or large, due to non-compliance on EU tenders, which would mean missing a tender round; however it was low impact at present. DB observed that R9 should refer to large impact issues with the next hardware procurement, and suggested that we leave it as is. Later discussion resulted in disk issues being included here, with the risk rising to 4, 2 = 8.
R11 - AS noted that the impact on the project is bigger than 1. DB suggested changing this to 3, 2 = 6. Agreed.
R12 - It was agreed to change this to 'Machine Room problems compromise the Tier-1'. Current issues with the machine room mean that this should be 4, 3 = 12 overall.
R13 - could this remain as it was? It was felt that the 'future options' should now be moved to 'current process'. Agreed.
R14 - AS asked whether this was inconsistent with the case put to the OC? DB noted that the risk was more likely now, and should be changed to 2, 4 = 8. JC noted we could look at the number of breakages. DB observed that we could re-do the stats.
R15 - DB advised that this related to general charges, so should be left.
R16 - this related to middleware. JC noted that the DPM stuff was missing. TD noted that monitoring tools needed to be sufficient to measure high load. DB suggested 1, 3 or 2, 2. 2, 2 was agreed.
R17 - this was the same; the move to EGI meant that uncertainty had increased. It was agreed to leave at present.
R21 - GP noted that this should have improved. AS agreed - from the Tier-1 viewpoint they had sufficient effort.
It was agreed to leave as is.
R24 - JC noted that work fluctuates. DB advised that looking forward it should decrease. It was agreed to leave as is.
R27 - DB had some discussion re STFC and GridPP4 - he advised that we leave this at present.
R28 - it was agreed that this was now 'beyond' the blueprint; 'blueprint' should be omitted but EGI left in. DB suggested alternative wording: 'European Grid Initiatives' covers everything; things were proceeding as expected and the risk was not elevated. Agreed to leave as is.

SP asked if anything was missing? Nothing. SP advised that email inputs were welcome, especially from those categories that had changed. It was noted that related texts needed to be updated by the end of this week.

4. OPN Backup Link
===================

DB advised that there had been more questions from the OC. DB had drafted provisional responses and circulated them. Could we agree the wording and get it sent off? DB went through the 3 questions and proposed responses. 'Foolish' was changed to 'unwise'. DB would send the formal response to Trish Mullins.

ACTION 355.5 DB to send email response re OPN backup link questions, to Trish Mullins.

5. Week's Notes
================

- Expts requested to provide CPU RAM requirements [GP email 5/Aug]
DB noted that there had been a request from GP re a forward look at RAM requirements - this was a work-in-progress.

- LHCC draft report circulated
It was noted that the LHCC report was available but had not been widely circulated. No table had been provided and DB had sent some queries to Nick as the figures did not make sense - a reply was awaited.

- LHC News
It was noted that injection would be a total of 7 TeV per beam initially.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS reported as follows:

Fabric
------
1) Acceptance tests of the CPU are essentially complete and will be finished this week.
2) Acceptance test of one lot of disk is 70% complete and we expect to reach 90% of disk deployable within 7 days.
3) Acceptance tests of the second lot of disk have encountered a problem. We have no estimated completion date at the moment. We will be meeting the supplier on Tuesday (11th).
4) New procurements have started. Disk procurement PQQ stage is running. CPU PQQ is almost ready and will start shortly.
5) Michael Jouvin is visiting this week to help our work on QUATTOR. He will be giving a short introduction to Quattor which is also open to Tier-2 staff. This has been announced via HEPSYSMAN.
6) There will be a test of the UPS system scheduled for September. We have received a proposed date and it will be fixed and announced shortly. It is expected that we will run "at-risk".

Staffing
--------
1) The first experiment support post will start today (Andrew Lahiff). Interviews for the second post have been held and an offer is being prepared.
2) The EGEE PPS recruitment offer has been accepted and the candidate is expected to start on 17th August.
3) The YII student (funded by ESC) has started.

Service
-------
1) SAM availability for the OPS VO was 100%; however, overall we have seen increasing service instability over the last couple of weeks, with an increase in the callout rate and problems impacting the experiments.
2) CASTOR: there were problems on the ATLAS instance on Thursday (6th) after a tuning change in the number of jobmanager threads (introduced to improve our SAM test success rate) had a severe impact on database access performance after the database shifted to a new access mode.
3) There were problems on the LFC on Friday (ALARM ticket). A restart resolved them but the cause is unknown.
4) wms02 broke on Saturday and could not be restarted by on-call. Problem was resolved on Monday.
5) The Tier-1 will hold a "working at home day" on Wednesday 19th August. The weekly experiment Liaison meeting is scheduled to proceed, but only via EVO.
This is part of our Swine Flu preparation.
6) Development work is scheduled throughout the month to upgrade the 3D, LFC and FTS hardware (and separate out the ATLAS LFC). Most of this work will be carried out "at risk". A downtime has been proposed for 26th but we are investigating possible alternatives after ATLAS requirements changed.
7) Plans for migration to SL5 have been proposed to the experiments. This work will also incorporate a move to QUATTOR for SL5 deployment. The highest impact component of this work will probably take place during w/b 14th September, by which time we expect to have a production quality SL5 service.

There was a discussion on QUATTOR procurement, and upgrade to SL5. AS noted that a background paper on QUATTOR was available, which had been given at the Strategy Meeting.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ reported that he was currently catching up - things were quiet at the moment. The Hammercloud tests would continue on Tuesdays - they were slowly improving configurations at sites. Tier-1 issues over the past week had been the WMS, LFC, and disk loss. AS agreed there was a period of instability at present. RJ reported that the OC report was in hand; he needed to add STEP numbers to it, etc.

SI-3 CMS weekly review & plans
-------------------------------

DC reported that current problems were kernel-related - jobs started on a 32-bit kernel on a 64-bit machine - it was a known problem. They were awaiting more jobs to test the machine. The Tier-1 had done well; Imperial had a dCache problem, with servers dropping out recently; re the UI on the LCG CP, there was a bug and an incompatibility with submitting jobs to the Crab instances - the work was ongoing - this was also a known problem.

SI-4 LHCb weekly review & plans
--------------------------------

GP reported as follows:

1) Restarted production last week, after new disk servers became available at CERN and all failover transfer requests had finished.
The pending productions were started last Thursday and finished on Sunday (including the 10**9 minimum bias run), after quickly ramping up to > 18K simultaneously running jobs.
2) Bugs found and fixed within DIRAC, relating to job prioritisation.
3) Various Tier-1 sites (not RAL) ran out of storage in the MC-M-DST service class last week. More storage was quickly put in by those sites when alerted by GGUS tickets.
4) lcgwms02 at RAL had problems over the weekend. This caused various Monte Carlo simulation jobs to fail - primarily at Bristol.

Outlook: User analysis and further MC productions being prepared.

SI-5 Production Manager's Report
---------------------------------

JC reported as follows:

1) The Tier-2 testing using ATLAS Hammercloud submissions is continuing. These tests aim to find out the number of concurrent jobs that a site can handle before throughput and efficiency become an issue. Much is being learned on a site-by-site basis even though there have been problems with maintaining job pilot levels (for example due to pilot framework limitations, operator problems and server hardware). The tests now run for a couple of days each week and participating sites are being asked to provide a summary plot showing how their site efficiency scales with the number of jobs running.
2) A version of the regionalised GOCDB4 code is ready to be tested. This will be done at RAL on NGS hardware. The test is currently of the deployment mechanism but later will include the introduction of additional GOCDB data fields. At the moment NGS has some requirements in this area while GridPP does not. Is there any GridPP site/regional related data not currently collected that should be put forward as a requirement? [This question has been asked previously so this time it is just to check that there are no new requirements].
3) The EGEE reliability report for July has been published. UKI figures are 90% for availability and 93% for reliability.
There are only a few sites whose performance has pulled down the overall figure. IC-HEP has gone from 89% availability in June to 69% in July, QMUL from 94% to 71%, RHUL from 97% to 71% and UCL-CENTRAL from 77% to 5%. Cambridge has dropped from 93% to 85%. Explanations are currently being gathered but half are thought to be due to old hardware (CEs) and the manpower available to fix problems.
4) Camont has been running more tests for their ngram work. This seems to be uncovering issues that need further investigation and they have been requested to provide more detailed information on particular issues such as "the WMSs used were temperamental at best". No problems are being seen with respect to the levels of site traffic, which is good news. While on the topic of non-LHC/physics VOs I should mention that I have no response yet on the e-NMR VO mentioned at the last PMB.
5) The SL5 status remains unchanged in terms of currently deployed resources but several sites are expected to migrate by mid-September. The Glasgow move is of particular importance since it will be used to debug ATLAS submissions to SL5 queues. While other sites may wish to wait for the problems to be resolved first at Glasgow, we do really want other sites looking at SL5 also on this timescale in order to avoid upgrades shortly after data taking starts.
6) The TPM task looks set to change as there is a transition from EGEE to EGI. The suggestion is that 4 teams undertake the task in rotation for the remainder of EGEEIII. Our Tier-2 coordinators who currently perform this task for UKI have responded with concern about the increased workload if UKI were to form one of these teams. What are the PMB expectations and priorities in areas like this one?

SI-6 LCG Management Board Report
---------------------------------

DB reported that there had been a presentation on Storage issues; information on dCache migration to Chimera was circulated.
Under AOCB, CERN at TELECON2009 in Geneva was raised; this doesn't directly affect us, but it was noted to DC that RTM did not appear in the list of possible demos. A letter had been sent out encouraging all sites to upgrade to SL5.

SI-7 Dissemination Report
--------------------------

SP reported that Neasan O'Neill had been working on planning EGEE'09, and had published a news item on NGS.

REVIEW OF ACTIONS
=================

346.3 PC & AS to speak to Robin Tasker regarding receipt of high-level or ticket information from JANET on the service. AS had sent a reminder - no reply as yet.
348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly setup environments. This was still in progress - DB noted that the next Quarterly Reports will help and possibly render the action redundant. SP asked that this remain open until the next Quarterly Reports.
350.5 JC to check and verify that the contact list on the GOCDB is up-to-date - to be done by September.
351.3 ALL: drafts of all reports for the OC to be sent to DB by 10th August (project status papers) and 17th August (all other papers).
354.1 JC to get more info on e-NMR status and report-back; JC to also raise this issue of GridPP support for them at dTeam.
354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring.
354.3 JG to provide a brief summary of the R-GMA situation to Ian at the MB. Done, item closed.
354.4 DB to co-ordinate 16-20 page Project Status report for the OC and ensure it is submitted on time.
354.5 DB to write a 1-page Introduction for the OC Project Status Report.
354.6 TC to write a 1-page report on LCG Status for OC Project Status Report.
354.7 RM, with input from NG, to write a 1-2 page report on EGEE/EGI Status for the OC Project Status Report.
354.8 AS to write a 2-4 page report on Tier-1 Status, for the OC Project Status Report.
Done, item closed.
354.9 JC, with input from SL, to write a 2-4 page report on UK Deployment, for the OC Project Status Report.
354.10 TD to write a 1-page Technical Director's report, for the OC Project Status Report.
354.11 RJ to provide a 2-page User Report, to include relevant figures, on behalf of ATLAS, for the OC Project Status Report.
354.12 DC to provide a 2-page User Report, to include relevant figures, on behalf of CMS for the OC Project Status Report.
354.13 GP (& RN) to provide a 2-page User Report, to include relevant figures, on behalf of LHCb for the OC Project Status Report.
354.14 GP to provide a 1-page User Board Report, for the OC Project Status Report.
354.15 SP to provide a 1-2 page summary report on EI/KT & Dissemination, for the OC Project Status Report.
354.16 SP to provide the ProjectMap Report, to include the Risk Register, for the OC Project Status Report.
354.17 SP to provide the Resource Report, for the OC Project Status Report.
354.18 The Case for OPN back-up link, already completed by PC, to be included in the OC Project Status Report - DB to do.
354.19 AS to provide a specific progress report on CASTOR, a few pages long, as requested explicitly by the OC, for the OC Project Status Report. Done, item closed.
354.20 JG to provide a report on EGI/NGI/NGS and future scenarios (point 35 and Action from last OC meeting), for the OC Project Status Report.

ACTIONS AS AT 10.08.09
======================

346.3 PC & AS to speak to Robin Tasker regarding receipt of high-level or ticket information from JANET on the service. AS had sent a reminder - no reply as yet.
348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly setup environments. This was still in progress - DB noted that the next Quarterly Reports will help and possibly render the action redundant. SP asked that this remain open until the next Quarterly Reports.
350.5 JC to check and verify that the contact list on the GOCDB is up-to-date - to be done by September.
351.3 ALL: drafts of all reports for the OC to be sent to DB by 10th August (project status papers) and 17th August (all other papers).
354.1 JC to get more info on e-NMR status and report-back; JC to also raise this issue of GridPP support for them at dTeam.
354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring.
354.4 DB to co-ordinate 16-20 page Project Status report for the OC and ensure it is submitted on time.
354.5 DB to write a 1-page Introduction for the OC Project Status Report.
354.6 TC to write a 1-page report on LCG Status for OC Project Status Report.
354.7 RM, with input from NG, to write a 1-2 page report on EGEE/EGI Status for the OC Project Status Report.
354.9 JC, with input from SL, to write a 2-4 page report on UK Deployment, for the OC Project Status Report.
354.10 TD to write a 1-page Technical Director's report, for the OC Project Status Report.
354.11 RJ to provide a 2-page User Report, to include relevant figures, on behalf of ATLAS, for the OC Project Status Report.
354.12 DC to provide a 2-page User Report, to include relevant figures, on behalf of CMS for the OC Project Status Report.
354.13 GP (& RN) to provide a 2-page User Report, to include relevant figures, on behalf of LHCb for the OC Project Status Report.
354.14 GP to provide a 1-page User Board Report, for the OC Project Status Report.
354.15 SP to provide a 1-2 page summary report on EI/KT & Dissemination, for the OC Project Status Report.
354.16 SP to provide the ProjectMap Report, to include the Risk Register, for the OC Project Status Report.
354.17 SP to provide the Resource Report, for the OC Project Status Report.
354.18 The Case for OPN back-up link, already completed by PC, to be included in the OC Project Status Report - DB to do.
354.20 JG to provide a report on EGI/NGI/NGS and future scenarios (point 35 and Action from last OC meeting), for the OC Project Status Report.
355.1 JG to speak to CERN and find out what they intended to bid for; JG would speak to David Fergusson and check the situation re bidding for training etc.
355.2 ALL - to provide inputs or metrics to JG & AR (looking at pp9-10 specifically, and sending comments to JG & AR) - p4 onwards provides the relevant info.
355.3 SP to re-organise the e-Science review document into headings based on the Terms of Reference and forward to DB for further inputs, by Tuesday.
355.4 JG to do a draft Agenda for the e-science review visit.
355.5 DB to send email response re OPN backup link questions, to Trish Mullins.

TIMELINE FOR OC DOCUMENTS:
[Aug 17th: First complete draft of all other papers]
Sep 1st: Final version of papers available
Sep 7th: F2F at Cambridge and OC papers submitted.
Sep 15th: OC at MRC in London

FURTHER PMB MEETING DATES:
[Aug 17th - cancelled]
Aug 24th - PMB
Aug 31st - cancelled (UK Bank Holiday)
Sep 2nd - special PMB to address issues re OC meeting
Sep 7th - F2F at Cambridge

The NEXT PMB would take place at 12:55 pm on Monday 24th August.