GridPP PMB Minutes 372 (11.01.10)
=================================

Present: David Britton (Chair), Sarah Pearce, Andrew Sansum, Tony Doyle, Jeremy Coles, Pete Clarke, Steve Lloyd, Roger Jones, John Gordon, David Kelsey, Tony Cass, Glenn Patrick, Dave Colling, Neil Geddes (Suzanne Scott, Minutes)

Apologies: Robin Middleton

1. GridPP4 Proposal - Status of Contributions
==============================================

DB advised that the plan today was to go through the status of the GridPP4 proposal, and he reminded the meeting that a rough complete draft was required by ~tomorrow, and certainly in good time before the Friday F2F in Glasgow.

Section-5: SP had sent her section to DB. DB noted a possible change required in the bullet points as they took up a lot of space. SP could shorten this.
Section-6.1: GP was about to contribute some text following the table, relating to non-LHC experiments. GP would do this today/tonight.
Section-6.3: GP confirmed he would also do this today.
Section-6.4: PC had sent a draft to DB on network requirements. Again it was noted that the bullet points would need to be condensed.
Section-7.1: AS had forwarded a draft. DB noted he had inserted this into the document and it looked like it was exactly as required. AS could keep the token until Wednesday in order to insert further figures. The default was that everyone would keep their token until told otherwise. It was noted we would need to re-examine the 'below the line' figures for the final version.
Section-7.2: DB had written a description of the Tier-2 manpower and the principles behind its distribution, and a spreadsheet also showed potential allocations to Institutions - this would be discussed on Friday. DB noted that all information was confidential at present until agreed and put forward to the CB etc. It was noted that the text and table had been sent to RJ, as ATLAS was the biggest stakeholder. RJ had indicated that he was happy with the approach.
ATLAS-specific effort would be allocated to 'green' sites. RJ advised that a holistic approach was important overall. DB would re-send a new version tomorrow; he highlighted that the text rationale was important and the table should reflect it. DB had also sent the documents to DC - did he have any comments? DC thought the section well-written and it handled the issues with delicacy - he was happy with it. He would send minor text amendments to DB.
SL advised that he would add a table on existing resources to the Tier-2 hardware document. SL would send this to DB early tomorrow. DB asked whether we needed to mention the MoUs? SL noted they needed to be updated. TD asked whether they would be formalised in the same way? DB noted we were giving them posts. RJ noted that posts were going to sites rather than Tiers. SL pointed out that the MoU had never been tested. TD advised that it would be better if they were tied to the grants. SL advised that a condition could be added later once the hardware was allocated. DB would send a spreadsheet to SL.
Section-8: JC had sent text which DB had added to the document. JC wrote the overall draft but it needed input from JG on NGI. How did the additional EGI posts fit in? JG advised that EGI expected 6 people from the UK to do things towards NGI, so we needed to define these 6 people and we would receive funding for 2 posts (which were International Task posts). DB summarised that these 6 posts were part of the UK NGI and there was a strong overlap with GridPP. In his model DB had split the 2 EGI posts into 4 halves where Institutes had deputy Tier-2 Co-ords and also knew how to claim. JC should assume Glasgow, Manchester, Imperial & Oxford. We would receive 0.5FTE from Europe and report 1.5FTE back. JG gave JC information on the GOC posts.
Section-9: GP had sent a draft re the experiment-specific support work package D. DB had added this.
Section-10: TD would provide a paper on Wednesday afternoon.
Section-12: SP had text on 'Impact Planning' but she was currently checking STFC guidance on what should be included in this.
Section-13: TD had changed the title of this section and had written some text - he would provide this to DB on Wednesday morning.
Section-14: SP, DB & AS had been working on this and were almost at convergence. Taking into account the prevailing uncertainty regarding funding etc, this was largely within budget guidelines provided by Tony Medland. It was noted that the Working Allowance and Contingency would be dealt with afterwards once the other sections were complete.

Risk Register
-------------

SP wanted to start from scratch with this and she asked everyone to bring their 'top 3' risks to the F2F. These should be high-level risks for discussion, for the GridPP4 project. It was noted that Working Allowance and Contingency need to be covered within the risks.

RJ, DC, GP - experiment side
AS - Tier-1 side
JC - Production etc

ACTION 372.1 ALL: to bring their 'top 3' high-level risks for GridPP4 to the Friday F2F. Areas of key responsibility should be considered (rather than risks across the project overall). If everyone could also provide a 'top 12' overall, but 'top 3' specifically, that would be preferred.

DB would circulate a draft tomorrow afternoon with what he has - he needs to receive contributions by 1pm, and this version would be circulated for discussion on Friday.

2. Changes to OPN Provision
============================

PC advised of changes to the costs for the resilient link. From 01/08/2010 onwards the previous quote would change from £40k to £58k for a 10gig resilient link. The previous per annum quote had missed out an element and would take a diverse route through GEANT. PC advised that it was actually more expensive for a 4 x 1 link than for a 10 x 2 link. Were the PMB happy with the increased cost, which would be ~£18k recurrent from 01/08/2010 (which was the STFC new financial year)?
PC added that there was no change to the cost of installation or procurement. Furthermore, PC reported that at present the cost of the existing 10gig link (a per annum recurrent cost ending 31/07/2010) was 'hidden' within the JISC agreement with the Research Councils, but this was likely to become visible/transparent even though the overall cost was less at £40k - it was likely that this would have to be funded by the project. DB advised that he had been speaking to Tony Medland recently and would bring up this issue with him. Could we use some of the returned money to fund the cost of the existing link? DB felt that an increment of £18k was ok in the context of £2 million on hardware p.a. The issue of funding the existing link was a smaller overhead, which we should probably just pay for - it was better to fight larger battles. There was a discussion on RCUK JISC funding and where the cost would be applied (ie: to STFC). PC would speak with Robin Tasker. It was noted that the previous existing link was paid for at RCUK JISC level but now the STFC/project should pay more directly. DB would need to receive any relevant information from Robin Tasker by Wednesday morning.

3. Week's Notes
================

Multi-user Pilot Jobs Questionnaire - It was noted that this issue would be raised by JC within his Production report.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS reported that the installation of the disk servers had gone well over the Christmas period - this should now be a 'solved problem' as acceptance tests were so far being passed. Drives were being replaced. AS would circulate a timeline.

ACTION 372.2 AS to circulate a timeline for disk server installation completion and drives replacement.

AS reported that the UPS situation was progressing - switch-over to the bypass went smoothly and the 'noise' went away. He was due to have a meeting with the electrical people today at 3pm; cable calculations were ongoing.
Switching was not considered to be a major risk. The raid arrays were ongoing.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ reported that things were quiet at present; they were doing reprocessing at the end of the month.

SI-3 CMS weekly review & plans
-------------------------------

DC reported that everything was going smoothly, however Bristol had been a liability for the past two weeks. The Tier-1 was looking good.

SI-4 LHCb weekly review & plans
--------------------------------

GP reported that things were quiet; there was an ongoing attempt at understanding the disk server crashes at RAL.

SI-5 Production Manager's Report
---------------------------------

JC reported as follows:

1) UKQCD are keen now to progress with their SRM interface and have requested enablement with 1TB disk either at Edinburgh or Glasgow. Our current recommendation is Glasgow, as Edinburgh will soon move to STORM and is more complicated in terms of resource usage (though they may reach an agreement internally). At the level of 1TB and to take advantage of the different testing environments, enablement at both sites is likely to happen.
2) camont have seen large fluctuations in the number of jobs that they can run. This is possibly an effect of fairshares coming into play, but warrants further investigation as over the last week utilisation of CPU has been low.
3) The WLCG technical forum has today circulated a request for sites to read a wiki-page summary of multi-user pilots and then answer a questionnaire on usage: https://wlcg-tf.hep.ac.uk/wiki/Multi_User_Pilot_Jobs
4) SA3 + JRA1 in EGEE have proposed rapid deprecation of old releases (2 months). The current feeling is that this is too fast, but some clear timelines would be useful.
This topic also arose in discussion of operational metrics last week as we do not currently have a good measure of how well sites are keeping up with middleware releases (though we do know that schedules often slip – SL5 WNs being the latest example).

SI-6 LCG Management Board Report
---------------------------------

DB noted there was a meeting taking place tomorrow.

SI-7 Dissemination Report
--------------------------

SP noted nothing urgent to report; Neasan O'Neill would be posting some news items soon - any info should be sent to him asap. DB suggested he send a reminder round.

REVIEW OF ACTIONS
=================

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly setup environments. This was still in progress - DB noted that the next Quarterly Reports will help and possibly render the action redundant. SP asked that this remain open until the next Quarterly Reports. JC noted this is not possible to complete because the tests themselves were not valid for a period during Q3 and also the start of Q4. This item is now closed, as the metric has become invalid.
354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring. JC reported that the consultation happened. There were a few suggestions in the deployment team about how to progress in this area. It needs writing up and an implementation plan. Ongoing.
358.1 SP to work with the working group on the following issues in relation to GridPP/NGS convergence: 1. identify Institutes 2. identify manpower 3. decide who is bidding for what - a draft transition plan would be made available by the end of the year; GridPP4 requirements would also be considered. SP was waiting on the Working Group to reply to her. SP reported there had been an email exchange and she had sent suggestions on how to move forward.
JC had met with Andy Richards at RAL. One of the issues was uncertainty in relation to funding - SP needed more detail re resources and options for the future, also for EGEE-funded manpower at present and what we have signed up to in NGI. She was awaiting a response. DB noted that NG was trying to understand the proposals and the funding fractions. In defining GridPP4 we needed to define these posts and responsibilities. JC noted that JG had suggested going through the EGI proposal document for info. A meeting had been held before Christmas re a transition plan. SP was awaiting a skeleton outline plan from RM, allocating people to sections. Ongoing.
359.4 JC to follow up dTeam actions from the DB, as follows:
---------------------------
05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda).
---------------------------
JC would check the situation with Graeme Stewart (who was currently on annual leave). JC followed up with Graeme and the other experiments. A test was started but this area has been deemed low priority and further progress is not expected for some time. ATLAS see no issues with contention. LHCb are not intending to pursue anything in this area. A CMS discussion has started but again it does not appear to be urgent. If the experiments are not pushing this internally then there is nothing for the deployment team to follow up! It was noted there was no priority in ATLAS at present, this will be pending for a while. Move to inactive as it is a long-term action.
361.3 JC and AS to check Tier-1 and Tier-2 gstat2 results (in relation to SL5 having been discussed at the GDB). JC reported that he had checked this, but information from AS was still awaited. JC noted that there were issues with the results. JC noted that the action as worded was done weeks ago. It is an ongoing deployment team action on the Tier-2 coordinators to get the information corrected. JC action done.
AS still to respond. Tickets were now being produced. Action closed.
366.2 RJ to provide ATLAS HW requirements for 2011-15. RJ & DC had a preliminary discussion - they need to agree common profile, even if it is flat cash. Done, action closed.
366.3 DC to provide CMS HW requirements for 2011-15. In progress. Done, action closed.
366.4 GP to provide LHCb HW requirements for 2011-15. GP had started this. DB noted he needed the numbers for hardware costings and needed something soon to begin work. The deadline was 2 weeks for prelim. numbers. AS would look at them as well. Done, action closed.
366.5 SL/DB to estimate what fraction of STFC funding goes to non-LHC groups. What about the theory side? Done, action closed.
366.6 GP to invite input from Other Experiments. Done, action closed.
366.7 DB (in consultation with AS) to provide HW-cost estimates for 2011 - 2015. DB was awaiting inputs. Done, action closed.
366.8 AS to confirm that the Tier-1 proposes to use Tape-based storage in the period 2011 - 2015. AS noted this depended on money costs. DB advised this related to long-term plans and power capacity. Physical footprint space? Alternatives? Early action on AS required. AS had sent tech questions round the team and would forward inputs when available. DC noted to the meeting that today was the 16th Nov - only 4 weeks remained until Imperial, by which time we needed to have made extensive progress. To be discussed at the F2F on Friday. Ongoing.

For the next 3 actions, discussion has taken place - if GridPP wish to use tape, they do not object. All 3 actions closed.

366.9 RJ to confirm that ATLAS supports the use of Tape storage in the period 2011-2015. RJ noted they had a belief in the archival work but the cost was to be provided by the provider. Tape would have a front-end staging system. DB asked whether they might want to move to another model? We should not assume that tape will do. A statement was required. Done, action closed.
366.10 DC to confirm that CMS supports the use of Tape storage in the period 2011-2015. Done, action closed.
366.11 GP to confirm that LHCb supports the use of Tape storage in the period 2011-2015. Done, action closed.
366.12 SP to liaise with AS to establish non-capacity costs. SP advised that discussions had started. DB noted a long-term question about the model. Done, action closed.
366.13: SP to request and collect first cost estimates of posts for GridPP4. FEC and non-FEC posts need to be costed. The Tier-1 posts should be costed as accurately as possible as soon as possible since there is a large lever arm here. Done, action closed.
366.14 DK to provide first estimate of average RAL post cost on the basis of the current distribution of posts/grades. Clearly this will need refinement once we understand the final mix better. DK had already started this and would have estimates later this week. Done, action closed.
367.1 ALL: to send email responses/thoughts to DB, or to the list, on NGI issues discussed. Done, action closed.
367.2 RM to fill-in the grey boxes on DB's UK NGI diagram of a minimal NGI, as to what NGS would be doing in the areas listed. Ongoing.
367.3 JG to contact Ian Bird directly, immediately, and ask for a clear formal statement in relation to multi-user pilot-jobs by the experiments. This formal statement was required immediately - we could not wait for this issue to be brought up at the next MB. This was done and a survey is now being carried out. Done, action closed.
367.5 DB to send formal information round the community re multi-user pilot-jobs, once clear statements had been received from Ian Bird (via JG) and the experiments (via JC). Done, action closed.
367.6 RJ to submit a proposal to the PMB for funding assistance for the next ATLAS tutorial. Done, action closed.
368.1 DB to circulate an initial informal paper on NGI Interface in advance of the upcoming F2F in order to form a basis for further discussion. Done, action closed.
368.2 DB to circulate an initial informal paper on Tier-2 Structure in advance of the upcoming F2F in order to form a basis for further discussion. Done, action closed.
368.3 SP to circulate an initial informal paper on Project Management in GridPP4 in advance of the upcoming F2F in order to form a basis for further discussion. Done, action closed.
368.4 SP to circulate an initial informal paper on Economic Impact, Knowledge Exchange and Dissemination in advance of the upcoming F2F in order to form a basis for further discussion. Done, action closed.
368.5 DB/AS to circulate an initial informal paper on Hardware Requirements in advance of the upcoming F2F in order to form a basis for further discussion. Done, action closed.
368.6 AS/DB to circulate an initial informal paper on Tier-1 Role and Requirements in advance of the upcoming F2F in order to form a basis for further discussion. Done, action closed.
368.7 TD to circulate an initial informal paper on Technical (Middleware) Support in advance of the upcoming F2F in order to form a basis for further discussion. Done, action closed.
368.8 JC to circulate an initial informal paper on Deployment Support in advance of the upcoming F2F in order to form a basis for further discussion. Done, action closed.
368.9 GP to circulate an initial informal paper on Experiment Support in advance of the upcoming F2F in order to form a basis for further discussion. Done, action closed.
368.10 TD to circulate an initial informal paper on Cloud Computing in advance of the upcoming F2F in order to form a basis for further discussion. Done, action closed.
368.11 SP to circulate an initial informal paper on Financial Planning in advance of the upcoming F2F in order to form a basis for further discussion. Done, action closed.
368.12 ALL: comments on Tier-2 structure to be sent to DB. Done, action closed.
368.13 ALL: comments on Project Management to be sent to SP. Done, action closed.
368.14 AS to iterate with Gareth in relation to actions required for downtime communications. Done, action closed.
368.15 GP, DC & RJ to provide experiment input to DB/AS for 'Hardware Requirements' initial document for discussion at Imperial, which DB/AS would prepare. Done, action closed.

ACTIONS AS AT 11.01.10
======================

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring. JC reported that the consultation happened. There were a few suggestions in the deployment team about how to progress in this area. It needs writing up and an implementation plan.
358.1 SP to work with the working group on the following issues in relation to GridPP/NGS convergence: 1. identify Institutes 2. identify manpower 3. decide who is bidding for what - a draft transition plan would be made available by the end of the year; GridPP4 requirements would also be considered. SP was waiting on the Working Group to reply to her. A meeting had been held before Christmas re a transition plan. SP was awaiting a skeleton outline plan from RM, allocating people to sections.
366.8 AS to confirm that the Tier-1 proposes to use Tape-based storage in the period 2011 - 2015. AS noted this depended on money costs. DB advised this related to long-term plans and power capacity. Physical footprint space? Alternatives? Early action on AS required. AS had sent tech questions round the team and would forward inputs when available. DC noted to the meeting that today was the 16th Nov - only 4 weeks remained until Imperial, by which time we needed to have made extensive progress. To be discussed at the F2F on Friday.
367.2 RM to fill-in the grey boxes on DB's UK NGI diagram of a minimal NGI, as to what NGS would be doing in the areas listed.
372.1 ALL: to bring their 'top 3' high-level risks for GridPP4 to the Friday F2F. Areas of key responsibility should be considered (rather than risks across the project overall).
If everyone could also provide a 'top 12' overall, but 'top 3' specifically, that would be preferred.
372.2 AS to circulate a timeline for disk server installation completion and drives replacement.

INACTIVE CATEGORY
=================

359.4 JC to follow up dTeam actions from the DB, as follows:
---------------------------
05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda).
---------------------------
JC would check the situation with Graeme Stewart (who was currently on annual leave). JC followed up with Graeme and the other experiments. A test was started but this area has been deemed low priority and further progress is not expected for some time. ATLAS see no issues with contention. LHCb are not intending to pursue anything in this area. A CMS discussion has started but again it does not appear to be urgent. If the experiments are not pushing this internally then there is nothing for the deployment team to follow up! It was noted there was no priority in ATLAS at present, this will be pending for a while. Move to inactive as it is a long-term action.

---------------------

DB noted he would organise a PMB dinner on Thursday evening and would circulate details. An Agenda for the F2F was required - SP would do this. The meeting would close at 4pm. All: to send departure times to DB.

The next meeting would be a F2F at Glasgow on Friday 15th January 2010.