GridPP PMB Minutes 374 (25.01.10)
=================================

Present: David Britton (Chair), Sarah Pearce, Andrew Sansum, Tony Doyle, Jeremy Coles, Pete Clarke, Steve Lloyd, John Gordon, Tony Cass, Glenn Patrick, Robin Middleton, Dave Colling, Neil Geddes (Suzanne Scott, Minutes)

Apologies: David Kelsey, Roger Jones

1. GridPP4 Proposal: Status of v8.3
====================================

DB advised that we needed to look at v8.3 of the proposal and check outstanding issues, and also look at the OC papers. The proposal was coming together but there were items still to do.

p3 (4. International Context) This needed a bit of work. NG agreed to do it this afternoon. It was noted that this section sets the tone for the proposal, and it needed to flow.

ACTION 374.1 NG to work on (4. International Context) to ensure correct tone and logic of flow.

p4 (5. The UK Grid ..) DB felt that the text beneath the Figure was confusing in relation to EGEE and wLCG and percentages of totals. It was agreed to either relate them, or remove one of them. SP to do.

ACTION 374.2 SP to rework the EGEE and wLCG percentages on p4 beneath the Project Map (5. UK Grid ..) and either relate them or remove one of them.

p5 (5.1 GridPP4 ..) DB had highlighted three paragraphs - did these need to be rewritten? NG had sent an email and PC had also commented. It was agreed that NG and PC would iterate on this section.

ACTION 374.3 NG and PC to iterate on the three highlighted paragraphs in section 5.1.

p8 (6.2 Global Resource Requirements) DB noted that work was required re Appendix B - DB to do.

ACTION 374.4 DB to work on Appendix B.

p9 (6.3 UK Computing Resource Requirements) DB noted that a highlighted sentence needed to be better phrased and/or justified. AS would do this.

ACTION 374.5 AS to better phrase and/or justify highlighted paragraph in section 6.3, p9.

p10 (6.4 UK Network Requirements) DB noted the yellow highlights on network capacity and dates - info was awaited.
It was agreed to change the wording to: 'during GridPP4' rather than specifying certain dates. DB to amend.

ACTION 374.6 DB to amend the wording highlighted yellow on p10 (6.4 UK Network Requirements). It was agreed to change the wording to: 'during GridPP4' rather than specifying certain dates.

p11 (6.5 UK Service Requirements) DB noted a paragraph here about service requirements - DB to sort this out and ensure it is correct.

ACTION 374.7 DB to sort out paragraph on service requirements (p11 - 6.5 UK Service Requirements) and ensure it is correct.

p12 DB advised that the first comment on p12 was now done. In section 7.1 - WP-A: PC suggested removing the highlighted paragraph. DB advised that he had done a paragraph in detail as he wanted this on record as a process which GridPP always carried out. There ensued a discussion on the paragraph inclusion and the buffer issue. AS suggested moving the para to the Resource section.

p16 (7.2 WP-B) DB noted the problem area here was its relationship to section 8 - section 7.2 didn't work at present. There ensued a discussion about the title of the 8 x Site Leaders; various suggestions had been put forward to DB. After much debate and many suggestions, it was decided to drop the title (Site Leaders) and just refer to them as the core operations team members, which is one of their roles and not a title.

p21 (8.1 The Operations Team) DB asked what these eight items were? JC advised they were tasks divided between the eight people. These were the things that we needed to do as core operations; however, he did not mean for a post to be mapped to a task on the list. JC advised that the Production Manager has oversight of all of these issues. DB asked about the Security Team - we do have a security officer to do this at NGI level, but someone is needed for the GridPP sites? Did monitoring and accounting overlap with GOC? JG noted that accounting was ensuring publishing was being done ok.
DB also noted that documentation overlapped with documentation later on. PC noted that these 8 roles were justified in Section 2 - you need two people to run a Tier-2 as part of that role. One person from these sites would be involved in these tasks as well. DB thought this a matter for internal management rather than being a specific part of the proposal. DB would go over the section again and circulate a new version. JG noted that sites had to understand that two people weren't there solely to keep the site running. DB thought that this section could also be moved to Appendix A. JC would send answers to the comments raised. DB would revise the section.

ACTION 374.8 JC to send answers to the comments made about section 8.1.

ACTION 374.9 DB to go over section 8.1 again (addressing & amending the 8 points) and circulate a new version.

p16 (7.2 WP-B Tier-2 Effort) PC had noted a comment on this page. DB advised that RJ had added this sentence deliberately. DB would iterate with RJ.

ACTION 374.10 DB to iterate with RJ regarding the added sentence by RJ on p16 (sect 7.2 WP-B Tier-2 Effort).

p23 (8.3 Data Support) DB noted that the points made here could go into the Appendix. It was noted that at the top of p24 there was too much 'liaison' emphasis. Re point (2.) on p23, it was agreed to move the second half of the sentence (commencing: 'This post would ..' to '.. the resources') to the section above, which would help introduce the post descriptions. DB to do.

ACTION 374.11 Re (8.3 Data Support): Re point (2.) on p23, DB to move the second half of the sentence (commencing: 'This post would ..' to '.. the resources') to the section above, which would help introduce the post descriptions.

p25 (8.3 Data Support) DB advised that this high-level objectives list was too long, and that it should end at No 10. It would be better to reduce the list to 10 points. The following list of 8 points - did these help at all?
PC advised that he would drop the tasks - they introduce confusion. This was agreed.

ACTION 374.12 Re: (8.3 Data Support) DB to delete tasks 1-8 on page 25; and close the list above at point No 10.

p24 (8.3 Data Support) Re point no 5 on p24: we could put 'Data Transfer' as a miscellaneous line in networking. DC felt it was ok here; it was not a weak point to make in the context. TD would prefer to change the points round to make it work better - he thought that 4a and 4b were not required, and they should be amalgamated.

p25 (9 WP-D) DB advised that the CMS Storage Support Post on p26 was the weakest. DC to re-work the paragraph.

ACTION 374.13 DC to re-work the paragraph on the CMS Storage Support Post (in 9 WP-D p26).

DB noted that he needed all changes to be sent to him by Wednesday a.m. It was agreed to remove the word 'storage' from the CMS Storage Support Post (cf sect 8.3).

ACTION 374.14 DC to remove the word 'storage' from the CMS Storage Support Post.

DB asked SS to assist with the figure nos and formatting the document, on Tuesday afternoon.

p29: DB asked if we should remove the table of PMB Roles? If it were to be included, it would need text. It could go in an Appendix. SP suggested making it shorter and having people in the list along with the roles assigned to them; also we could change the order. DB to sort, and add security.

ACTION 374.15 DB to re-work the PMB Roles table, to include the people involved with their different roles, and also change the order. Security to be added.

TD commented re the Impact Plan, that it had a long description for 1 x FTE. SP noted that this affects things externally to the project by way of external engagement. DB advised that the terms of reference do ask for an Impact Plan, so he felt it was ok that there is more info here than would be usual for a 1 x FTE. Was there an 'impact' person at Cambridge? SP wanted to leave it as is for the OC.
DB advised of the background from STFC in relation to the 'impact' requirement. TD noted it strengthens our chance of success. SP advised that we may only get 0.5 FTE and then we would have no dissemination function. It was suggested that we say upfront that re QM and Cambridge it would be 2 x half posts, depending, with the 1 x FTE going to QM initially? SP would update the impact section.

ACTION 374.16 SP to update the impact section with a note on the posts breakdown.

p32 (12 The Wider Context) TD would revise this by Wednesday a.m.

ACTION 374.17 TD to provide a revision of section 12 by Wednesday morning.

DB asked if there was an OC paper on cloud computing? TD noted no. DB advised that no more than one page was required for the OC. TD to do.

ACTION 374.18 TD to provide no more than one page on cloud computing for the OC doc.

p36 (13 Resource Request) DB was looking at Table 8 - if 30 million were a possibility, what do we put below the line? On working allowance and contingency - these need to be tied into risk assessment, e.g. a likelihood of 0.5 @ 2 (hardware costing model) on Table 10 was not correct? SP said it was difficult to get 'high risk' any other way because of the way STFC view the figures. DB thought the risk table figures unlikely - it was not 50% likely that we get No 2 wrong. Could we do this on our own form? SP noted no - they explicitly sent us a risk template. DB noted it was difficult to say what a reasonable uncertainty was. There ensued a discussion on hardware requirements. DB would send TD the hardware costings and requirements for TD to look at the figures (he noted they looked different to the hardware lines in GridPP3).

ACTION 374.19 DB to send TD the hardware costings and requirements for TD to look at the figures (he noted they looked different to the hardware lines in GridPP3).

DB noted that for Table 9 we needed to separate above and below the line costs - SP to do (p39).

ACTION 374.20 SP to separate the figures showing costs above and below the line in Table 9 (sect 13.9 Resource Request Summary).

p41 (Project Level Risks Table 10) DB noted that table 10 needed to be revisited. JG thought they might use this to work out the contingency.

p43 (Table 11) DB noted that working allowance should be under the control of the Project, and could be included in the £30 million. Contingency, by comparison, was hypothetical and below the line. The working allowance should not be that large, and we should reduce contingency. SP would reduce the numbers in the risk tables. SP/DB to iterate on Table 11.

ACTION 374.21 SP to reduce the numbers in the risk tables. SP/DB to iterate on Table 11.

2. Preparations for the Oversight Committee
============================================

DB had comments on the Resource Report - SP should speak to him following the meeting.

Project Status Document
-----------------------

DB had amalgamated the information. JG had contributed LCG status. RM - something on the EU was required. RM reported this was ongoing.

ACTION 374.22 RM to contribute text on EU for the Project Status Doc (OC doc).

AS had sent round the Tier-1 status; he will update with other info he had missed.

Deployment Status
-----------------

JC had sent text - DB noted it looked quite long. DB/JC to iterate.

ACTION 374.23 DB/JC to iterate on the deployment status document for the OC.

GP noted that for User Reports he had done LHCb and 'Other'. Nothing had been received from ATLAS or CMS. DC to do.

ACTION 374.24 RJ/DC to provide a paragraph of info to GP for the User Reports.

It was thought that some collision pictures would be good to include if possible. DB noted he had a picture of collisions for all four experiments in one of his talks on the web; DC could take it from there.

Dissemination/KE/EI
-------------------

SP had circulated this. DB would tie up all of the inputs by Thursday.

It was noted that there was no time left this week to discuss the usual formal reports. Highlights were as follows:

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS reported the following:

- a major intervention starting today re UPS power supply, planned to finish it today
- programme of moving the database service plus a variety of other changes over the next 3 days
- procurement was going well with deliveries on schedule
- disk server installation fix had been carried out, disks replaced, first servers were due this week
- they are experiencing a high failure rate on half of the 2006 hardware (3-years life at present)

SI-2 ATLAS weekly review & plans
---------------------------------

RJ was not present.

SI-3 CMS weekly review & plans
-------------------------------

DC reported they were doing reprocessing; things were quiet; they had been told to expect 'two years of straight running'.

SI-4 LHCb weekly review & plans
--------------------------------

GP reported that the RAL Tier-1 was ok at present although they had job failures from users in relation to applications; disk servers had failed but were redeployed.

SI-5 Production Manager's Report
---------------------------------

JC reported as follows:

There were no objections to the setting up of the super-b VO at the Tier-1 (raised in a mail from me to the PMB on 7th December). GP had approved an allocation and therefore the work had commenced. JC understood that the VO was also going to be supported at QMUL and would in the medium term be requesting wider Tier-2 support. Could we add it to the list of approved GridPP VOs? If so they will be enabled with the default fairshares.

The PMB agreed yes, the VO should be supported.

SI-6 LCG Management Board Report
---------------------------------

JG reported there was an action on the UK to find a Tier-2 person to contribute to OPN work. Robin Tasker was the main person on OPN. JC would ask at the dTeam meeting.
It was noted it would have to be someone who knew about dataflow, possibly Brian.

ACTION 374.25 JC to ask at the dTeam meeting if there was anyone from the Tier-2 who could contribute to OPN work (someone with dataflow experience).

JG reported that ALICE had been running 3000 jobs at RAL over Christmas, and at present.

AOB
===

There was a brief discussion on EGI and bridging plans via GridPP.

REVIEW OF ACTIONS
=================

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring. JC reported that the consultation happened. There were a few suggestions in the deployment team about how to progress in this area. It needs writing up and an implementation plan.
358.1 SP to work with the working group on the following issues in relation to GridPP/NGS convergence: 1. identify Institutes 2. identify manpower 3. decide who is bidding for what - a draft transition plan would be made available by the end of the year; GridPP4 requirements would also be considered. SP was waiting on the Working Group to reply to her. A meeting had been held before Christmas re a transition plan. SP was awaiting a skeleton outline plan from RM, allocating people to sections.
366.8 AS to confirm that the Tier-1 proposes to use Tape-based storage in the period 2011 - 2015. AS noted this depended on money costs. DB advised this related to long-term plans and power capacity. Physical footprint space? Alternatives? Early action on AS required. AS had sent tech questions round the team and would forward inputs when available. DC noted to the meeting that today was the 16th Nov - only 4 weeks remained until Imperial, by which time we needed to have made extensive progress. To be discussed at the F2F on Friday.
367.2 RM to fill-in the grey boxes on DB's UK NGI diagram of a minimal NGI, as to what NGS would be doing in the areas listed.
372.1 ALL: to bring their 'top 3' high-level risks for GridPP4 to the Friday F2F.
Areas of key responsibility should be considered (rather than risks across the project overall). If everyone could also provide a 'top 12' overall, but 'top 3' specifically, that would be preferred.
372.2 AS to circulate a timeline for disk server installation completion and drives replacement.
373.1 Re (4.1) EGI: JG to fix the paragraph and send TD the bits he had cut out.
373.2 Re (4.1) EGI: TD would respond by amending his section 13 accordingly.
373.3 Re (5) UK Grid: SP to change the Project Map by increasing the size of the words within the boxes, remove the numbers, the fonts generally to be increased in size. Take out the navigation links.
373.4 Re (5) UK Grid: SP to check the reference to the last OC paper, which seemed incorrect.
373.5 Re (5) Highlights of GridPP3: SP to track down the more recent OC quote regarding the success of STEP'09.
373.6 NG, PC & RJ to send comments to DB on section 5.1 - GridPP4 high-level view.
373.7 Re (6.1) RJ to provide DB with text for the ILC etc ('design studies for future linear colliders' or similar).
373.8 DB to add a contingency line to the table in section 6.3.
373.9 JG to provide an explanatory sentence for HEPSPEC for section 7.1.
373.10 DB to remove 'try to' throughout.
373.11 AS to do a scenario on electricity costs for WP-A based on new input from DB. AS/DB to iterate.
373.12 SL to re-arrange the text argument sections in 7.2.
373.13 DB to remove 'simulation' and 'local' from GridPP4 Roles table (p21), and insert instead: ATLAS Group & User Analysis, CMS Group & User Analysis, etc.
373.14 RJ/AS to harmonise statements on changes to the Tier-1 and Tier-2 sizes.
373.15 Re Tier-2 H'ware plots: SL to re-do his table according to the new numbers sent by DB, and extend to 2015. The last sentence needed work.
373.16 SL to amend the CPU and disk tables in the Tier-2 Hardware section, and an explanation of the table was required.
373.17 JC to send DB his revised draft of WP-C.
373.18 GP to re-site the summary of WP-D within the resource section.
373.19 Re WP-D: GP to keep the best case, remove the worst case, and move the contingency issue to the contingency section.
373.20 Re WP-D: GP to note 1 x FTE in the area of technical support and documentation/trouble-shooting, footnoted as 2 x 0.5 FTE which would be 'experiment-specific technical support including non-LHC experiments and assisting with technical middleware problems'.
373.21 NG/RJ to send comments to TD regarding WP-E and broadening the experiment-related data post(s).
373.22 TD to remove the detail and leave the general points on p30, providing a high-level justification rather than an objectives list.
373.23 DB to remove the arrows from the diagram in WP-F. Also remove the first sentence of the following paragraph.
373.24 ALL: to send any comments to SP on the Impact section.
373.25 TD to re-visit the Wider Grid Context section and amend tone and purpose of the section as per comments received.
373.26 Re (14) Resource Request: DB to send SL the hardware costings to let SL check the Tier-2 sum.
373.27 Re (14) Resource Request: SP/DB need to converge on the remainder of the section.
373.28 ALL: to send risks to SP by email, including mitigations, by tomorrow.

ACTIONS AS AT 25.01.10
======================

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring. JC reported that the consultation happened. There were a few suggestions in the deployment team about how to progress in this area. It needs writing up and an implementation plan.
358.1 SP to work with the working group on the following issues in relation to GridPP/NGS convergence: 1. identify Institutes 2. identify manpower 3. decide who is bidding for what - a draft transition plan would be made available by the end of the year; GridPP4 requirements would also be considered. SP was waiting on the Working Group to reply to her.
A meeting had been held before Christmas re a transition plan. SP was awaiting a skeleton outline plan from RM, allocating people to sections.
366.8 AS to confirm that the Tier-1 proposes to use Tape-based storage in the period 2011 - 2015. AS noted this depended on money costs. DB advised this related to long-term plans and power capacity. Physical footprint space? Alternatives? Early action on AS required. AS had sent tech questions round the team and would forward inputs when available. DC noted to the meeting that today was the 16th Nov - only 4 weeks remained until Imperial, by which time we needed to have made extensive progress. To be discussed at the F2F on Friday.
367.2 RM to fill-in the grey boxes on DB's UK NGI diagram of a minimal NGI, as to what NGS would be doing in the areas listed.
372.1 ALL: to bring their 'top 3' high-level risks for GridPP4 to the Friday F2F. Areas of key responsibility should be considered (rather than risks across the project overall). If everyone could also provide a 'top 12' overall, but 'top 3' specifically, that would be preferred.
372.2 AS to circulate a timeline for disk server installation completion and drives replacement.
373.1 Re (4.1) EGI: JG to fix the paragraph and send TD the bits he had cut out.
373.2 Re (4.1) EGI: TD would respond by amending his section 13 accordingly.
373.3 Re (5) UK Grid: SP to change the Project Map by increasing the size of the words within the boxes, remove the numbers, the fonts generally to be increased in size. Take out the navigation links.
373.4 Re (5) UK Grid: SP to check the reference to the last OC paper, which seemed incorrect.
373.5 Re (5) Highlights of GridPP3: SP to track down the more recent OC quote regarding the success of STEP'09.
373.6 NG, PC & RJ to send comments to DB on section 5.1 - GridPP4 high-level view.
373.7 Re (6.1) RJ to provide DB with text for the ILC etc ('design studies for future linear colliders' or similar).
373.8 DB to add a contingency line to the table in section 6.3.
373.9 JG to provide an explanatory sentence for HEPSPEC for section 7.1.
373.10 DB to remove 'try to' throughout.
373.11 AS to do a scenario on electricity costs for WP-A based on new input from DB. AS/DB to iterate.
373.12 SL to re-arrange the text argument sections in 7.2.
373.13 DB to remove 'simulation' and 'local' from GridPP4 Roles table (p21), and insert instead: ATLAS Group & User Analysis, CMS Group & User Analysis, etc.
373.14 RJ/AS to harmonise statements on changes to the Tier-1 and Tier-2 sizes.
373.15 Re Tier-2 H'ware plots: SL to re-do his table according to the new numbers sent by DB, and extend to 2015. The last sentence needed work.
373.16 SL to amend the CPU and disk tables in the Tier-2 Hardware section, and an explanation of the table was required.
373.17 JC to send DB his revised draft of WP-C.
373.18 GP to re-site the summary of WP-D within the resource section.
373.19 Re WP-D: GP to keep the best case, remove the worst case, and move the contingency issue to the contingency section.
373.20 Re WP-D: GP to note 1 x FTE in the area of technical support and documentation/trouble-shooting, footnoted as 2 x 0.5 FTE which would be 'experiment-specific technical support including non-LHC experiments and assisting with technical middleware problems'.
373.21 NG/RJ to send comments to TD regarding WP-E and broadening the experiment-related data post(s).
373.22 TD to remove the detail and leave the general points on p30, providing a high-level justification rather than an objectives list.
373.23 DB to remove the arrows from the diagram in WP-F. Also remove the first sentence of the following paragraph.
373.24 ALL: to send any comments to SP on the Impact section.
373.25 TD to re-visit the Wider Grid Context section and amend tone and purpose of the section as per comments received.
373.26 Re (14) Resource Request: DB to send SL the hardware costings to let SL check the Tier-2 sum.
373.27 Re (14) Resource Request: SP/DB need to converge on the remainder of the section.
373.28 ALL: to send risks to SP by email, including mitigations, by tomorrow.
374.1 NG to work on (4. International Context) to ensure correct tone and logic of flow.
374.2 SP to rework the EGEE and wLCG percentages on p4 beneath the Project Map (5. UK Grid ..) and either relate them or remove one of them.
374.3 NG and PC to iterate on the three highlighted paragraphs in section 5.1.
374.4 DB to work on Appendix B.
374.5 AS to better phrase and/or justify highlighted paragraph in section 6.3, p9.
374.6 DB to amend the wording highlighted yellow on p10 (6.4 UK Network Requirements). It was agreed to change the wording to: 'during GridPP4' rather than specifying certain dates.
374.7 DB to sort out paragraph on service requirements (p11 - 6.5 UK Service Requirements) and ensure it is correct.
374.8 JC to send answers to the comments made about section 8.1.
374.9 DB to go over section 8.1 again (addressing & amending the 8 points) and circulate a new version.
374.10 DB to iterate with RJ regarding the added sentence by RJ on p16 (sect 7.2 WP-B Tier-2 Effort).
374.11 Re (8.3 Data Support): Re point (2.) on p23, DB to move the second half of the sentence (commencing: 'This post would ..' to '.. the resources') to the section above, which would help introduce the post descriptions.
374.12 Re: (8.3 Data Support) DB to delete tasks 1-8 on page 25; and close the list above at point No 10.
374.13 DC to re-work the paragraph on the CMS Storage Support Post (in 9 WP-D p26).
374.14 DC to remove the word 'storage' from the CMS Storage Support Post.
374.15 DB to re-work the PMB Roles table, to include the people involved with their different roles, and also change the order. Security to be added.
374.16 SP to update the impact section with a note on the posts breakdown.
374.17 TD to provide a revision of section 12 by Wednesday morning.
374.18 TD to provide no more than one page on cloud computing for the OC.
374.19 DB to send TD the hardware costings and requirements for TD to look at the figures (he noted they looked different to the hardware lines in GridPP3).
374.20 SP to separate the figures showing costs above and below the line in Table 9 (sect 13.9 Resource Request Summary).
374.21 SP to reduce the numbers in the risk tables. SP/DB to iterate on Table 11.
374.22 RM to contribute text on EU for the Project Status Doc (OC doc).
374.23 DB/JC to iterate on the deployment status document for the OC.
374.24 RJ/DC to provide a paragraph of info to GP for the User Reports.
374.25 JC to ask at the dTeam meeting if there was anyone from the Tier-2 who could contribute to OPN work (someone with dataflow experience).

INACTIVE CATEGORY
=================

359.4 JC to follow up dTeam actions from the DB, as follows:
---------------------------
05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda).
---------------------------
JC would check the situation with Graeme Stewart (who was currently on annual leave). JC followed up with Graeme and the other experiments. A test was started but this area has been deemed low priority and further progress is not expected for some time. ATLAS see no issues with contention. LHCb are not intending to pursue anything in this area. A CMS discussion has started but again it does not appear to be urgent. If the experiments are not pushing this internally then there is nothing for the deployment team to follow up! It was noted there was no priority in ATLAS at present, this will be pending for a while. Move to inactive as it is a long-term action.
---------------------

The next PMB would take place on Monday 1st February at 12:55 pm.