GridPP PMB Minutes 375 (01.02.10)
=================================

Present: David Britton (Chair), Sarah Pearce (remote), Andrew Sansum, Tony Doyle, Jeremy Coles, Pete Clarke, Steve Lloyd, Tony Cass (remote), Glenn Patrick, Roger Jones, Dave Colling, Neil Geddes (Suzanne Scott, Minutes)

Apologies: David Kelsey, Robin Middleton, John Gordon

1. RAL situation
=================

AS reported that last Thursday there had been downtime in order to move from the miscellaneous hardware and back onto the EMC (not powered by the UPS). This would enable them to re-acquire resilience and be on more powerful hardware. On Thursday the CASTOR migration had a problem, then one of the databases went down as 'unrunnable'. They had to wait for advice from Oracle, but it was brought up within an hour; then, on reboot of one of the nodes in the Pluto rack, the rest of the database crashed. By Friday they hadn't found the root cause; by Saturday they had a stable storage network but there were unexplainable issues in the RAID arrays, causing Oracle to crash. CASTOR had been tested and is fine. AS had circulated an informal summary. AS noted that the underlying cause had not yet been identified and an early solution was unlikely.

AS reported that the consensus reached was that it was better to accept a known problem and then bring the service up as promptly as possible, despite the peculiar behaviour of the hardware, rather than spend time working out what went wrong. DB asked if it was related at all to 2.1.7. AS noted no, not CASTOR; this problem seemed to be related to the interaction between Oracle and the multi-fibre channel paths/multiple switches. TD suggested that the Oracle header being re-written was a new factor introduced, a possible change that had affected things. TD asked to clarify whether the recommendation was to wait and continue with the original plan. AS advised that the recommendation was to restart with a cut-down, simplified configuration that had been tested.
Then, in a measured way, they would review and understand why the implementation hadn't worked. DB asked whether the simplified version was as good as that which had worked on the overland kit. AS noted yes, with slightly better resilience; everything would be runnable on it. AS noted that it had run ok over the weekend and had been tested - they were doing some last checks at the moment. TC asked how much help was coming from CERN. AS reported he had asked on Friday but they hadn't been able to offer any suggestions. DB noted that the risk of the current proposal was that the EMC with a simple configuration had shown strange behaviour, and asked how it would be possible to resolve this if they were running on the kit they needed to investigate. AS advised that they could move onto the LFC FTS hardware - but it was best to accept where we were now and double-check everything - better to stay in a known place rather than introduce new errors.

DB concluded that the PMB couldn't second-guess the risks etc, but wanted to register our deep concern at the situation. It was agreed to proceed according to AS's recommendation, but DB advised that if they don't come out of downtime as planned on Wednesday then this should become a Level 3 in the Disaster Plan. AS would provide more info tomorrow (Tuesday).

2. GridPP4 Proposal - next steps
=================================

DB advised that he had looked at the proposal again today and it seemed in reasonable shape. He re-iterated his thanks to all concerned for much hard work in a short space of time.

DB noted the timeline:
- input was required for v10 before Feb 12th
- after Feb 12th DB was away
- SL would take the token on Fri 12th for one week and get inputs. PC, NG, and SP agreed to spend some time reviewing the whole proposal; other PMB members were encouraged to review at least the sections directly related to their roles.
- DB would take the token back on Feb 22nd

Inputs were needed as follows:

Appendix A - post descriptions for the experiment-specific posts were required - GP to provide.

ACTION 375.1 GP to provide post descriptions for experiment-specific posts in Appendix A.

Post descriptions were also required for the Tier-2 posts - these should be as unique as possible in order to present a strong case. DB to co-ordinate.

ACTION 375.2 DB to co-ordinate post descriptions for the Tier-2 posts, which should be as unique as possible in order to present a strong case.

Other post descriptions required:

ACTION 375.3 TD to do the data posts.
375.4 PMB ALL: those relevant to do their own post descriptors.
375.5 DB to do the Admin Asst post.
375.6 SP to do the Impact post.

It was noted that inputs were required by this Friday 5th, or Monday 8th latest.

DB noted that another area requiring work was risk assessment. This was pencilled in for next week's PMB.

Re hardware planning, DB asked whether the new, likely LHC schedule might change the hardware figures? It was expected that running would continue in 2010-11, with a year off in 2012. DC noted he was still iterating with Ian, and will have an update in the next few days, plus the new schedule. DC did not anticipate any huge changes. DB noted that a paper was probably required to explain the hardware figures.

It was agreed that SL would put a copy of the GridPP4 proposal (v9.3) with the other CB documents.

ACTION 375.7 SL to put v9.3 of the GridPP4 proposal with the other CB documents.

3. Weekly Notes
================

EGI
---

It was reported that the ROSCOE SSC bid may not have been successful, similarly the CUE proposal, however confirmation of this was awaited. It was noted that this would have an impact on GridPP in that 2 x 0.5FTE posts for Ganga would not be funded.

OC prep
-------

DB advised that SP should prepare a presentation on the GridPP3 project status. DB would do a talk on GridPP4.
On GridPP4 they would be raising contingency/working allowance and risk, in order to get guidance. Re the RMR issue - was there a valid work breakdown structure? Other information required was on risks, the schedule, costs, management. The two areas currently missing were the work breakdown and the schedule. It was understood that STFC needed to be able to tick their boxes. A blank copy of the Project Map might suffice. SP would attend to this, and would iterate with DB on Wednesday prior to the OC meeting.

ACTION 375.8 SP to organise a blank copy of the Project Map, and iterate with DB on the work breakdown and schedule.

AOB
===

Travel
------

It was noted that there was a new system for booking travel via Key Travel, which was proving to be both cumbersome and more expensive than previously. It was decided to stop using this system at Glasgow; and for GridPP, people should be aware that it is costing 10% more to book in this way, and alternative arrangements for organising travel should be found.

EGI
---

NG reported that there would be no information on the EU proposal until March.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS reported as follows:

Fabric:

1) The disk drives on our problematic lot of disk servers were replaced over late December and early January. Acceptance testing is going well - the first few servers have completed testing and are moving towards being deployed. The remaining servers will come through the system by 17th February.

2) FY09 procurements:
- Delivery of the disk servers is scheduled for mid February with a second tranche from one supplier on March 4th.
- CPU deliveries are scheduled for mid February (delivery from one supplier on time looking increasingly problematic).

3) Corrective work on the UPS room supply was carried out. A 50% improvement was achieved. We are waiting for the formal assessment of how to proceed.
4) We have an anomalously high drive eject rate from one "tranche" of the FY06 procurement. This high eject rate probably accounts for all the filesystem failures experienced. The situation is not helped by the fact that these servers are only RAID 5 and are vulnerable to concurrent 2 drive failures. It is likely that we will need to phase these servers (about 250TB) out of operation very soon, well ahead of their planned 4 year life. We are assessing the situation and will make detailed proposals shortly.

Service:

1) SAM test availability for the ops VO was 0%; this was owing to a scheduled downtime being followed by unscheduled problems restarting the database service for CASTOR after problems were experienced with the Storage Area Network.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ reported that things were quiet, some work was ongoing, RAL was currently down.

SI-3 CMS weekly review & plans
-------------------------------

DC reported that reliability at the Tier-2 was good except at Imperial, which lost a disk server with data.

SI-4 LHCb weekly review & plans
--------------------------------

GP reported that things were quiet generally, RAL was currently down.

SI-5 Production Manager's Report
---------------------------------

JC reported as follows:

1) There were some problems with the APEL database last week. We will check publishing status/records for gaps.

2) The DPM developers have asked for input on future developments. The storage group is collating the GridPP response.

3) The GridPP VOMS machines have been upgraded. A memory leak with VOMS has been reported: https://savannah.cern.ch/bugs/?60394. Meanwhile the server certificate on voms.gridpp.ac.uk is due to expire on the 11th of February. The new certificate will be installed on the 1st of February between 8 and 8.30 am UTC.
Links to the new certificate (pem / rpm / yum) can be found on http://www.gridpp.ac.uk/wiki/Instruction_for_VO_administrators#Current_certificates

4) The December Tier-2 reliability and availability figures are now available: https://twiki.cern.ch/twiki/bin/viewfile/LCG/SamMbReports?filename=Tier2_Reliab_200912.pdf
The GridPP results (reliability:availability) are very encouraging: LondonGrid (94%:91%); NorthGrid (98%:98%); ScotGrid (98%:98%) and SouthGrid (98%:97%).

5) Last week we were asked for a Tier-2 representative for OPN discussions. Brian was suggested and nobody else was put forward at the deployment team meeting. Brian has agreed to take on this role - we are finding out more about the context for these discussions.

SI-6 LCG Management Board Report
---------------------------------

DB reported that there had been a weekly ops report; an agenda was available for an upcoming meeting; glexec and multi-user-pilot jobs had been discussed.

SI-7 Dissemination Report
--------------------------

SP reported on an event at CERN for 7 TeV collisions on a Monday early in March, possibly 8th. STFC would be holding a similar event in the UK. There would be a Tier-1 Open Day on 30th - DB was doing a talk on behalf of GridPP.

AOB
===

Re Cloud Computing, TD reported that Southampton had been doing work on this, as apparently had St Andrews University. Discussions would be taking place with Will Venters.

The OC meeting would be taking place on Thursday 4th February.

The next PMB would take place on Monday 8th February. DB would be away on Monday 15th February - it was hoped that JG would Chair.

REVIEW OF ACTIONS
=================

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring. JC reported that the consultation happened. There were a few suggestions in the deployment team about how to progress in this area. It needs writing up and an implementation plan.
Ongoing.
358.1 SP to work with the working group on the following issues in relation to GridPP/NGS convergence: 1. identify Institutes 2. identify manpower 3. decide who is bidding for what - a draft transition plan would be made available by the end of the year; GridPP4 requirements would also be considered. SP was waiting on the Working Group to reply to her. A meeting had been held before Christmas re a transition plan. SP was awaiting a skeleton outline plan from RM, allocating people to sections. This action to be re-allocated to RM.
Done for SP - action closed.

366.8 AS to confirm that the Tier-1 proposes to use Tape-based storage in the period 2011 - 2015. AS noted this depended on money costs. DB advised this related to long-term plans and power capacity. Physical footprint space? Alternatives? Early action on AS required. AS had sent tech questions round the team and would forward inputs when available. DC noted to the meeting that today was the 16th Nov - only 4 weeks remained until Imperial, by which time we needed to have made extensive progress. To be discussed at the F2F on Friday. AS noted that alternative further costings were required. AS to progress.
Ongoing.

367.2 RM to fill-in the grey boxes on DB's UK NGI diagram of a minimal NGI, as to what NGS would be doing in the areas listed.

372.1 ALL: to bring their 'top 3' high-level risks for GridPP4 to the Friday F2F. Areas of key responsibility should be considered (rather than risks across the project overall). If everyone could also provide a 'top 12' overall, but 'top 3' specifically, that would be preferred.
Done, item closed.

372.2 AS to circulate a timeline for disk server installation completion and drives replacement.
Done, item closed.

373.1 Re (4.1) EGI: JG to fix the paragraph and send TD the bits he had cut out.
Done, item closed.

373.2 Re (4.1) EGI: TD would respond by amending his section 13 accordingly.
Done, item closed.
373.3 Re (5) UK Grid: SP to change the Project Map by increasing the size of the words within the boxes, remove the numbers, the fonts generally to be increased in size. Take out the navigation links.
Done, item closed.

373.4 Re (5) UK Grid: SP to check the reference to the last OC paper, which seemed incorrect.
Done, item closed.

373.5 Re (5) Highlights of GridPP3: SP to track down the more recent OC quote regarding the success of STEP'09.
Done, item closed.

373.6 NG, PC & RJ to send comments to DB on section 5.1 - GridPP4 high-level view.
Done, item closed.

373.7 Re (6.1) RJ to provide DB with text for the ILC etc ('design studies for future linear colliders' or similar).
Done, item closed.

373.8 DB to add a contingency line to the table in section 6.3.
Done, item closed.

373.9 JG to provide an explanatory sentence for HEPSPEC for section 7.1.
Done, item closed.

373.10 DB to remove 'try to' throughout.
Done, item closed.

373.11 AS to do a scenario on electricity costs for WP-A based on new input from DB. AS/DB to iterate.
Done, item closed.

373.12 SL to re-arrange the text argument sections in 7.2.
Done, item closed.

373.13 DB to remove 'simulation' and 'local' from GridPP4 Roles table (p21), and insert instead: ATLAS Group & User Analysis, CMS Group & User Analysis, etc.
Done, item closed.

373.14 RJ/AS to harmonise statements on changes to the Tier-1 and Tier-2 sizes.
Done, item closed.

373.15 Re Tier-2 H'ware plots: SL to re-do his table according to the new numbers sent by DB, and extend to 2015. The last sentence needed work.
Done, item closed.

373.16 SL to amend the CPU and disk tables in the Tier-2 Hardware section, and an explanation of the table was required.
Done, item closed.

373.17 JC to send DB his revised draft of WP-C.
Done, item closed.

373.18 GP to re-site the summary of WP-D within the resource section.
Done, item closed.

373.19 Re WP-D: GP to keep the best case, remove the worst case, and move the contingency issue to the contingency section.
Done, item closed.

373.20 Re WP-D: GP to note 1 x FTE in the area of technical support and documentation/trouble-shooting, footnoted as 2 x 0.5 FTE which would be 'experiment-specific technical support including non-LHC experiments and assisting with technical middleware problems'.
Done, item closed.

373.21 NG/RJ to send comments to TD regarding WP-E and broadening the experiment-related data post(s).
Done, item closed.

373.22 TD to remove the detail and leave the general points on p30, providing a high-level justification rather than an objectives list.
Done, item closed.

373.23 DB to remove the arrows from the diagram in WP-F. Also remove the first sentence of the following paragraph.
Done, item closed.

373.24 ALL: to send any comments to SP on the Impact section.
Done, item closed.

373.25 TD to re-visit the Wider Grid Context section and amend tone and purpose of the section as per comments received.
Done, item closed.

373.26 Re (14) Resource Request: DB to send SL the hardware costings to let SL check the Tier-2 sum.
Done, item closed.

373.27 Re (14) Resource Request: SP/DB need to converge on the remainder of the section.
Done, item closed.

373.28 ALL: to send risks to SP by email, including mitigations, by tomorrow.
Done, item closed.

374.1 NG to work on (4. International Context) to ensure correct tone and logic of flow.
Done, item closed.

374.2 SP to rework the EGEE and wLCG percentages on p4 beneath the Project Map (5. UK Grid ..) and either relate them or remove one of them.
Done, item closed.

374.3 NG and PC to iterate on the three highlighted paragraphs in section 5.1.
Done, item closed.

374.4 DB to work on Appendix B.
Done, item closed.

374.5 AS to better phrase and/or justify highlighted paragraph in section 6.3, p9.
Done, item closed.

374.6 DB to amend the wording highlighted yellow on p10 (6.4 UK Network Requirements). It was agreed to change the wording to: 'during GridPP4' rather than specifying certain dates.
Done, item closed.
374.7 DB to sort out paragraph on service requirements (p11 - 6.5 UK Service Requirements) and ensure it is correct.
Done, item closed.

374.8 JC to send answers to the comments made about section 8.1.
Done, item closed.

374.9 DB to go over section 8.1 again (addressing & amending the 8 points) and would circulate a new version.
Done, item closed.

374.10 DB to iterate with RJ regarding the added sentence by RJ on p16 (sect 7.2 WP-B Tier-2 Effort).
Done, item closed.

374.11 Re (8.3 Data Support): Re point (2.) on p23, it was agreed DB to move the second half of the sentence (commencing: 'This post would ..' to '.. the resources') to the section above, which would help introduce the post descriptions.
DB would check this - ongoing.

374.12 Re: (8.3 Data Support) DB to delete tasks 1-8 on page 25; and close the list above at point No 10.
Done, item closed.

374.13 DC to re-work the paragraph on the CMS Storage Support Post (in 9 WP-D p26).
Done, item closed.

374.14 DC to remove the word 'storage' from the CMS Storage Support Post.
Done, item closed.

374.15 DB to re-work the PMB Roles table, to include the people involved with their different roles, and also change the order. Security to be added.
Done, item closed.

374.16 SP to update the impact section with a note on the posts breakdown.
Done, item closed.

374.17 TD to provide a revision of section 12 by Wednesday morning.
Done, item closed.

374.18 TD to provide no more than one page on cloud computing for the OC.
Done, item closed.

374.19 DB to send TD the hardware costings and requirements for TD to look at the figures (he noted they looked different to the hardware lines in GridPP3).
Done, item closed.

374.20 SP to separate the figures showing costs above and below the line in Table 9 (sect 13.9 Resource Request Summary).
Done, item closed.

374.21 SP to reduce the numbers in the risk tables. SP/DB to iterate on Table 11.
Done, item closed.

374.22 RM to contribute text on EU for the Project Status Doc (OC doc).
Done, item closed.

374.23 DB/JC to iterate on the deployment status document for the OC.
Done, item closed.

374.24 RJ/DC to provide a paragraph of info to GP for the User Reports.
Done, item closed.

374.25 JC to ask at the dTeam meeting if there was anyone from the Tier-2 who could contribute to OPN work (someone with dataflow experience). Brian had agreed to do this.
Done, item closed.

ACTIONS AS AT 01.02.10
======================

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring. JC reported that the consultation happened. There were a few suggestions in the deployment team about how to progress in this area. It needs writing up and an implementation plan.

366.8 AS to confirm that the Tier-1 proposes to use Tape-based storage in the period 2011 - 2015. AS noted this depended on money costs. DB advised this related to long-term plans and power capacity. Physical footprint space? Alternatives? Early action on AS required. AS had sent tech questions round the team and would forward inputs when available. DC noted to the meeting that today was the 16th Nov - only 4 weeks remained until Imperial, by which time we needed to have made extensive progress. To be discussed at the F2F on Friday. AS noted that alternative further costings were required. AS to progress.

367.2 RM to fill-in the grey boxes on DB's UK NGI diagram of a minimal NGI, as to what NGS would be doing in the areas listed. Ongoing.

374.11 Re (8.3 Data Support): Re point (2.) on p23, it was agreed DB to move the second half of the sentence (commencing: 'This post would ..' to '.. the resources') to the section above, which would help introduce the post descriptions. DB would check this.

375.1 GP to provide post descriptions for experiment-specific posts in Appendix A.

375.2 DB to co-ordinate post descriptions for the Tier-2 posts, which should be as unique as possible in order to present a strong case.
375.3 TD to do the data posts.

375.4 PMB ALL: those relevant to do their own post descriptors.

375.5 DB to do the Admin Asst post.

375.6 SP to do the Impact post.

375.7 SL to put v9.3 of the GridPP4 proposal with the other CB documents.

375.8 SP to organise a blank copy of the Project Map, and iterate with DB on the work breakdown and schedule.

375.9 RM to provide a skeleton outline plan, including post details, of GridPP/NGS convergence. [Previous action background: SP to work with the working group on the following issues in relation to GridPP/NGS convergence: 1. identify Institutes 2. identify manpower 3. decide who is bidding for what - a draft transition plan would be made available by the end of the year; GridPP4 requirements would also be considered. SP was waiting on the Working Group to reply to her. A meeting had been held before Christmas re a transition plan. SP was awaiting a skeleton outline plan from RM, allocating people to sections. This action to be re-allocated to RM. Done for SP - action closed.]

INACTIVE CATEGORY
=================

359.4 JC to follow up dTeam actions from the DB, as follows:
---------------------------
05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda).
---------------------------
JC would check the situation with Graeme Stewart (who was currently on annual leave). JC followed up with Graeme and the other experiments. A test was started but this area has been deemed low priority and further progress is not expected for some time. ATLAS see no issues with contention. LHCb are not intending to pursue anything in this area. A CMS discussion has started but again it does not appear to be urgent. If the experiments are not pushing this internally then there is nothing for the deployment team to follow up! It was noted there was no priority in ATLAS at present, this will be pending for a while. Move to inactive as it is a long-term action.
---------------------