GridPP PMB Minutes 343 F2F (@ UCL) - 31st March 2009
====================================================

Present: Sarah Pearce (Chair), David Britton, Tony Doyle, Andrew Sansum, Dave Colling, Roger Jones, Glenn Patrick, Steve Lloyd, Tony Cass, Robin Middleton, John Gordon, Jeremy Coles, Neil Geddes, Pete Clarke - and later, Andrew Richards (Suzanne Scott - Minutes)

Apologies: David Kelsey

1. Issues for the next OC
==========================

DB reported that he was trying to establish a date for the next OC meeting; and had recently received the Minutes of the last OC. DB summarised the action points:

- backup OPN link in hand
- no action required re MoU costs
- re usage data for 'other' experiments, GP would provide this once the date of the next OC was known
- ideal & fallback situation re NGS/EGI/NGI - this was work-in-progress, RM would do a paper once the date was known
- CASTOR issues

DB summarised OC feedback:

1. work on service resilience required to increase robustness

2. focus on users required, including issues of access to the Grid, storage, training:

ACTION 343.1 GP to follow-up on the ILC experience to ensure issues are addressed

There was a discussion of the user experience and ease of access (or otherwise) via the GridPP website, including:

- monitoring & progressing help for users
- contact names on the website were needed
- what kind of likely new user would there be? - this needed to be addressed to ensure relevant instruction provision was available

ACTION 343.2 GP to discuss new users coming to GridPP with NO (possibly also SB) regarding the GridPP website and making relevant changes that would assist them

3. all contributions from the project were to be as relevant/focussed as possible to KE & EI

2. IMENSE & N-Gram enquiry
===========================

DB reported that he had received an email from Andy Parker regarding IMENSE and another company called iLexIR who were interested in being involved with GridPP.
The N-gram database required 7-8000 CPU-months, and they were asking to use the Tier-1 before the LHC turns on. DB advised that the publicity would be good, as this would rival Google (where the largest N-gram corpus is currently held). DB suggested that GridPP accept the proposal - it was one way of stress-testing the infrastructure. It would also count towards Tier-2 accounting and there could be a pilot phase. AS noted no benefit to the Tier-1.

ACTION 343.3 Re IMENSE & the iLexIR N-gram enquiry, DB to draft a response to Andy Parker supporting the request.

ACTION 343.4 JC to raise the N-gram issue at dTeam and update the VO ID card.

3. GridPP4
===========

DB advised that we needed to think about GridPP4. DB presented the GridPP3 finances and summarised the situation with Tier-1 hardware, Tier-2 hardware, Tier-1 and Tier-2 staffing levels, operations issues, management issues, travel costs.

DB presented the EGEE III External Advisory Committee report (Spring '09) re cloud computing requirements. There was a discussion of pros and cons.

RM asked about the timeline for GridPP4. DB advised that we couldn't do GridPP4 on the same timescale as GridPP3 (18 months overall from submission to award) - we needed a much shorter timeframe. STFC needed to let us know, as the process could take 12-18 months. We needed to start the process in September of this year, but we needed to know some guidelines in order to have a preliminary draft by Christmas.

DB emphasised the following issues:

- GridPP needed to be driven by the needs of the LHC experiments as primary users
- future vision needed to be well-planned
- community support was essential in recognising experiment/institute/user roles

4. Tier-2 Hardware
===================

DB reported that Janet Seed had sent an email regarding the 2nd tranche of Tier-2 HW money and the optimal timing of the awards from our point of view.

ACTION 343.5 SP to contact Tier-2 P.I.'s to discuss Tier-2 grant timing.

5. Tier-1 Recruitment
======================

AS had circulated a document showing where staff funding lines had come from. AS summarised the current stage of recruitment, which was ongoing for:

- 2 x 0.5FTE experiment support posts in PPD
- 1 x 1FTE OC-approved effort for CASTOR dbase support
- 1 x 1FTE EGEE-funded new post for PPS work (50% grid 50% CASTOR)
- ESC-funded year industry student

SP noted an underspend in 07-08 & 08-09 - we needed an outlook for the whole period from now until the end of GridPP3 - there may be headroom for extra staff over and above this. The original plan was for 17 staff; this will go to a peak of 19 in the hope of sustaining 18.

6. Disaster Planning
=====================

AS gave a presentation on previous plans, current status, future direction. AS noted the following:

- the initial attempt to complete the contingency plans was generally unsuccessful
- advice had been given from site experts
- the greatest failures experienced had evolved slowly and were rooted in project management
- a new strategy was outlined
- how to identify potential disasters

RJ noted that 'disaster' meant different things for different services. AS continued:

- the plan for an immediate response to an incident remained unchanged
- anyone could decide that something warranted consideration as having potential to be 'a disaster'
- for escalating the response there were stages 1-4

AS outlined inner workings & general process, including nomination of the Disaster Controller, and regular review meetings. AS noted that immediate experience had been reviewed, and the system worked well up to level 2. AS then reviewed the Tier-1 contingency plans, including trigger levels & escalation paths. It was noted that experiment contingency plans still needed to be understood. There was always interaction with 'externals' to be taken account of when a disaster was being worked on and resolved. AS concluded with a review of current status, and requirements.
DB noted that we had been asking for some time for disaster planning at the Tier-1 and he thanked AS for his comprehensive review, pointing out however that this was a continuous process to be developed, to include running every instance through the scheme in the hope that we didn't get to level 4. DB noted substantial progress thus far. It was understood that experiment plans were not known, but good progress could be made without full knowledge of their plans.

7. GridPP-NGS Taskforce
========================

Andy Richards presented on this issue, advising that a Working Group had been set up to explore 'things in common' and approaches to NGI - this included:

- EGI/NGI structure (timescale being pushed by EU)
- GridPP & NGS funding both end in 2011
- the sharing of common expertise and identifying sensible overlaps
- GridPP & NGS project names should/could continue, defining the respective communities

AR continued by presenting the tasks of an NGI, and provided a list of 'common' tasks. It was noted that many current GridPP work areas could be taken over by NGS in his model. AR advised on the plans from 2009-11 relating to the CA, Helpdesk, Ops Teams, VOMS, EGEE CE.

DB advised that the CA TAG had not yet met despite considerable pressure over the past months. DB also noted that a response from them on the issue of R89 was required.

AR noted that the following areas needed to be included in 2009-11 planning: file sharing, monitoring, security co-ordination, GridPP & NGS future and joint funding/complementary funding. There followed discussion about operational security as one area that had a strong overlap. AS noted that he had a review meeting coming up with DK - this would need discussion re widening Mingchao Ma's role to include NGS.

AR then outlined the overall discussion required: milestone plans, how to co-ordinate, a potential aligned funding proposal in the next 2 years.
RM noted there were two timescales: funding by 2011 but an NGI by 2010 - we have to do something within one year. DB noted that the issues were services and manpower. AR's 'tasks of the EGI' slide was discussed. DB noted that a common helpdesk was possible, and VOMS was not difficult to include. There was a discussion of bidding and funding (JISC, EPSRC, STFC). SP noted that some of the items under NGS in AR's table could be allocated to an NGI instead. AR provided a list of expertise and where it was currently located.

ACTION 343.6 JC/JG to ensure that the joint taskforce meet and consider issues: work was clearly required on the table presented by AR - more columns needed to be added: NGS approach, GridPP approach, an outline map of funding, service overlaps.

DB advised that unless STFC say otherwise, we will be working on GridPP4 before Christmas. RM noted that we need to be operational in NGI within the next year. DB suggested that this needs to be a Standing Item for the next two F2F meetings - a possible meeting at the end of May prior to the next OC. This was agreed.

8. Quarterly Reports
=====================

SP reported that there were still issues outstanding - experiment red metrics for ATLAS, LHCb and other experiments. There was a discussion of availability and service degradation; it was noted that ATLAS still have the 'Big ID' problems.

ACTION 343.7 AS to look into the CASTOR/ATLAS instance availability (in the light of the experiment red metrics).

SP reported that there had been LHCb problems, and other experiments had had CPU efficiency difficulties. DC confirmed that CMS had lost files this Quarter. Re Data & Storage, SP cited orange metrics in relation to space tokens.

SP noted that TD still had to report-back on the milestone overdue relating to storage usage per user in a VO. TD noted this issue was to do with the manual process involved, and APEL accounting not being standardised.
It was understood that sites had to enable this themselves in order to do user accounting.

ACTION 343.8 TD to follow-up overdue milestone in relation to storage usage per user in a VO.

AS reported that R89 had been accepted last Friday, with an outstanding issue of airflow, but accepted within the parameters advised by the consultants and detailed in the Tender.

Red metrics for the Tier-2 were discussed, as were red metrics for management.

9. NGI/EGI Milestones & Metrics
================================

RM presented on NGI Deliverables/Milestones; areas to monitor; NGI formation & the NGS. The issues covered as a work-in-progress were as follows:

- EGEE transition to EGI and the impact on GridPP
- GridPP requirements for EGI should be transparent to users
- EGI transition planning document to focus on GridPP aspects
- GridPP roadmap & sustainability through LHC
- GridPP MoU with a UK NGI

10. AOB
========

1. RM advised of travel budget comparisons from 2002-2009.

2. SP asked about the call for summer students. RM advised that this looked manageable in the context of the travel budget. Neasan O'Neill will put out a call fairly soon. TD noted that this involved a focus on dissemination and reminded that there had to be a report at the end of the period. NO will deal with this within the next few weeks (probably after Easter). DB would refer Akram to the call and ensure his case fits the criteria.

ACTION 343.9 DB to contact Akram re the summer student call.

3. LHC OPN resilience costs: PC and Robin Tasker had circulated a report on costs for the resilient link for the OPN - they had recommended a 4GB backup provision between the Tier-1 and CERN. The per annum installation cost would be £51,726. Recurrent costs could be between £41,974 and £62,000. DB advised that the report would be presented at the next OC. RT had noted that a genuine risk was hard to assess. The PMB thanked Robin Tasker for his valued contribution to this issue.

ACTIONS AS AT 31.03.09
======================

332.1 AS to provide a plan for the tape drives: The experiments would be producing new numbers by ~end of March which would enable this action to progress.
332.3 PC to pursue the issue of the network resilient link - providing installation costs and annual costs, and report-back to the PMB. PC had sent info round, and Robin Tasker would provide further info at GridPP 22 UCL. The issue would need to be referred back to the OC.
336.1 JC to document procedure for ensuring black-listed sites are re-instated. JC noted that experiments have different ways of 'blacklisting' (re top level results & switches etc).
339.8 JC to follow-up VO Registration cards. JC reported that some VOs need to be decommissioned, there were also VOs at institute-level but these do not appear on the CIC operations portal.
341.1 AS to review resilience of services that may have to remain in the ATLAS building.
341.3 NG to consult/inform GRIDPP PMB on responses to EU questionnaires.
341.4 JG to establish new date for the wLCG-ORACLE meeting.
341.5 JC to investigate how EGEE VOs request resources (relates to enabling more VOs in the UK).
341.7 SP to chase non-functioning of GridMon with Robin Tasker and report back.
342.1 DB to circulate a draft letter from GridPP to Richard Wade in relation to the pending 2nd experiment support post at RAL [done following the meeting].
342.2 SP to add-in a morning discussion time about GridPP4 to the F2F meeting; also to check if there were any issues arising from the Quarterly Reports that needed to be discussed.
343.1 GP to follow-up on the ILC experience to ensure issues are addressed.
343.2 GP to discuss new users coming to GridPP with NO (possibly also SB) regarding the GridPP website and making relevant changes that would assist them.
343.3 Re IMENSE & the iLexIR N-gram enquiry, DB to draft a response to Andy Parker supporting the request to use the Tier-1 before the LHC turns on.
343.4 JC to raise the N-gram issue at dTeam and update the VO ID card.
343.5 SP to contact Tier-2 P.I.'s to discuss Tier-2 grant timing.
343.6 JC/JG to ensure that the joint taskforce meet and consider issues: work was clearly required on the table presented by AR - more columns needed to be added: NGS approach, GridPP approach, an outline map of funding, service overlaps.
343.7 AS to look into the CASTOR/ATLAS instance availability (in the light of the experiment red metrics).
343.8 TD to follow-up overdue milestone in relation to storage usage per user in a VO.
343.9 DB to contact Akram re the summer student call.

The next F2F meeting would take place on 4th June 2009 at Imperial.