GridPP PMB Minutes 340 - 9th March 2009
=======================================

Present: David Britton (Chair), Sarah Pearce, Jeremy Coles, Steve Lloyd, Andrew Sansum, Pete Clarke, Robin Middleton, John Gordon, Dave Colling, Roger Jones, Neil Geddes (Suzanne Scott - Minutes)

Apologies: Tony Doyle, David Kelsey, Tony Cass, Glenn Patrick

3. EGEE/EGI/NGI/NGS/GRIDPP
===========================

NG had circulated an email from Dieter Kranzlmueller re the names for the EGEE/EGI working groups. Following the meeting, NG notified that the deadline for nominations was MARCH 12.

DB advised that things were moving quickly regarding developments in Europe in relation to NGI. Issues had arisen from the notes of the EGEE PMB meeting, which NG had circulated. We were being asked to respond as a UK NGI and we needed to formalise this.

The first issue related to the gLite Consortium - were we members? Were they sub-contracted? JG noted that a partner can be a person, not just an organisation. DB advised that it needed UK co-ordination. NG advised that the steer was that people who were actively involved in developing gLite now, were the ones who should be approached. After some discussion, it was agreed that APEL and GridSite were covered ok for the moment.

The second issue related to the 'no cost' extension of EGEEIII and the issue of the membership fee. Discussion needed to take place. NG advised that JISC might pick-up the latter. DB noted that STFC was possibly not the right funding agency, but JISC and EPSRC needed to be kept informed. NG reported that he was keeping them informed but there was no 'ownership' as yet. DB noted the timescales were challenging and the process needed to be driven. NG recommended that a meeting was needed with the various people, to discuss moving forward - it might be possible for STFC to share the risk in the first year to cover the Letter of Intent and the £70k required.

DB suggested that we need to set-up a shadow or proto-MB for an NGI in the UK to take these issues forward in the first instance. NG suggested that the existing JRU could be extended. JG reported that the JRU was not just about EGEE, it had a wider remit. DB noted that the JRU didn't make decisions. RM reported that they meet before each EGEE PMB, but can meet anytime and take decisions. DB asked if the JRU was appropriate, what was the representation in relation to decisions? RM advised that reps from a number of the Universities were on it, those who were being directly funded. RJ noted that this was historical though. DB agreed, emphasising that we need to think about future progression. DB also noted that the JRU was institute-based, therefore not appropriate for representing an NGI.

RJ suggested putting an ad hoc committee in place. DB agreed that something new needed to be set-up, as it stood the JRU was not sufficient - something would need to be constituted very quickly in order to drive a decision-making process. JG agreed that we needed to act quickly - the UK needed to be in the room when decisions were being taken and if we took too long to work out the UK representation, then we would miss the boat. DB emphasised that a meeting was required, an email list, appropriate representation identified, and very soon. PC agreed, and noted various likely candidates to be involved in the interim, possibly one year. NG noted he had not contacted them yet.

DB advised that it was up to NG and RM to push this forward - either the setting-up of a committee or a proto-NGI within JRU. NG agreed to draft an email to EPSRC and JISC to suggest the need for setting-up some kind of MB. Two sets of people were required - high-level and those who could progress decisions at lower level - to make up a shadow MB. NG noted that his email had contained a spreadsheet showing the requested activities and suggestions for involvement by individuals with appropriate expertise.
A transition plan was required and groups would co-ordinate activities. DB asked who would co-ordinate above this for the UK overall? RM advised that a JRU meeting was due to take place next week via AccessGrid on Monday @ 9.00 am. DB emphasised that we would need to continue to discuss this, and asked that opinions/suggestions be sent to him by email in order to stimulate discussion. DB would attend next week's JRU. NG would take names/suggestions from everyone.

ACTION 340.1 ALL: to give inputs to DB regarding a proto-MB; to send suggestions for participation in JRU to NG.

1. R89 Status
==============

AS had circulated an email giving an update to status. No estimate was available as to when the room would be finished. Problems with the air conditioning were ongoing. AS advised that 15th May was the latest possible date for the move - beyond that they would not be able to move the Tier-1 and be confident about data-taking - they would need, at that point, to stay in the ATLAS centre. After that time it would be difficult to move the existing hardware at all as no other suitable window appeared to be available. JG observed that the non Tier-1 kit could go to R89 once it was ready, leaving more headroom for the Tier-1 kit. There was a discussion about delivery of kit and longevity, also staff implications. AS noted that running the machine room 'remotely' would be inconvenient but not impossible. Delivery schedules were not yet available.

2. New LHC schedule
====================

DB asked RJ if the experiment requirements were now available? RJ noted not yet - he had received info which was inconsistent with the LHC schedule and was waiting on confirmation. DB advised that we needed to think about the 2010 hardware procurement. JG observed that the WLCG view was 'no change'. DC confirmed he would advise DB as soon as he knew about the CMS requirements.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's report
-----------------------------

AS reported as follows:

Fabric:

1) R89 Machine Room. The building now has a fire certificate and has passed other building control milestones. Work continues to understand the air conditioning problem. If R89 is not available by 15 May 2009 (possibly 2 weeks earlier) then the Tier-1 will be unable to carry out the migration this year. We are reviewing the impact of remaining in the ATLAS centre this year but expect that the situation is manageable (but undesirable).
2) Migration to R89. Planned for ~2 weeks commencing 22nd June (provided R89 is available). Possible alternative plan B date for 1 week migration of critical components only - commencing 6th July.
3) Disk and robotics deliveries are pending on R89 availability. CPU deliveries will soon be in the same situation. We plan to guarantee delivery of disk, CPU and robotics into either R89 or ATLAS for delivery no later than 15 June 2009.
4) Remaining items on spend plan are progressing well.
5) An LHCB disk server failed on 2nd March. Filesystem failed fsck following reboot. LHCB files were declared lost on 3rd March. It may be that a problem (fsprobe detected problem but no RAID controller errors) thought resolved satisfactorily in November may have had ongoing consequences. Original cause of problems may have been that older 3Ware firmware was not detecting all disk failure modes satisfactorily. Investigations continue.

Staff:

Summary of staffing position:

1) Recruitments outstanding to reach original GRIDPP plan of 17 FTE
   a) 3rd team member of production team has accepted.
   b) Experiment Support posts (50% funded by T1 and 50% by experiments). Here covering T1 effort only. Recently interviewed.
      i) 0.5 FTE. Just about to make one formal offer subject to work permit. April-May start date? Negotiations continue.
      ii) Did not offer on second position, but are now in informal discussion with likely applicant.
          Have yet to decide how we can officially restart the recruitment (pending closure of first).
3) GRIDPP agreed that we could raise average booking to 18 FTE. We are close to being able to commence recruitment.

Service:

1) SAM availability last week was 100%.
2) CASTOR
   a) We continue to chase the big ID problem and have sent some dumps to Oracle (but need to get further debug info). This problem is impacting availability for ATLAS (it is also slightly impacting LHCB).
   b) A major hardware upgrade will be needed on the CASTOR core database hardware. We will be testing migration rates in order to help us to assess the length of the planned downtime. The task is now substantially less as the ATLAS database has shrunk substantially following cleanup.
   c) CASTOR upgraded to 2.1.7-24, SRM 2.7-15. The upgrade went well, except for the Gen instance which overran after problems with the configuration prevented restart.
3) An upgrade to FTS 2.1 was completed smoothly.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ had left the meeting.

SI-3 CMS weekly review & plans
-------------------------------

DC reported that CMS was quiet at present; there had been problems at Imperial; nothing major to report.

SI-4 LHCb weekly review & plans
--------------------------------

In absentia, GP provided the following report:

LHCB status is as follows:

1. Problem with all of CASTOR at CERN - a transparent intervention was not.
   https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090304
2. Problem with CASTOR at RAL - hanging FTS transfers again. Similar problems seen at CERN also. Shaun is deploying an automatic script to find and clear these transfers.
3. Lost diskserver at RAL. 25K DST files have been marked as lost within LHCb and all the recoverable files have been recovered from other sites.

FEST week:

1. RAL performed quite well. One lost reconstruction job due to unknown reasons. The job ran fine when resubmitted. Other jobs were fine.
2. These reconstruction jobs also accessed and ran off the Conditions DB/3D successfully. Need to investigate the robustness of this system next (Oracle backend. How many simultaneous jobs can it handle?).
3. NIKHEF (COOL/Firewall) and IN2P3 (dCache) had problems.

Outlook:

1. Over the last month, user jobs have been about half of all the LHCb activity on the grid. So far, no problems reported at RAL.
2. Waiting for fix to LHCb simulation application (jobs loop and run out of time) before restarting production on the Grid. Problem seems to be traced back to Geant4 - still being debugged. In meantime, all simulation activity on the Grid is stopped, pending resolution of this problem.

SI-5 Production Manager's report
---------------------------------

1) There will be a UKI/DTEAM meeting in London this Thursday and Friday. One day of the meeting will focus on our operations model and tasks in the new regional approach being adopted by EGEE ROCs and the other day will review more of the GridPP specific aspects. A draft agenda can be found here: http://indico.cern.ch/conferenceDisplay.py?confId=53442. If there are areas which you think should be covered but are not currently listed, please let me know.
2) February's EGEE availability and reliability report is now available: https://edms.cern.ch/document/963325/. The UKI figures were 94% availability and 94% reliability. The main problems seen during the month were at Imperial College sites (HEP and LeSC) due to storage issues; UCL-CENTRAL due to network and storage changes; Manchester due to space problems on the CE disks, and Cambridge due to Condor testing and network problems.
3) Phenogrid has requested that DN encryption be enabled at all sites that support it in order that they can follow usage patterns by users. It would be a simple switch but the current default in YAIM is for it to be off. This is due to be discussed at this week's GDB. If there are no showstoppers we should ask GridPP sites to switch on after that.
   We would prefer to wait until EGEE makes it the default but can change earlier.

SI-6 LCG Management Board report
---------------------------------

There had been no MB last week.

SI-7 Dissemination report
--------------------------

SP reported that Neasan O'Neill had been in Sicily last week with the EGEE team. There had been an article in Times Online about the EGEE meeting based on a press release that Neasan had written.

REVIEW OF ACTIONS
=================

332.1 AS to provide a plan for the tape drives, given the new information from IB - a detailed plan was required immediately, showing minimum spend relating to no data etc. Ongoing.

332.3 PC to pursue the issue of the network resilient link - providing installation costs and annual costs, and report back to the PMB. PC had sent info round, and Robin Tasker would provide further info at GridPP 22 UCL. The issue would need to be referred back to the OC. Ongoing.

336.1 JC to document procedure for ensuring black-listed sites are re-instated. JC noted that experiments have different ways of 'blacklisting' (re top level results & switches etc). Ongoing.

337.4 JG to circulate details of current plan and progress in relation to FTE effort on storage at RAL. Ongoing.

338.3 DB to raise the issue at the MB that, in response to job failures at ATLAS, a 'standard' solution at the back end of the SRM would be the best solution for all, instead of organising locally (implementing 'data movers' using non-grid tools) by experiment. DB to contact RJ to clarify this issue. DB had contacted RJ about this - ongoing.

338.4 JC to raise at dTeam the issue of blacklisting, suggesting delegated authority to experiments to provide blacklisting metric information - metric is required similar to freedom-of-choice tool. JC reported that this was raised at last week's DTEAM meeting.
Only ATLAS was represented and the response was that ATLAS could do it, but none of the status changes were recorded at the moment, so would require development from the panda people. JC would raise the issue again this week. Done, item closed.

338.7 AS to provide a 'review and report' on each of the 8 issues in turn, where the Tier-1 failed to meet the Q4 milestones, showing in detail why these were not met. In addition, a definite plan for completion of each was also to be provided. Discussion would follow provision of these reports. Report was circulated - SP/DB to respond. Done, item closed.

338.9 AS to circulate an email in due course relating to the PMB decision that the end of June was the latest date for migration to R89 - beyond which the move would not happen in 2009. Done, item closed.

339.1 DC to advise DB this week which individual would do the CMS experiment talk at GridPP22. Chris Brew. Done, item closed.

339.2 GP to advise DB this week which individual would do the 'other experiments' talk at GridPP22. GP would do the talk. Done, item closed.

339.3 AS to advise DB who the other speaker would be for the 2nd Tier-1 talk at GridPP22.

339.4 DB to contact DK to discuss the individual who could give the security talk at GridPP22. Mingchao Ma. Done, item closed.

339.5 DB to contact JG in relation to the 'micro-talks' on external services (CA, VOMS, GOC helpdesk or APEL) at GridPP22. JG doing. Done, item closed.

339.6 DB to email NG and RM in relation to the session at GridPP22 on plans for EGEEIII/EGI. RM would revisit the talk he gave at the PMB F2F, updated and for a wider audience. Done, item closed.

339.7 SL to advise the DB Agenda for GridPP22. Done, item closed.

339.8 JC to follow-up VO Registration cards. JC reported that some VOs need to be decommissioned, there were also VOs at institute-level but these do not appear on the CIC operations portal. Ongoing.

ACTIONS AS AT 09.03.09
======================

332.1 AS to provide a plan for the tape drives, given the new information from IB - a detailed plan was required immediately, showing minimum spend relating to no data etc.

332.3 PC to pursue the issue of the network resilient link - providing installation costs and annual costs, and report back to the PMB. PC had sent info round, and Robin Tasker would provide further info at GridPP 22 UCL. The issue would need to be referred back to the OC.

336.1 JC to document procedure for ensuring black-listed sites are re-instated. JC noted that experiments have different ways of 'blacklisting' (re top level results & switches etc).

337.4 JG to circulate details of current plan and progress in relation to FTE effort on storage at RAL.

338.3 DB to raise the issue at the MB that, in response to job failures at ATLAS, a 'standard' solution at the back end of the SRM would be the best solution for all, instead of organising locally (implementing 'data movers' using non-grid tools) by experiment. DB to contact RJ to clarify this issue.

339.3 AS to advise DB who the other speaker would be for the 2nd Tier-1 talk at GridPP22.

339.8 JC to follow-up VO Registration cards.

340.1 ALL: to give inputs to DB regarding a proto-MB; to send suggestions for participation in JRU to NG.

INACTIVE CATEGORY
=================

282.8 RM to monitor how R-GMA and networking issues impact on GridPP as matters progress. RM advised that this item should be moved to the 'inactive' category as it will develop over the coming months. RM discussed the issue with Steve Fisher and advised that support of R-GMA is required whilst APEL is dependent on it. RM reported that he has spoken to SF and there is currently no change to the R-GMA situation - process ongoing. RM advised that a small amount of effort was going into R-GMA on APEL but for the long term he wasn't sure.
The item needed to be kept here for review from time to time, and required to be re-visited around Easter 2009.

AOB
===

DB reported that GridPP23 had been booked at Cambridge University, Clare College for 7-10 September 2009. The Collaboration Dinner would be held at Peterhouse College.

The next PMB would take place on Monday 16 March at 12:55 pm.