GridPP PMB Minutes 339 - 23rd February 2009
===========================================

Present: David Britton (Chair), Tony Doyle, Sarah Pearce, Jeremy Coles, Steve Lloyd, Andrew Sansum, Robin Middleton, John Gordon, Dave Colling (Suzanne Scott - Minutes)

Apologies: David Kelsey, Tony Cass, Pete Clarke, Glenn Patrick, Roger Jones, Neil Geddes

1. CASTOR Database Upgrade plan
================================

AS reported that he had looked at the upgrade in detail - there appeared to be no technical problems with the plan. The length of downtime, however, was ill-constrained and an estimate only. AS advised that hardware would be put in place and the migration tested, which would give an indication of how long the process would take. The hope was to get the testing done by mid-March. DB asked that AS co-ordinate with the three main experiments in relation to downtime.

2. Planning for the R89 move
=============================

DB noted that there had been discussion last week about a date - this should be neither too early nor too late - and asked whether two dates were required (a planned date and a backup "latest possible" date). AS advised that they were only going to move on the last possible date; the end of June had been proposed, which allowed 10 weeks from the end of the move to collisions. DB noted that, given the track record from December until now, this might not be enough time. DB asked who was managing the situation day-to-day. AS advised that the Group Leader was the Machine Room Manager, Graham, along with the Head of Building Projects. AS advised that he, Graham, and JG as Project Sponsor were all applying pressure daily in relation to equipment depreciation and lost funding as a result of the delay. DB discussed whether there was any action GridPP could take to help press for the completion of R89.

3. This week's notes
=====================

- re GridPP23 @ Cambridge: JC reported that the most likely option was 7-11 September at Peterhouse College.
- NGS-3 funding: there was no further info available at present.

Re the Agenda at GridPP22 @ UCL, DB advised that the theme was Service Resilience and Disaster Planning - this would take both a high-level and a detailed view. Re names for speakers, Wednesday was the experiment views: DB & JC would contact the experiment reps, and talks were to be structured around a number of questions. DB asked whether DC would iterate for CMS. DC suggested Stuart, Chris Brew or himself. DB noted that feedback was required this week. DC would check with Stuart & Chris. For LHCb, Raja Nandakumar would give the talk. For ATLAS it would likely be Graeme or Peter, but this needed confirmation with RJ. GP had to advise re 'other experiments'.

ACTION 339.1 DC to advise DB this week which individual would do the CMS experiment talk at GridPP22.
ACTION 339.2 GP to advise DB this week which individual would do the 'other experiments' talk at GridPP22.

DB advised that the following session was a one-hour discussion on GridPP relating to facilitating experiment plans - some themes and threads were to be interwoven. Before the end of the session, Will Venters would give a talk on the Pegasus results. The next day would comprise 2 x Tier-1 talks - DB needed suggestions for two speakers (2 x 30-minute slots). AS advised that he would do the 'disaster planning' one, and would get another speaker.

ACTION 339.3 AS to advise DB who the other speaker would be for the 2nd Tier-1 talk at GridPP22.

DB noted that after that there would be a security talk, and asked for suggestions. Mingchao Ma, DK or Roman Wartle were suggested. DB would contact DK. JC suggested that it would be preferable to have someone from outside and supported inviting Roman; it was good to get people with different perspectives.

ACTION 339.4 DB to contact DK to discuss the individual who could give the security talk at GridPP22.

DB noted the Tier-2 talks would be followed by network resilience, which Robin Tasker had agreed to do.
In the afternoon there would be 'micro-talks' plus discussion (2-3 slides only) on external services: CA, VOMS, GOC helpdesk or APEL. DB would contact JG.

ACTION 339.5 DB to contact JG in relation to the 'micro-talks' on external services (CA, VOMS, GOC helpdesk or APEL) at GridPP22.

DB advised that the remainder of the session would be on future plans - EGI & EGEEIII. This would be covered by NG and/or RM. DB would send an email to both. DB noted that JG would possibly do a talk on UK/NGI.

ACTION 339.6 DB to email NG and RM in relation to the session at GridPP22 on plans for EGEEIII/EGI.

SL agreed to sort out the DB Agenda and advise.

ACTION 339.7 SL to advise the DB Agenda for GridPP22.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's report
-----------------------------

AS reported as follows:

Fabric:

1) R89 Machine Room. The main outstanding issue now is for the air-conditioning system to meet our acceptance criteria. Progress is being made and a clearer picture is emerging. It is not possible yet to say how rapidly the problem can be resolved. We have no date by when we expect the building to become available.
2) Migration to R89. The provisional date for the Tier-1 migration to R89 (subject to R89 availability) is the second half of June. An exact date will be announced shortly. We will review plan B planning for the continuation of the Tier-1 service in the absence of R89.
3) Disk and robotics deliveries are pending on R89 availability. CPU deliveries will soon be in the same situation.
4) Purchasing of remaining items on spend plan is progressing.

Staff:

Summary of staffing position:

1) Recruitments outstanding to reach original GridPP plan of 17 FTE
   a) 3rd team member of production team. Interviewing Tuesday.
   b) Experiment Support posts (50% funded by T1 and 50% by experiments). Here covering T1 effort only. Recently interviewed.
      i) 0.5 FTE. Just about to make one formal offer subject to work permit. April-May start date?
      ii) Did not offer on second position, but are now in informal discussion with likely applicant. Have yet to decide how we can officially restart the recruitment.

Service:

1) SAM availability last week was 100%.
2) CASTOR
   a) We continue to chase the big ID problem and have sent some dumps to Oracle (but need to acquire further debug info).
   b) A major hardware upgrade will be needed on the CASTOR core database hardware. An initial estimate of 3 days was given at last week's PMB meeting. The workplan for this upgrade was reviewed last Tuesday and we concluded that further testing was needed in order to better estimate the length of the required downtime. Testing is planned to be completed by the end of March, at which point we will be able to make a more accurate assessment.
   c) A CASTOR face to face meeting was held at RAL last week (no update available).
   d) CASTOR will be upgraded to 2.1.7-24, SRM 2.7-15; downtimes planned are:
      Underway   ATLAS downtime due to upgrading to CASTOR 2.1.7-24, SRM 2.7-15, kernel upgrades
      2 Mar 09   LHCb downtime due to upgrading to CASTOR 2.1.7-24, SRM 2.7-15, kernel upgrades
      3 Mar 09   CMS downtime due to upgrading to CASTOR 2.1.7-24, SRM 2.7-15, kernel upgrades
      5 Mar 09   Gen downtime due to upgrading to CASTOR 2.1.7-24, SRM 2.7-15, kernel upgrades
3) An upgrade to FTS 2.1 is underway.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ was absent.

SI-3 CMS weekly review & plans
-------------------------------

DC reported that CMS was relatively quiet at the moment. There had been minor problems at Imperial with the CE. Work at the Tier-1 was ongoing: ATLAS had not been using slots, so CMS had been using them instead; however, CMS jobs were bumped off once the fairshares were applied when ATLAS wanted back on. DC commented that it would be good to take back the debt over a period of time. AS noted that he was not sure this was possible, but it could be discussed at the next Tier-1 liaison meeting in March.
It was understood that it was for the experiments to say how they wanted the shares allocated, and adjustment might be possible.

SI-4 LHCb weekly review & plans
--------------------------------

In absentia, GP provided the following report:

No major issue to report this week. There have been problems at QMUL & Birmingham and elsewhere, which are being resolved by GGUS tickets. Mostly a problem with NFS (scaling issues of shared software area under heavy load).

Outlook: Continuing MC productions and user jobs.

SI-5 Production Manager's report
---------------------------------

JC reported as follows:

1) The accounting problems mentioned last week (QMUL figures looked too high for ATLAS) appear to be down to understanding the nature of the published numbers used for the cross-checks. The numbers are re-normalised to the clock frequency of the CPU core, and with this clarified there is not a large discrepancy with the ATLAS figures, which use "raw CPU seconds". This now raises a new question about why some other site accounts are not higher! Although cross-checking accounting figures is part of the Quarterly Reporting task for each Tier-2, checking that the figures are close to the experiment "experience" is a task that can and should be carried out. Therefore an accounting review seems appropriate, with a focus on Q4 2008. DB noted that we needed to have a uniform view across sites for the purposes of the Tier-2 hardware allocation.
2) Kashif Mohammad in Oxford (SouthGrid EGEE coordinator) has now enabled a first (prototype) version of the GridPP/UKI Nagios. The next step is for site admins to look at the site results and provide feedback (DTEAM review tomorrow). For the service status use this link https://gridppnagios.physics.ox.ac.uk/nagios/ and click on the links under Monitoring. Hostgroup summary and Service Detail are two places to look first.
Note that there is a lot of information here and we have yet to clean up the look and data (so do not be alarmed by the red boxes!).
3) There are new reports of job efficiency concerns - not just that jobs are (in)efficient or otherwise, but that the site tools do not always give expected results. As a consequence some jobs are reported as efficient (by say qstat) when they are stuck and other jobs that are stuck look efficient! We will look at ways to assess the batch system information for reliability and decide what can be done, including establishing how the misinformation is causing a problem in itself. As in 1) it may be useful to carry out an experiment-site results comparison.

SI-6 LCG Management Board report
---------------------------------

DB had not been able to attend, and JG was absent today.

SI-7 Dissemination report
--------------------------

SP reported that Neasan O'Neill had done a news item re the ATLAS Tutorial; he was also getting ready for the EGEE User Forum in Catania next week.

REVIEW OF ACTIONS
=================

332.1 AS to provide a plan for the tape drives, given the new information from IB - a detailed plan was required immediately, showing minimum spend relating to no data etc. O N G O I N G.
332.3 PC to pursue the issue of the network resilient link - providing installation costs and annual costs, and report back to the PMB. O N G O I N G.
336.1 JC to document procedure for ensuring black-listed sites are re-instated. JC noted that experiments have different ways of 'blacklisting' (re top level results & switches etc). O N G O I N G.
337.3 JC & JG to form a Working Group with Andy Richards and Dave Wallom to define which Grid services could be run by a UK NGI post April 2011. JC reported that he had met with Andy Richards and Dave Wallom following an NGS-3 meeting last Tuesday. Note that GridPP would be part of the NGI so the current approach was looking at what services would be run assuming an NGI-EGI model and then who would be responsible for what.
Most of the services *could* be run by a UK NGI but what was actually possible depended on the funding level. DONE, item closed.
337.4 JG to circulate details of current plan and progress in relation to FTE effort on storage at RAL. O N G O I N G.
337.6 JC to follow-up at dTeam the problems/issues of two sites (UCL & Birmingham) not meeting EGEE targets. Birmingham had been investigated. JC to follow-up UCL. JC reported that UCL had a mixture of problems. The last week of availability from SAM ops was 99%. Earlier problems related to the queue structure. The site also experienced high loads during a testing period causing expiry of user proxies. Finally, some tests failed as not all the CA RPMs were up-to-date. DONE, item closed.
337.7 DC to provide the CMS report to SP for the Quarterly Reports; A McNab to provide the Security Report. DONE, item closed.
338.1 JC to contact GridPP VOs to ask if they are aware of the two new security policy documents plus the additional responsibilities involved with the VO Managers' roles. JC to cross-check with ATLAS, LHCb and CMS. JC reported that GP mailed the GridPP User Board last week. JC contacted GridPP VOMS-based VOs via the Broadcast tool, though this has uncovered several VOs without proper representation in the CIC portal ID cards (ltwo; cedar; ralpp; manmace and ukqcd). JC also mailed the UK LHC experiment representatives but it was clear everyone was already covered... but since the request was for a cross-check this has been done. DONE, item closed.
ACTION 339.8 JC to follow-up VO Registration cards.
338.2 GP to send a message to the UB which would also reach the VO Managers. DONE, item closed.
338.3 DB to raise the issue at the MB that, in response to job failures at ATLAS, a 'standard' solution at the back end of the SRM would be the best solution for all, instead of organising locally (implementing 'data movers' using non-grid tools) by experiment. O N G O I N G.
338.4 JC to raise at dTeam the issue of blacklisting, suggesting delegated authority to experiments to provide blacklisting metric information - metric is required similar to freedom-of-choice tool. O N G O I N G.
338.5 JC to raise at dTeam the storage issue of space-tokens reporting, for dTeam to follow-up (14 reported incorrectly; 14 were missing; among those published correctly, 4 did not have sufficient capacity to be usable by ATLAS). DONE, item closed (but issues ongoing within the Storage Group).
338.6 SP to ask Robin Tasker/Mark Leese in relation to the Gridmon red metric on the Project Map, and discuss a plan for resolution. SP spoke to RT and is awaiting a response. DONE, action closed.
338.7 AS to provide a 'review and report' on each of the 8 issues in turn, where the Tier-1 failed to meet the Q4 milestones, showing in detail why these were not met. In addition, a definite plan for completion of each was also to be provided. Discussion would follow provision of these reports. O N G O I N G.
338.8 TD to respond to DB regarding Ian Bird's nomination to the e-Science Panel. He had not been nominated, and what was the closing date? IB had been nominated by JG & SL. DONE, item closed.
338.9 AS to circulate an email in due course relating to the PMB decision that the end of June was the latest date for migration to R89 - beyond which the move would not happen in 2009. O N G O I N G.
338.10 JC to discuss certificate reminders with Jens (some people were not receiving reminders of expiry). DONE, item closed.

ACTIONS AS AT 23.02.09
======================

332.1 AS to provide a plan for the tape drives, given the new information from IB - a detailed plan was required immediately, showing minimum spend relating to no data etc.
332.3 PC to pursue the issue of the network resilient link - providing installation costs and annual costs, and report back to the PMB.
336.1 JC to document procedure for ensuring black-listed sites are re-instated.
JC noted that experiments have different ways of 'blacklisting' (re top level results & switches etc).
337.4 JG to circulate details of current plan and progress in relation to FTE effort on storage at RAL.
338.3 DB to raise the issue at the MB that, in response to job failures at ATLAS, a 'standard' solution at the back end of the SRM would be the best solution for all, instead of organising locally (implementing 'data movers' using non-grid tools) by experiment. DB to contact RJ to clarify this issue.
338.4 JC to raise at dTeam the issue of blacklisting, suggesting delegated authority to experiments to provide blacklisting metric information - metric is required similar to freedom-of-choice tool. JC reported that this was raised at last week's DTEAM meeting. Only ATLAS was represented and the response was that ATLAS could do it, but none of the status changes were recorded at the moment, so it would require development from the panda people. JC would raise the issue again this week.
338.7 AS to provide a 'review and report' on each of the 8 issues in turn, where the Tier-1 failed to meet the Q4 milestones, showing in detail why these were not met. In addition, a definite plan for completion of each was also to be provided. Discussion would follow provision of these reports.
338.9 AS to circulate an email in due course relating to the PMB decision that the end of June was the latest date for migration to R89 - beyond which the move would not happen in 2009.
339.1 DC to advise DB this week which individual would do the CMS experiment talk at GridPP22.
339.2 GP to advise DB this week which individual would do the 'other experiments' talk at GridPP22.
339.3 AS to advise DB who the other speaker would be for the 2nd Tier-1 talk at GridPP22.
339.4 DB to contact DK to discuss the individual who could give the security talk at GridPP22.
339.5 DB to contact JG in relation to the 'micro-talks' on external services (CA, VOMS, GOC helpdesk or APEL) at GridPP22.
339.6 DB to email NG and RM in relation to the session at GridPP22 on plans for EGEEIII/EGI.
339.7 SL to advise the DB Agenda for GridPP22.
339.8 JC to follow-up VO Registration cards.

INACTIVE CATEGORY
=================

282.8 RM to monitor how R-GMA and networking issues impact on GridPP as matters progress. RM advised that this item should be moved to the 'inactive' category as it will develop over the coming months. RM discussed the issue with Steve Fisher and advised that support of R-GMA is required whilst APEL is dependent on it. RM reported that he has spoken to SF and there is currently no change to the R-GMA situation - process ongoing. RM advised that a small amount of effort was going into R-GMA on APEL but for the long term he wasn't sure. The item needed to be kept here for review from time to time, and required to be re-visited around Easter 2009.

DB proposed that there be no PMB next Monday, given that people were away and the EGEE User Forum was happening. It was agreed that the next PMB Meeting would be at 12:55 pm on Monday 9th March.