GridPP PMB Minutes 338 - 16th February 2009
===========================================

Present: David Britton, Tony Doyle (Chair), Sarah Pearce, Roger Jones, Pete Clarke, Jeremy Coles, Steve Lloyd, Glenn Patrick, Andrew Sansum (Suzanne Scott - Minutes)

Apologies: David Kelsey, Tony Cass, Robin Middleton, John Gordon, Dave Colling, Neil Geddes

2. JSPG Security Policy
========================

TD noted that DK had circulated an email regarding two new replacement security policies. Comments were invited up to the end of February. It was asked whether these had been discussed with sites and VO Managers? Not as yet. DB advised that GridPP should ask the VO Managers to assure us that the procedure as outlined is satisfactory and can be followed, especially for the smaller VOs. TD asked if we had a list of VOs that we support in the UK, and were their Managers aware of their new responsibilities? JC advised yes to the first part, but no to the second. TD asked if we had a Managers' List for VOs? AS advised that we must have contact details available from the VO card. JC noted they were listed at VOMS.

ACTION 338.1 JC to contact GridPP VOs to ask if they are aware of the two new security policy documents plus the additional responsibilities involved with the VO Managers' roles? JC to cross-check with ATLAS, LHCb and CMS.

ACTION 338.2 GP to send a message to the UB which would also reach the VO Managers.

4. Week's Notes
================

TD noted that Greig Cowan, Yves Coppens, Jon Wakelin and Keith Sephton were all leaving GridPP. The PMB noted that the project would miss the enormous contributions that Greig had made in the storage area and wished to express their thanks. They fully understood and supported the rationale behind his move to a physics role and wished him the best of luck. The PMB also appreciated the efforts Edinburgh were making to ensure that the transition would be as smooth as possible.
The PMB noted that Yves Coppens at Birmingham had played a central role in pre-testing releases, maintaining the Birmingham site (including integration with their new shared clusters) and improving SouthGrid's overall performance. Jon Wakelin had been instrumental in deploying Storm at Bristol and integrating the Grid with the shared eScience cluster. It was noted that Keith Sephton was also leaving, and the PMB expressed its appreciation for his valuable work on the WMS and SunGrid Engine. The PMB wished to express their thanks to all, and to wish them the best for the future.

3. CASTOR Database upgrade
===========================

AS reported that a major hardware upgrade would be needed on the CASTOR core database hardware. A possible three-day downtime may well be needed to carry out this work. They were still reviewing options as to how they could carry out the upgrade in the least intrusive manner. An internal meeting was scheduled for Tuesday. AS noted that originally a five-day period of downtime had been envisaged, but he felt this was being ultra-cautious on contingency. AS advised that the team were debating this issue at the moment, but interventions in the following areas were likely:

a) a single RAID-array was deemed to be a single point of failure - this was being replaced by 2
b) because of the way the DB was configured, it was not possible to do a transparent intervention
c) backups before and after migration were required
d) sufficient notification (~1 month) was required to all experiments

AS noted that he fully expected this downtime to happen. RJ noted his extreme concern - and asked why this could not be done when the move happened? AS felt this was not a good idea, as if things didn't work following the move, it would be difficult to pinpoint why. DB agreed, but noted that downtime needed to be done in conjunction with the experiments (who were the clients). RJ would advise AS of a possible time-period. This would be revisited next week.
AS would circulate a summary note following the internal meeting on Tuesday.

4. Week's Notes
================

Re GridPP23 at Cambridge, JC advised that he was trying to contact the College concerned via Andy Parker, and was currently awaiting a response.

Re NGS-3 funding, it was noted that there was nothing further to report at present.

1. 08Q4 Quarterly Reports Summary
==================================

SP had circulated an email summary and this was discussed section by section.

There was a particular discussion about job failures at ATLAS. RJ advised that the job failures were dominated by access to the storage. The storage access was - according to the storage review - mainly a problem with using SRM to access the data and the database back ends. They had just written and implemented a local version of the 'data mover' that used non-grid tools, and this seemed on first experience to work better. Time would tell - though the ATLAS OSC noted this as a move away from Grid solutions, and there was concern about the effort to 'fix' on a site by site basis like this. The PMB agreed with this concern and advised that it could assist by applying pressure to ensure that the SRM was upgraded across the board. RJ advised that a 'standard' solution at the back end of the SRM would be the best solution for all, instead of organising locally by experiment. TD noted that a joint team response was required with the Tier-1 and ATLAS. AS advised that it should be raised at the MB.

ACTION 338.3 DB to raise the issue at the MB that, in response to job failures at ATLAS, a 'standard' solution at the back end of the SRM would be the best solution for all, instead of organising locally (implementing 'data movers' using non-grid tools) by experiment.

In relation to the metric for 'Time sites on VO blacklists', it was agreed to add a 'metric 3' to the experiment boxes in relation to blacklisting.
ACTION 338.4 JC to raise at dTeam the issue of blacklisting, suggesting delegated authority to experiments to provide blacklisting metric information - metric is required similar to freedom-of-choice tool.

In relation to storage & data management, it was noted that Atlas space tokens were published by the SEs via their information systems. (CMS do not require space tokens at T2s, and LHCb do not require storage at T2s.) 78% of space tokens deployed correctly; 14 reported incorrectly; 14 missing. Among those published correctly, 4 did not have sufficient capacity to be usable by Atlas.

ACTION 338.5 JC to raise at dTeam the storage issue of space-tokens reporting, for dTeam to follow-up (14 reported incorrectly; 14 were missing; among those published correctly, 4 did not have sufficient capacity to be usable by Atlas).

In relation to GridMon, it was noted that Gridmon was operational but data was not being transferred from the nodes to the central database in some instances. It was agreed to raise this with Robin Tasker as it was considered to be broken at present. The Tier-1 required GridMon but GridPP needed to follow this up.

ACTION 338.6 SP to ask Robin Tasker/Mark Leese in relation to the Gridmon red metric on the Project Map, and discuss a plan for resolution.

In relation to staffing and red metrics at Tier-1, it was noted that the Tier-1 still had not recruited up to the original GridPP plan level of 17FTE. AS reported:

a) 3rd team member of production team. Have failed twice on advertisement and once on agency route (3 sets of interviews). We now have another shortlist (via new agency) and interview on 24th February. Shortlist looks good and I expect we will be successful. April start date?
b) Experiment Support posts (50% funded by T1 and 50% by experiments). Here covering T1 effort only - recently interviewed.
i) 0.5 FTE Just about to make one formal offer subject to work permit. April-May start date?
ii) Did not offer on second position, but are now in informal discussion with likely applicant. Have yet to decide how we can officially restart the recruitment.

In relation to Milestones, it was noted that Tier-1 had substantially failed to meet these targets. The red milestones did not in the main relate to LHC delays and it was considered a matter of extreme concern that the Tier-1 had not met the required targets in Q4. Outstanding issues were as follows:

Disaster and business continuity plan available
Recruitment (see above)
Review of overall effectiveness of experiment support
Provide site dashboard for experiments
R89 available for installation
2008 disk/CPU received
2008 disk/CPU hardware accepted and bill paid
Migration to 64 bit

The PMB required AS to provide a 'planning review and report' on each of these 8 issues in turn, showing in detail why these were not met. In addition, a definite plan for completion was also to be provided. Discussion would follow provision of these reports.

ACTION 338.7 AS to provide a 'review and report' on each of the 8 issues in turn, where the Tier-1 failed to meet the Q4 milestones, showing in detail why these were not met. In addition, a definite plan for completion of each was also to be provided. Discussion would follow provision of these reports.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's report
-----------------------------

AS reported as follows:

Fabric:

1) R89 Machine Room. The building now has a fire certificate and has passed other building control milestones. The main outstanding issue now is for the air conditioning system to meet our acceptance criteria. Indications are that the air conditioning issue is real underperformance (not a protocol/measurement problem); work is underway from both sides to understand the issue.
2) Migration to R89. The uncertainty in when R89 will become available has had a knock-on 'planning blight' effect on our other plans.
In order to regain some certainty in planning we now (irrespective of when R89 actually turns up) plan to move the Tier-1 migration back to the second half of June. This is almost the latest that we can schedule this move and still deliver stable operations for LHC data taking and experiment requirements for stability. Before fixing on this date we have to address financial issues regarding spend on migration in FY 2009; once this is done a proposed new Tier-1 migration schedule will be announced. We will review plan B planning for the continuation of the Tier-1 service in the absence of R89.
3) Disk and robotics deliveries are pending on R89 availability. CPU deliveries will soon be in the same situation.
4) Purchasing of remaining items on spend plan is progressing.

Staff:

Summary of staffing position:

1) Recruitments outstanding to reach original GRIDPP plan of 17 FTE
a) 3rd team member of production team. Have failed twice on advertisement and once on agency route (3 sets of interviews). We now have another shortlist (via new agency) and interview on 24th February. Shortlist looks good and I expect we will be successful. April start date?
b) Experiment Support posts (50% funded by T1 and 50% by experiments). Here covering T1 effort only. Recently interviewed.
i) 0.5 FTE Just about to make one formal offer subject to work permit. April-May start date?
ii) Did not offer on second position, but are now in informal discussion with likely applicant. Have yet to decide how we can officially restart the recruitment.
2) GRIDPP agreed that we can raise maximum staff level to +1 above 17 FTE level in order to better hit 17 FTE target allowing for staff losses in FY09/FY10 and some underbooking in FY08 beyond planned underbooking. Cheney Ketley will start booking against GRIDPP in April 2009. Cheney will work on a mix of Fabric and CASTOR work.
3) GRIDPP agreed that we could raise average booking to 18 FTE (we will staff at peak of 18 +1 in order to hit average of 18).
We are still discussing the details here (and John Gordon is actioned to report to the PMB on this). However as we reported at the CASTOR review our problem was that other projects were subsidising CASTOR and this funding will reduce. We therefore expect that we will increase bookings on the CASTOR line by +1 for this work in April by reallocating GRIDPP funding to existing staff. We also plan to retrain one existing staff member to better handle the interaction between CASTOR and Oracle.
4) Outside GRIDPP we are just about to commence recruitment of a PPS manager - a significant part of their work will be to operate a CASTOR PPS instance to improve testing.

Service:

1) SAM availability last week was 100%. SAM availability for January was 100%.
2) CASTOR
a) We continue to chase the big ID problem and have sent some dumps to Oracle (but need to acquire further debug info). This problem is impacting availability for ATLAS (it is also slightly impacting LHCB).
b) A major hardware upgrade will be needed on the CASTOR core database hardware. A possible three day downtime may well be needed to carry out this work. We are still reviewing options as to how we can carry out the upgrade in a less intrusive manner. An internal meeting is scheduled for Tuesday.
c) A CASTOR face to face meeting will be held at RAL this week. The main things we are trying to get out of this meeting are (not in priority order):
i) A better understanding of how CERN do their monitoring, and while accepting that we do it differently (for good reasons), consider if there are ways we can learn from them and improve what we do.
ii) A catch up session on tapes and database operations
iii) A session on disk server deployment, to understand how CERN (and other sites) do theirs and why, and what the fabric team can learn from this
iv) A session on load testing, to consider how we can best optimise the required improvement in CASTOR certification test bed load test (as discussed at the OC committee).
v) Off-line, we will also be discussing support for CASTOR 2.1.7, in the light of RAL's plans and CERN's plans, and the current agreement for CERN to support the latest 2 versions. Our aim here is to ensure CERN will continue to support 2.1.7 as long as we need to.
d) The ALICE instance has been provisioned with 11 xrootd disk servers.
e) ATLAS tape stagein tests worked well last week. Tape drives were found to be operating at close to our long range capacity planning levels. More details have emerged regarding frequency and duration of ATLAS reprocessing runs, providing useful input into the calculation of number of tape drives we will need.
f) CASTOR will be upgraded to 2.1.7-24, SRM 2.7-15. Downtimes planned are:
23, 24 Feb 09 ATLAS downtime due to upgrading to CASTOR 2.1.7-24, SRM 2.7-15, kernel upgrades
2 Mar 09 LHCb downtime due to upgrading to CASTOR 2.1.7-24, SRM 2.7-15, kernel upgrades
3 Mar 09 CMS downtime due to upgrading to CASTOR 2.1.7-24, SRM 2.7-15, kernel upgrades
5 Mar 09 Gen downtime due to upgrading to CASTOR 2.1.7-24, SRM 2.7-15, kernel upgrades
3) An upgrade to FTS 2.1 is scheduled for 23 February.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ reported as follows:

ATLAS job failures were dominated by access to the storage. The storage access was - according to the storage review - mainly a problem with using SRM to access the data and the database back ends. We have just written and implemented a local version of the 'data mover' that uses non-grid tools, and this seemed on first experience to work better. Time will tell - though the ATLAS OSC noted this as a move away from Grid solutions, and worried about the effort to 'fix' on a site by site basis like this.

Improvements:

Tier-1 data acceptance from CERN/Tier-1s and Tier-2s - up from 79% to 94%
Tier-2 data acceptance from Tier-1 - up from 65% in Q2 to 94% in Q4.

RJ considered that the previous numbers were dominated by issues at individual sites.
As these got fixed, the figures improved. They also weighted the site by the expected ATLAS fraction, so some sites with problems (like Manchester for example) had a big effect.

SI-3 CMS weekly review & plans
-------------------------------

DC reported in absentia:

Tier 1
======

(Status sent by Chris to FacOps last Friday)

o 3 Problems with CASTOR
- A repeat of the Big IDs problem causing exports to fail
- A problem with the user mappings which stopped imports for a while
- An unknown problem that stopped exports
o User mappings for /production and local phedex changed to cmdprg account and permissions on the /store/data and /store/mc areas changed to 755 (Now I've done this I'll start chasing the other T1s to do it).
o CASTOR outage that was postponed from this week at DataOps request is now scheduled for 3rd March
o Production "suffered" a bit from getting too many slots over the weekend, so going over the CMS fairshare and then getting starved of slots when Atlas came back online. Seem to be getting CMS job starts again now.

Tier 2s
=========

o Few minor problems with the IC CE which caused some analysis jobs to fail. Being investigated.
o In the process of bringing Oxford online as T3/opportunistic T2
o ScotGrid
- ECDF: dpm Certificate and job abort problems solved. Failing sam test for one remaining issue: stageout with lcg-cp failing possibly due to version installed at ecdf not liking format of source turl (file:////... - as used in srmcp). Need to verify and change sam test - also try to update version at ecdf
- Try to get ball rolling at Durham and Glasgow this week. Note for them: space tokens not required though we can use them - though the implementation is fairly new so likely to be buggier than normal.

SI-4 LHCb weekly review & plans
--------------------------------

GP reported as follows:

1) Problem with a disk server at RAL affecting some transfers on Friday (13 Feb). Fixed same day.
2) Small-scale FEST-like tests were done last Wednesday & Thursday (in preparation for next FEST week beginning on 2 March 2009) and were successful. Jobs at RAL currently dominated by user analysis. Very little production, until new versions of software are validated - 100 million events will need to be created for the next run of FEST.

SI-5 Production Manager's report
---------------------------------

JC reported as follows:

1) [This is relevant but already being discussed in the DB context so partly for information] There is a new discussion about possible accounting baseline discrepancies between sites. At this time it is not clear if it is due to use of CPU over WALL time data in the comparisons, incorrect published KSI2K values (or how these get used in the accounting) or farm capacity figures from gstat. One thing that is clear is that the APEL figures do not currently align well with the experiment figures as seen by ATLAS. The matter has been referred to the DB.
2) There are renewed concerns about the quality of the WMS service. Periods are seen but not understood where jobs stall; attempts to debug these have suggested the underlying problem probably rests with code lockups. We have for quite some time seen evidence of the problem from Steve's page: http://pprc.qmul.ac.uk/~lloyd/gridpp/atest.html and reported it, but now other countries are also seeing problems. Looking at Steve's results page for the last ten hours (for example) shows a problem for user analysis jobs 9 hours ago.
3) For sites upgrading to gLite 3.1 there is an issue at Cambridge in that the gLite 3.1 CE has never been successfully integrated with the Condor batch system. I am told the Condor part of the release is still in the PPS. The SA3 condor batch system team indicate that the release does not yet work correctly for WMS submission. This prevents us declaring that we are ready for gLite 3.0 to become obsolete (see last week's issues).
This is being investigated in SA3 but the current advice seems to be to deploy a non-production ready release and debug it.
4) Several staff have recently left or are about to leave GridPP and we should thank them for their very strong contributions. In particular, Yves Coppens at Birmingham has played a central role in pre-testing releases, maintaining the Birmingham site (including integration with their new shared clusters) and improving SouthGrid's overall performance. Jon Wakelin has been instrumental in deploying Storm at Bristol and integrating the Grid with the shared eScience cluster.
5) Data on an old classic SE at RAL will be removed shortly. Jens is attempting to clarify with the VOs concerned whether any of the data (probably from EDG days) is needed: Dteam, cms, Atlas 765GB ~ 500 files, Alice 7GB, zeus H1, LHCB. This raises the question about legacy data and prompts us to check the procedures used when removing SEs.

SI-6 LCG Management Board report
---------------------------------

No issues to report.

SI-7 Dissemination Report
--------------------------

SP reported that C Burne @ QMUL was going on maternity leave in May, therefore they would be recruiting a replacement.

REVIEW OF ACTIONS
=================

332.1 AS to provide a plan for the tape drives, given the new information from IB - a detailed plan was required immediately, showing minimum spend relating to no data etc. O N G O I N G.

332.3 PC to pursue the issue of the network resilient link - providing installation costs and annual costs, and report back to the PMB. O N G O I N G.

333.3 JC and DC to write a paragraph which summarises the Tier-2 and Tier-3 positions. This need not come back to the PMB for discussion, but can be circulated once done. Response from DC awaited. A summary should be written down so that we have something formal to refer to. JC reported that he had iterated with DC and a paragraph was currently with DC for approval. Done, circulated.
336.1 JC to document procedure for ensuring black-listed sites are re-instated. JC noted that experiments have different ways of 'blacklisting' (re top level results & switches etc). Procedure was now reported as a 'standing item' on the dTeam list, but should be more regularly reviewed within the experiments. O N G O I N G.

337.1 DB to contact Malcolm Booy in relation to 'revised MoU costs' item and inform him that the error in the figures did not relate to 'costs' at all, and also did not affect the conclusion reached. Done.

337.2 DB to circulate the policy information re the prioritisation process to the PMB. Done.

337.3 JC & JG to form a Working Group with Andy Richards and Dave Wallom to define which Grid services could be run by a UK NGI post April 2011. O N G O I N G.

337.4 JG to circulate details of current plan and progress in relation to FTE effort on storage at RAL. O N G O I N G.

337.5 SP to request that, via Neasan O'Neill, the web page on KE and EI be made more prominent, and maintained. Done.

337.6 JC to follow-up at dTeam the problems/issues of two sites (UCL & Birmingham) not meeting EGEE targets. Birmingham had been investigated. JC to follow-up UCL.

337.7 DC to provide the CMS report to SP for the Quarterly Reports; A McNab to provide the Security Report. O N G O I N G.

ACTIONS AS AT 16.02.09
======================

332.1 AS to provide a plan for the tape drives, given the new information from IB - a detailed plan was required immediately, showing minimum spend relating to no data etc.

332.3 PC to pursue the issue of the network resilient link - providing installation costs and annual costs, and report back to the PMB.

336.1 JC to document procedure for ensuring black-listed sites are re-instated. JC noted that experiments have different ways of 'blacklisting' (re top level results & switches etc).

337.3 JC & JG to form a Working Group with Andy Richards and Dave Wallom to define which Grid services could be run by a UK NGI post April 2011.
337.4 JG to circulate details of current plan and progress in relation to FTE effort on storage at RAL.

337.6 JC to follow-up at dTeam the problems/issues of two sites (UCL & Birmingham) not meeting EGEE targets. Birmingham had been investigated. JC to follow-up UCL.

337.7 DC to provide the CMS report to SP for the Quarterly Reports; A McNab to provide the Security Report.

338.1 JC to contact GridPP VOs to ask if they are aware of the two new security policy documents plus the additional responsibilities involved with the VO Managers' roles? JC to cross-check with ATLAS, LHCb and CMS.

338.2 GP to send a message to the UB which would also reach the VO Managers.

338.3 DB to raise the issue at the MB that, in response to job failures at ATLAS, a 'standard' solution at the back end of the SRM would be the best solution for all, instead of organising locally (implementing 'data movers' using non-grid tools) by experiment.

338.4 JC to raise at dTeam the issue of blacklisting, suggesting delegated authority to experiments to provide blacklisting metric information - metric is required similar to freedom-of-choice tool.

338.5 JC to raise at dTeam the storage issue of space-tokens reporting, for dTeam to follow-up (14 reported incorrectly; 14 were missing; among those published correctly, 4 did not have sufficient capacity to be usable by Atlas).

338.6 SP to ask Robin Tasker/Mark Leese in relation to the Gridmon red metric on the Project Map, and discuss a plan for resolution.

338.7 AS to provide a 'review and report' on each of the 8 issues in turn, where the Tier-1 failed to meet the Q4 milestones, showing in detail why these were not met. In addition, a definite plan for completion of each was also to be provided. Discussion would follow provision of these reports.

338.8 TD to respond to DB regarding Ian Bird's nomination to the e-Science Panel. He had not been nominated, and what was the closing date?
338.9 AS to circulate an email in due course relating to the PMB decision that the end of June was the latest date for migration to R89 - beyond which the move would not happen in 2009.

338.10 JC to discuss certificate reminders with Jens (some people were not receiving reminders of expiry).

INACTIVE CATEGORY
=================

282.8 RM to monitor how R-GMA and networking issues impact on GridPP as matters progress. RM advised that this item should be moved to the 'inactive' category as it will develop over the coming months. RM discussed the issue with Steve Fisher and advised that support of R-GMA is required whilst APEL is dependent on it. RM reported that he has spoken to SF and there is currently no change to the R-GMA situation - process ongoing. RM advised that a small amount of effort was going into R-GMA on APEL but for the long term he wasn't sure. The item needed to be kept here for review from time to time, and required to be re-visited around Easter 2009.

AOCB
====

1. DB had asked if anyone had nominated Ian Bird to the e-Science Panel? No-one had. TD would respond to DB.

2. Re R89, AS asked that the PMB agree a date of late June for latest migration, beyond which the move would not happen. This was agreed. TD summarised that the end of June was the final possible date. AS would circulate an email accordingly.

3. SP noted she had not received a certificate reminder - JC would discuss with Jens.

Next PMB would be held at 12:55 pm on Monday 23rd February.