GridPP PMB Minutes 341 - 16th March 2009
=======================================

Present: David Britton (Chair), Sarah Pearce, Jeremy Coles, Andrew Sansum, Pete Clarke, Robin Middleton, John Gordon, Roger Jones, Neil Geddes, Tony Doyle, David Kelsey, Tony Cass.

Apologies: Glenn Patrick, Steve Lloyd, Dave Colling, Suzanne Scott.

1. R89 Status
==============

An independent expert was to be appointed by the end of this week to advise on the R89 air conditioning problems. A report is hoped for by 1st May. If the problem is "trivial" (requires ~1 week to fix) then there will still be time to move the Tier-1. If the problem is more serious or the report is delayed, then we will have to stay put in the ATLAS centre with a split Tier-1 as a long-term reality. DB asked whether we can see a silver lining in terms of resilience (having two locations where some critical services can be duplicated). AS agreed that there could be some positives. DB asked about recouping the extra costs required to do essential hardening of services remaining in the ATLAS centre (such as more UPS).

ACTION: AS to review resilience of services that may have to remain in the ATLAS building.

2. JRU Meeting(s)
==================

RM reported on this morning's JRU meeting (Joint Research Unit under which the EGEE groups operate). JISC view the NGS as the natural body to evolve into an NGI for involvement in EGI. Things are happening quickly, with talk of an LOI/MOU process in the next 2-3 months resulting in a commitment to pay 70K Euro this coming year. There was a discussion about how JISC-STFC-EPSRC were being involved and how the interests of GridPP communities were being represented. DB had proposed that a shadow NGI management board be created to manage this process, but the current approach was to use the existing JRU management board as the JRU was a recognised legal entity. DB's concerns about representation will be addressed in the short term by inviting additional people to the JRU meetings.
The next meeting is scheduled for 4pm on Thursday 2nd April at UCL following the GridPP collaboration meeting (to be confirmed).

ACTION: RM to contact Ben Waugh about room and phone/video facilities for this meeting.

It was noted that there are two questionnaires upcoming: 1) an EGI questionnaire asking about who runs what services (EU-wide); 2) an EGEE CB questionnaire (currently in draft form) which asks sites about issues related to a no-cost extension of EGEE.

ACTION: NG to consult/inform GridPP PMB on responses to questionnaires.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's report
-----------------------------

AS reported as follows:

Fabric.

1) R89 Machine Room. Work continues to understand the air conditioning problem.
2) Migration to R89. Planned for approximately 2 weeks commencing 22nd June (provided R89 is available). A possible alternative plan B is a 1-week migration of critical components only, commencing 6th July. The building must be accepted by 1st May in order for us to schedule the machine room migration for 22nd June (our planned - latest possible date).
3) Disk and robotics deliveries are pending on R89 availability. CPU deliveries will soon be in the same situation. We plan to guarantee delivery of disk, CPU and robotics into either R89 or ATLAS no later than 30 May 2009. We are commencing work to install power in the ATLAS centre in order to be able to do so.
4) Purchasing of remaining items on the spend plan is progressing well.

Staff

Summary of staffing position:

1) Recruitments outstanding to reach the original GridPP plan of 17 FTE
   a) 3rd member of the production team has accepted and will start 23rd March.
   b) Experiment Support posts (50% funded by T1 and 50% by experiments). Here covering T1 effort only. Recently interviewed.
      i) 0.5 FTE. Just about to make one formal offer subject to work permit. April-May start date? Negotiations continue.
      ii) Did not offer on second position, but are now in informal discussion with a likely applicant. Have yet to decide how we can officially restart the recruitment (pending closure of first).

Service
=======

1) SAM availability last week was 99%.
2) A site DNS server failed (caused by high load) on 9th March. This had a severe impact on the Tier-1 (particularly the ATLAS CASTOR instance) until early on 10th March when the problem was resolved. See incident report: http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20090309
   We will probably install two local slave DNS servers within the Tier-1.
3) CASTOR
   a) We are making little progress chasing the big ID problem, but are continuing actively to pursue it. It continues to impact the service. DB asked what had happened to the wLCG-ORACLE meeting that had been proposed for Feb. JG responded that CERN had not followed through on this in time and that it was being rescheduled.
      ACTION: JG to establish new date for the wLCG-ORACLE meeting.
   b) A major hardware upgrade will be needed on the CASTOR core database hardware. No date has been fixed (pending tests).
   c) We are investigating problems with hanging gridftp connections which are particularly impacting LHCb. We are preparing a Nagios test to detect and rectify this problem when it occurs.
   d) We are consulting on a possible upgrade to CASTOR 2.1.8 before data taking, and when we have a recommendation to make we will raise this at the PMB for a final decision.
4) WMS Instability. We are suffering instability in the WMS service and have suffered several problems this week. Over the weekend a member of staff had to attend site to restart wms02. It was reported that many other sites (inc. CERN) are also seeing instabilities following the recent WMS mega-patch upgrade.

SI-2 ATLAS weekly review & plans
---------------------------------

ATLAS re-processing has restarted. This is going to be a test for the production queue, as it requires 3MB of memory. Other than that, nothing ATLAS specific.
SI-3 CMS weekly review & plans
-------------------------------

In absentia, DC provided the following report:

CMS status in the UK is not really remarkable. The T1 was hit by the DNS failure last week and Chris tells me that there was some SRM problem on Saturday but that all is green now. At the T2s we have had some appalling download rates to Brunel and Imperial which we need to investigate. The dCache at Imperial is improved for having more memory, but is still rather suffering under 1200 analysis jobs and is far from perfect. We need to upgrade the head node to a more powerful machine. It didn't help that all of these were accessing data stored on just one of the disk arrays and so everything was grinding very slowly. Imperial had gone from being consistently one of the best sites in CMS to being one of the worst because of these failures and we are keen to reverse this. Other than that things are pretty much ticking along.

SI-4 LHCb weekly review & plans
--------------------------------

In absentia, GP provided the following report:

LHCb status:
1. Problem with hanging FTS transfers. Affects both the RAL and CERN CASTOR instances and is caused by various problems/issues/bugs coming together. An emergency patch to DIRAC to alleviate the problem is in production. In the meantime, Shaun kills the hanging transfers. This affects the uptime of the RAL (and CERN) SEs.
2. Bugs in gLite WMS mean two CERN WMSes (wms203 and wms216) are unusable. This severely affects throughput of jobs by LHCb. GGUS tickets have been opened for this problem, but this service is covered only during working hours.

Outlook:
1. Primarily user job load. LHCb production/simulation will not be run until major bugs in LHCb simulation software are fixed.
2. Some reconstruction jobs running as part of tests to streamline the job processing framework.

Concerns:
1. Two Tier-1 sites (CNAF, ASGC) down simultaneously due to various problems.
2.
Two Tier-1s (CNAF, ASGC) have had fires (so-called low-probability events) within the last 9 months.
3. Three Tier-1s (CNAF, IN2P3, NIKHEF) for LHCb currently unusable for various reasons.
4. CASTOR at CERN has had major problems over the last week, proving that the CERN SE can also have its bad days.

SI-5 Production Manager's report
---------------------------------

1) Neasan has asked about our site support for external VOs. He notes that "a lot of work being done with EGEE is very interesting and easy to get media attention (e.g. the piece you may have seen in The Times last week about the resurrection of an ancient Greek instrument). However the work is being done with VOs that GridPP doesn't necessarily support e.g. EUMEDGRID, EELA and GILDA." Should GridPP sites once again be actively encouraged to support these sorts of VOs? DB suggested that since resources are currently under-used we should encourage VOs. JG suggested that we try to find those with some kind of UK involvement or connection as a priority. JG was asked how he finds out about new VOs and what they want. JG noted that he used to get this sort of request but more recently he had not.

ACTION: JC to investigate how EGEE VOs request resources.

2) We held a UKI face-to-face meeting on Thursday to discuss regionalisation and a GridPP DTEAM workshop on Friday: http://indico.cern.ch/conferenceDisplay.py?confId=53442. There was a lot of discussion as you might imagine. The following points are to give a flavour of the discussions and I apologise that they are not well rounded - I'm working from pages of notes that need tidying. For the PMB we may like to only discuss items marked with * rather than -.

Thursday
********

- Agreement on team structure [teams of 2 split for follow up and new tickets]
- Clarification on ROD procedures - but we need to follow up procedures in some areas.
  Set up rota and extend training for EGEE coordinators
- Concern about possible move back to a UK-based helpdesk (currently use GGUS)
- Reviewed regional dashboard and noted concerns/issues/questions. For example, links to raise new tickets are not present.
* Does GridPP have use-cases for regional GOCDB (bolt-on schema areas)? This would be information GridPP requires about its sites.
  This was discussed but no additional requirement was known. JG remarked that this was probably not surprising given that the UK had designed the original global schema.
- Questions about VOMS policies such as data retention periods need to be reviewed.
- Further information is required about the EGEE message bus (for use with Nagios) and how our testing would interface with central instances.
* Action to follow up on UK SAM instance placement (at RAL?)
  This was discussed and RAL was the logical place (though Oxford is leading in some ways). The issue is probably who can provide 24x7 support. However, it is not clear if this is a Tier-1 or e-science (NGS) area.
- Clarified UKI test instances for middleware releases.
- Uncertainty about how the LHC experiments fit in the EGI world. Support units are the only clear interface.
- Reviewed GridPP/UK Nagios instance. Discussed tests currently run and their usefulness. Open questions about the VO used for submission. Will explore how to post results to central SAM. T2Cs will acknowledge/disable tests per Tier-2.
- Open questions remain around what we provide with our first line support. How proactive should the team be in helping with problems (note this varies in the current model)?
- What training do UKI sysadmins want or require? Think again about extending HEPSYSMAN training.

Friday:

* Review of team structure. Question about the differences between the 3 storage posts.
  This was discussed: there are 2 storage posts and 1 data post covering dCache, DPM, CASTOR and data management tools.
  SP suggested reviewing Project Map milestones and metrics to identify the person responsible. TD agreed but noted that it should wait till after the Edinburgh post is re-filled, as currently the GU post is covering some of this area and the exact split should be optimised once the candidate's strengths are known.
  ACTION: SP to relate milestones/metrics to individuals where possible.
- Considered dissemination options. Blogs are useful but lack a proper search function.
- Team would like further T1 representation/issues discussion
* Web pages are in need of another review. GridPP website needs something akin to the WLCG directory (http://tinyurl.com/detqf9). Need a site map.
  It was noted that there was a reasonably good site map but that it was weak on the deployment side.
* Questions asked about GridMon. Was useful but now not supported.
  ACTION: SP to chase with Robin Tasker and report back.
- Monitoring needs to be brought closer to experiment/shifter view
- Grid services were reviewed. No clear need for more WMSes BUT current WMSes have (internal) problems!
- Will deploy one tBDII per Tier-2. Unclear if there are still load concerns at the Tier-1.
- DTEAM view is that sites are ready for SL5 WN migration. There still seems to be conflicting demand (is SL5 needed urgently or gradually?)
- There is a risk to GridPP if SGE is not sufficiently integrated/tested for CREAM.
* Extended discussion on WN space requests. We need a clearer indication of requirements from the experiments.
- Reviewed efficiency variation across sites. Some interesting differences that need to be followed up (for example Liverpool is >90% for every VO while most sites see 35%-90% depending on the VO).
- Move to SpecHEP benchmark needs a clear procedure.
* Implementation of resources for T3 (shared with T2) currently not well understood technically.
  ACTION: JC to suggest a more specific agenda item for the UCL DB meeting.
3) RAL Tier-1 experienced DNS problems on 9th March and this caused a variety of problems for the experiments.

Other information:

4) Site roundup:
   i) Lancaster has had availability problems since the installation of a new core router and the tweaking of Maui queue priorities.
   ii) ECDF affected by high loads on SE.
   iii) Bristol seeing GPFS performance problems on its HPC cluster.
   iv) Problems at Oxford might be due to the CE root partition getting full.
   v) UCL-HEP is down while their DPM headnode is replaced.
5) There was a GDB at CERN last week: http://indico.cern.ch/conferenceDisplay.py?confId=45473. Roll out of 64-bit WNs on SL5 is expected to start during April. The UI will be the next ported component. The ALICE WMS problems do not seem to be fully resolved following a mega-patch installation (it was hoped it would address many possible contributing factors). New security policy discussed: VO portal policy. Further feedback given on the User Level Job Accounting policy.
6) Consensus in the deployment team is that the next HEPSYSMAN meeting should be delayed until after HEPiX (Sweden, May). Dates to be proposed shortly.

SI-6 LCG Management Board report
---------------------------------

DB reported that the GridPP post-mortems, or Serious Incident Reports, were praised and held out as a model for other wLCG Tier-1s. CERN had received heavy criticism for a badly executed network intervention that brought down CASTOR. RAL had been noted as one of two sites that "consistently failed the experiment tests"; however, this was based on a 2-week period when various upgrades (CASTOR) and some underlying LHCb problems had contributed significantly. AS noted that the message here was that the Tier-1 should probably pay closer attention to these experiment-test plots. There were presentations on adapting SRM to report "storage-busy", and a long presentation on virtualisation and multi-cores which was a look ahead to potential gains (shared memory etc).
SI-7 Dissemination report
--------------------------

SP reported that Neasan O'Neill had been to the EGEE user forum and had written a news item http://www.gridpp.ac.uk/news/-1236761545.114420.wlg.

REVIEW OF ACTIONS
=================

332.1 AS to provide a plan for the tape drives: The experiments would be producing new numbers by ~end of March which would enable this action to progress. O N G O I N G.

332.3 PC to pursue the issue of the network resilient link - providing installation costs and annual costs, and report back to the PMB. PC had sent info round, and Robin Tasker would provide further info at GridPP 22 UCL. The issue would need to be referred back to the OC. Ongoing.

336.1 JC to document procedure for ensuring black-listed sites are re-instated. JC noted that experiments have different ways of 'blacklisting' (re top level results & switches etc). Ongoing.

337.4 JG to circulate details of current plan and progress in relation to FTE effort on storage at RAL. Ongoing.

338.3 DB to raise the issue at the MB that, in response to job failures at ATLAS, a 'standard' solution at the back end of the SRM would be the best solution for all, instead of organising locally (implementing 'data movers' using non-grid tools) by experiment. DB to contact RJ to clarify this issue. DB had emailed RJ about this and felt that he should not raise ATLAS issues at the MB. RJ agreed and the action is C L O S E D.

339.3 AS to advise DB who the other speaker would be for the 2nd Tier-1 talk at GridPP22. AS announced that Martin Bly will talk. ACTION C L O S E D.

339.8 JC to follow up VO Registration cards. JC reported that some VOs need to be decommissioned; there were also VOs at institute level but these do not appear on the CIC operations portal. Ongoing.
ACTIONS AS AT 16.03.09
======================

332.1 AS to provide a plan for the tape drives: The experiments would be producing new numbers by ~end of March which would enable this action to progress. O N G O I N G.

332.3 PC to pursue the issue of the network resilient link - providing installation costs and annual costs, and report back to the PMB. PC had sent info round, and Robin Tasker would provide further info at GridPP 22 UCL. The issue would need to be referred back to the OC. Ongoing.

336.1 JC to document procedure for ensuring black-listed sites are re-instated. JC noted that experiments have different ways of 'blacklisting' (re top level results & switches etc). Ongoing.

337.4 JG to circulate details of current plan and progress in relation to FTE effort on storage at RAL. Ongoing.

339.8 JC to follow up VO Registration cards. JC reported that some VOs need to be decommissioned; there were also VOs at institute level but these do not appear on the CIC operations portal. Ongoing.

341.1 AS to review resilience of services that may have to remain in the ATLAS building.

341.2 RM to contact Ben Waugh about room and phone/video facilities for the next JRU meeting.

341.3 NG to consult/inform GridPP PMB on responses to EU questionnaires.

341.4 JG to establish new date for the wLCG-ORACLE meeting.

341.5 JC to investigate how EGEE VOs request resources (relates to enabling more VOs in the UK).

341.6 SP to relate milestones/metrics to individuals where possible.

341.7 SP to chase non-functioning of GridMon with Robin Tasker and report back.

341.8 JC to suggest a more specific agenda item for the UCL DB meeting about sharing T3 resources with T2.

INACTIVE CATEGORY
=================

282.8 RM to monitor how R-GMA and networking issues impact on GridPP as matters progress. RM advised that this item should be moved to the 'inactive' category as it will develop over the coming months.
RM discussed the issue with Steve Fisher and advised that support of R-GMA is required whilst APEL is dependent on it. RM reported that he has spoken to SF and there is currently no change to the R-GMA situation - process ongoing. RM advised that a small amount of effort was going into R-GMA on APEL but for the long term he wasn't sure. The item needed to be kept here for review from time to time, and required to be re-visited around Easter 2009.

AOB
===

AS asked about the CERN network outage on the 19th and what was expected to fall over. TD asked whether there was a plan to use this as a global test to understand how a CERN outage would affect the Grid. TC said he would mention this to Jamie.

The next PMB would take place on Monday 23 March at 12:55 pm.