GridPP PMB Minutes 348 - 21st May 2009
=================================

Present: David Britton (Chair; minutes), Sarah Pearce, Andrew Sansum, Glenn Patrick, Robin Middleton, John Gordon, Steve Lloyd, Roger Jones, Dave Colling, David Kelsey, Tony Doyle, Pete Clarke.

Apologies: Jeremy Coles, Tony Cass, Neil Geddes.

Agenda
======

1. News from STFC [DB]
======================

DB reported that STFC had now set up the new GridPP Oversight Committee and that we would be informed of the membership in due course. The next meeting, however, would take place around September and not June as previously planned. This raised the question of whether we should therefore cancel the face-to-face meeting planned for IC on June 4th, since the primary driver had been to prepare for the Oversight Committee meeting. It was agreed to cancel; the next F2F would be on September 7th, before the Cambridge GridPP23 collaboration meeting.

STFC had also informed us that they would not be able to support us at the All Hands Meeting in December in the normal way. STFC would be part of an RCUK stand. Nevertheless, GridPP intends to run a stand, possibly in conjunction with the NGS as a step towards a UK NGI.

2. Issues from the Quarterly Reports [SP]
=========================================

SP summarized issues from the last quarterly reports:

Experiment reports
-------------------------

ATLAS
1. Tier-1 job success rates in the batch system remain red, but have increased from 69% to 80%. RJ notes, "Castor failures improved but still highest source, and batch system failures are high." AS is looking into this.
2. Tier-1 data availability in the storage system increased from 58% to 90%.
3. Tier-2 job success rates in the batch system 74% (down from 86%). RJ - Completely dominated by a misconfiguration at Manchester.

LHCb
1. T1 MC production efficiency improved on last quarter - in particular reconstruction/stripping up from 54% to 97%.
2. LHCb SAM tests uptime T1 is red (87% from target of 98%).
Raja - RAL was among the sites hit by a problem with LHCb transfers blocking the SRM (last week of Jan and first two weeks of Feb). It was caused by problems with grid middleware and its clash with Dirac.
3. LHCb SAM tests uptime T2 just missed the target (79%) - UCL was down for LHCb for much of this period. Raja - It was probably a question of manpower at UCL. I believe they are working on the problem at present, but it is still not working for us and is being followed up in the dTeam meetings.

CMS
1. CMS SAM and availability tests at Tier-2s. DC - Close to target; however, this is hiding some major scalability issues that we found when sites were fully loaded with analysis jobs, i.e. they either work perfectly under light to medium load or fail completely under heavy load. Fortunately we have seen this now and are working on improving the system so that we should be OK for data taking.

Other experiments
1. All metrics OK. CPU efficiency up to 83% (from 57% last quarter) - mainly ALICE, BaBar, ILC and Pheno.
2. User Support and Satisfaction questionnaires for other experiments due next month - GP was to discuss at UB.

Operations
---------------

1. Job success rates down to 83%, continuing the downward trend noted last quarter (95% Q208, 93% Q308, 90% Q408).

JC - Steve's jobs have lower priority than ATLAS production, so when sites are full these jobs are more likely to queue beyond the pass cut-off time in Steve's test framework. Busier sites therefore mean more failures, and this is exactly what we have been seeing since Q208. As we do more real work and the experiments get busier, this test shows declining success rates. For a normal user it may simply mean that the job takes longer to execute.

SL - Actually, this is unlikely to be the cut-off time; it is more likely to be aborts caused by an incorrect environment.

ACTION: JC to investigate whether the decrease in job success rate is due to time-outs at busy sites or to job aborts caused by incorrectly set-up environments.
There is another contributing factor in the last quarter: several sites in Manchester had extended problems running jobs. Lancaster had a problematic CE and Manchester had extended storage problems - both of which impacted the NorthGrid performance in Steve's tests.

2. VO blacklists. At the last review of quarterly reports we agreed to approach the experiments about adding these to their reports. Jeremy has been discussing this at the DTeam, but it would require some work from the experiments. Do we want to formally request this? Depends on how much work.

Issues
3. We have seen some interference from non-LHC VO jobs (mainly biomed and fusion). This has raised some concerns about VO training of users and their approach to using the grid resources.
4. Early test results suggest that larger T2 sites see bottlenecks with storage access when running large numbers of jobs for a given VO (hot files, protocol overheads and file distribution).
5. Staffing critical at ECDF and Imperial LeSC.

Grid Support
--------------

Storage and data management
----------------------------

1. Metrics now all green.
2. Correctly defined space tokens as per experiments' requirements now at 99% (from 78% last quarter). However, not all sites are publishing dynamic information correctly. QMUL is publishing all its tokens twice, because they have two SEs. Birmingham and RALPP are also currently not publishing correctly.
3. Three milestones now overdue, mainly because of staff leaving:
a) Deliver a tool for sites to report on storage usage per user in a VO for DPM. Jens - Accounting by user is known to be difficult unless users have individual local accounts (in which case it's fairly easy); if they're mapped to pool accounts or otherwise share local (Unix) account mappings (e.g. by being mapped to a single account per VO, as we do with CASTOR) then it's almost impossible.
In that case, you need to track data as it comes in, is extended or truncated, or is deleted, by the user's DN, and hope that files are not lost or changed by any other means (or by other users). Neither SRM nor dCache was built to do this or can easily do this. However, Sam is working on this and has made some progress.
b) Study integration of experiment data transfer monitoring with SAM systems and nagios alarm systems on site. Ongoing - Sam is working on this.
c) Deliver a tool for sites to report on storage usage per user in a VO for dCache. The storage group no longer has anyone with dCache experience. We will revisit this milestone and consider whether it is still required.
4. Greig's post has been filled by Wahid Bhimji, who will start on 15 June.

Networking
------------

1. Maintenance of Gridmon nodes and associated software remains red - JG is looking into it.
2. Draft of the GridPP annual networking forward look now with experiments.

Security
---------

1. Two overdue milestones at Manchester - GridSite and security reports. Andrew McNab is working on the GridSite report; we are discussing the security report.
2. Due date for the Security review of Tier-2 sites has been changed to 31 July, to give Mingchao a chance to concentrate first on the security workshop at RAL in July.

Middleware support
--------------------

1. Milestone 'Middleware releases produced in the first year and report on status of multi-platform support' still in preparation.
2. The narrative on APEL use of R-GMA concludes, "It would appear therefore that R-GMA support will be required for at least another year."

No report on WMS.

Tier-1
------------

1. Nearly all the Tier-1 metrics are green. Fabric team metrics are being overhauled to identify more meaningful measures.
2. Job efficiency now up to 88% and farm occupancy at 60%.
3. % of GridPP staff in post now up to 91%.
4. CASTOR SAM tests: LHC VOs (94%/ target 99%).
Gradual improvement as the Tier-1 continues to resolve outstanding problems with CASTOR 2.1.7.
5. Red milestones this quarter are:
a) Disaster and business continuity plan available/ disaster plan fully implemented - work on disaster management is ongoing, expected to be complete in June.
b) Recruitment - reported weekly at the PMB.
c) LHC Monitoring infrastructure operational at RAL - Dante are still to commission this service.
d) Tier-1 fully operational in R89 - expected to be July 2009
e) 2008 disk/CPU received - waiting for R89
f) Tape robot received/GridPP migrated to new tape robot - delayed due to R89
6. Some of the milestones have been rescheduled, primarily to reflect changes in LHC planning:
a) Provide site dashboard for experiments - to Sept 09
b) 2009 disk and CPU tender started - May 09
c) Experiment requirements for 2009 running - May 09
d) 2008 disk/CPU hardware accepted and bill paid - Aug 09
e) Migration to 64 bit - July 09
f) Ready for 2009 running - Aug 09
g) Tier-1 able to meet 2009 WLCG resource commitment - Aug 09
7. We have added a milestone to report on R89 migration in October 09, at request of TC - this is of interest to other LCG sites.

Tier-2s
-------

LondonGrid
1. A huge amount of resources being delivered (more than 1000% of MoU requirements in CPU).
2. Most metrics are still red, although gradually improving overall.

ScotGrid
1. Nearly all metrics are green.
2. SLL ATLAS test performance only 1% at Durham - JC is investigating.
3. ECDF sysadmin support is 'best effort' with no one yet in post.

SouthGrid
1. Most metrics passed. SLL ATLAS tests are red - low at Birmingham, Cambridge and Oxford.
2. Loss of expertise with departure of Yves Coppens and John Wakelin.

NorthGrid
1. Passes (or nearly passes) most metrics, except management meetings. The report has zero management meetings this quarter - is this correct?

External
--------

1. All metrics and milestones in dissemination, LCG and GOC are passed (or nearly passed).
2.
NGI metrics are green - milestones in this section are still being revised.

3. RHUL dates for GridPP24
==========================

A proposal had been made by RHUL to host GridPP24 during the week 13-16 April 2010. Easter was early, April 4th, and the IOP was the last week of March. Although this was during the holidays, it was somewhat inevitable since we needed to fit in with accommodation availability. The PMB agreed the dates.

ACTION: DB to post dates on the meetings page.

4. Weekly Notes
===============

Final call on VO Security Policies: DK had widely circulated the new VO security policies and was in the final 2-week stand-still awaiting any last-minute issues.

AHM 2009 paper? DB raised the issue of a paper for the All Hands Meeting. He did not feel that we could/should write another general paper, but there might be some mileage in something with a more focused theme. DB suggested Resilience. PC suggested something around the idea of data transfer. This was further discussed (could include CCRC09; ATLAS big-file tests; STEP09; Cosmic data etc). However, there were no volunteers to lead this. PC and DB did not feel overly qualified to lead this and it was realized that the experiment reps would be busy around the time the paper would need to be written. It was decided to revisit this issue in 2 weeks. There is a 30th June deadline for abstracts.

STANDING ITEMS
===============

SI-1 Tier-1 Manager's weekly report [AS]

Fabric
======

1) R89 migration schedule is on the blog at: http://www.gridpp.rl.ac.uk/blog/2009/05/14/schedulemovenewbuilding/
2) Disk, CPU and robotics deliveries are underway.
3) The network change that failed on Tuesday 13th May is rescheduled for 26th May.
4) The site needs to carry out a major network upgrade (to remedy problems they have had for over a month). It will take over an hour (maybe more) and cannot wait for the Tier-1 migration to be in progress. We are trying to avoid STEP but this may not be possible.
5) We expect to start bad block scrubbing on the disk arrays next week. We expect that this will reduce the incidence of multi-drive failures but will have a long-term impact on performance. In the short term there is likely to be a spate of drive failures as the scrubbing uncovers undetected problems.

Staffing
========

1) The first experiment support post has been accepted. The second post is shortlisting.
2) The EGEE PPS recruitment failed and we are seeking authorisation from STFC to re-advertise.
3) The YII student (funded by ESC) is expected to start in July.
4) The CASTOR d/b admin is advertised.

Service
=======

1) SAM availability last week was 68%, owing to the major CASTOR upgrade.
2) CASTOR
a) The ORACLE database RAID array upgrade took place. Unfortunately the upgrade encountered unexpected complications (despite substantial prior testing) and overran by about 18 hours.
b) Work continues locally on a BIGID workaround. Oracle have reproduced the problem.
3) A test SL5 service is published and passing SAM. We are working with the VOs to test it.

SI-2 ATLAS Weekly Review and Plans [RJ]

See https://www.gridpp.ac.uk/pmb/WeeklyReports/ATLASReport090515-21.pdf

1) There have been more tests on moving T2s to other clouds (Australia into the Canadian cloud); this has revealed some problems with subscriptions disappearing.
2) There have been some problems with the SARA cloud. It was down last weekend. It had some transfer problems to RAL during the week and had further problems this weekend with data transfers to BNL.
3) There was an upgrade of the central catalogues on Monday. Things went well.
4) RAL Castor has been off much of the week for the DB work.
5) Much discussion internally for UK concerning the resource needs for STEP09.
6) Hammercloud tests are being run in the UK. There is the usual list of new and old problems, but sites are being generally responsive.
SI-3 CMS Weekly Review and Plans [DC]

See https://www.gridpp.ac.uk/pmb/WeeklyReports/CMS%20Summary%2021%20May%2009.pdf

1 week ago: 62.5% T0+T1 (5 sites out of 8) have an availability >90%.
THIS WEEK: 75% T0+T1 (6 sites out of 8) have an availability >90%.

It appears that the RAL problems were a hangover from the change of user mappings on the new that was reported last week. I think that it was more the way that the tests were run rather than a RAL problem.

Not a great week for the T2s. Each had minor problems that kept them out of action for reasonably large periods, although only RALPP was below the magic 80%.

Plans for the rest of the week: Start some pre-staging tests, otherwise business as usual.

SI-4 LHCb Weekly Review and Plans [GP]

1. SAM jobs failing at various sites on the Grid (including Edinburgh and Imperial) because the shared software area runs out of space. Problem at Imperial already fixed by ATLAS cleaning up its area. From the LHCb VO-card, the requirement is for 50GB of space in the shared software area for LHCb.
2. At RAL Tier 1, diskserver (gdss160) down for LHCb due to fsck problems. It was down in early March this year due to the same problems.
3. RAL now running jobs successfully again after the downtime earlier this week.

Outlook: Chaotic user analysis jobs. MC09 production to restart after bugs in application software have been fixed. Some tests being run currently.

SI-5 Production Manager's weekly report [JC]

There was no report this week.

SI-6 LCG Management Board Report of Issues [JG/DB]

The MB had been cancelled this week.

SI-7 Dissemination Report [SP]

SP reported on the release on Thursday of the PP2020 report in a media event. Links to the Grid are primarily focused on the BioMed use of our resources (10% for last 5 years, generating BBC articles on Malaria and Avian Flu). SP also mentioned that NO is preparing to do news items on STEP09 etc but is waiting for some info from the experiments.
Review of ACTIONs
==============

332.1 AS to provide a plan for the tape drives: The experiments would be producing new numbers by ~end of March which would enable this action to progress. AS now has info provided by CMS and Graeme Stewart - he will begin work on this soon. This was ongoing but progressing.

339.8 JC to follow up VO Registration cards. JC reported that some VOs need to be decommissioned; there were also VOs at institute level but these do not appear on the CIC operations portal. JC was not present.

341.3 NG to consult/inform GRIDPP PMB on responses to EU questionnaires. This is de facto done.

341.5 JC to investigate how EGEE VOs request resources (relates to enabling more VOs in the UK). JC was not present.

343.7 AS to look into the CASTOR/ATLAS instance availability (in the light of the experiment red metrics). AS had spoken to GS & RJ and asked for a breakdown to achieve consistency - responses awaited. AS had circulated an email describing the investigation in some detail. Action done.

345.1 JG to speak to RT regarding GridMon and GridPP funded network effort. Ongoing.

346.1 DB to check the current situation re website work and raise the whole NGI website issue with NGS. This was raised: NGS thought it a good idea and (presumably) will take action. Action done.

346.2 DB to raise the issue of SL5 and 32/64-bit middleware at MB & GDB level, for policy decision/direction. This had been raised: the MB had bounced it to the GDB and thus to JG, who was now in the process of writing a report for the MB. This would be circulated to the PMB. Action done.

346.3 PC & AS to speak to Robin Tasker regarding receipt of high-level or ticket information from JANET on the service. Ongoing.

346.4 JC to summarise the responses from Camont for the PMB in relation to security, reliability, commercial use of networks, and suitability/acceptability of content. JC was not present.

347.1 DB to respond to STFC w.r.t. the Tier-1 Hardware schedule.
This had been done - content documented in last week's minutes.

ACTIONS AS OF 21.05.09
======================

332.1 AS to provide a plan for the tape drives: In progress.

339.8 JC to follow up VO Registration cards. JC reported that some VOs need to be decommissioned; there were also VOs at institute level but these do not appear on the CIC operations portal.

341.5 JC to investigate how EGEE VOs request resources (relates to enabling more VOs in the UK).

345.1 JG to speak to RT regarding GridMon and GridPP funded network effort.

346.3 PC & AS to speak to Robin Tasker regarding receipt of high-level or ticket information from JANET on the service.

346.4 JC to summarise the responses from Camont for the PMB in relation to security, reliability, commercial use of networks, and suitability/acceptability of content.

348.1 DB to publish GridPP24 dates on the meetings web page.

348.2 JC to investigate whether the decrease in the job success rate metric in the last quarter is due to time-outs at busy sites or to job aborts caused by incorrectly set-up environments.

AOCB
=====

AS requested info on weekend hiking/running in Scotland. DB agreed to take this offline (Summary: 9 Munros; 90 km; 5000 m).

The meeting ended at 2:25pm. The next meeting will be on June 1st.
_____________________________________________________________________