GridPP PMB Minutes 350 - 8th June 2009
======================================

Present: David Britton (Chair), Sarah Pearce, Tony Doyle, Andrew Sansum, Robin Middleton, Steve Lloyd, Jeremy Coles, Dave Colling, Pete Clarke, Tony Cass, Roger Jones, Glenn Patrick

Apologies: Neil Geddes, David Kelsey, John Gordon

1. GridPP23 theme
==================

DB advised that a registration page would be required shortly - one idea was to focus the meeting on users. DB had sent the user reps an email re input but nothing had been received; however, the Agenda would need to be defined. GP suggested that a user-focussed meeting would be preferable following LHC data-taking, rather than just now. DB suggested an experiment-site discussion session, and one on how we support users. TD commented that there was not a huge amount of time for discussion over the two days. However, JC agreed that it would be good to get the experience of users who have joined. DC noted that we would need to offer something specific to get users to attend; also, such a meeting focus might encourage those with bad experience to merely turn up to complain. GP advised that it would be difficult to attract a mix of people at this time.

DB suggested that an experiment-specific workshop was possible, but it was possibly not the right time for that. The next GridPP meeting, taking place at RHUL at Easter next year, would be a better opportunity for user topics - it would be a more natural order then, following LHC switch-on. It was noted that the last meeting was 'are we ready' - it might be helpful to do something similar again. DB advised that there was a WLCG STEP09 post mortem workshop happening in July - the GridPP meeting following could be more generic: 'final steps to LHC' or similar, distilling the STEP'09 lessons - this would address issues like: how are we going to support users, etc.

2. All Hands Meeting - paper
=============================

DB noted that GP had submitted a top-level GridPP paper last year, which had been one of the minority of papers published in Phil. Trans. A. DB suggested that, with another generic GridPP paper, we would be unlikely to succeed this time; therefore the paper needed to be more specific, coming from an individual or a group, at lower level. PC advised that even if we thought we could get another high-level paper, it wasn't really required at this stage. DB suggested waiting a year and submitting one next year once data has come through - effectively highlighting the outcome of 7 years of work.

PC noted that it would be good to give some framework or guidance to assist submission, or we would be unlikely to get any interest in doing this. DB suggested that PMB members could work with people to assist others to submit to the AHM, or a combined paper was possible - it would be good to get other names on the papers. It was noted that abstracts were due at the end of June, for possible publication by November.

DC advised that we could co-ordinate a STEP paper. GP noted that he would be happy to assist up to a point, but had other commitments at this time. DB asked if DC could co-ordinate something? DC confirmed yes. DB asked if PC would be interested in talking to Greig Cowan and trying to involve data management people? PC confirmed yes. DB asked AS if he would be willing to collaborate on resilience? AS confirmed yes. SP noted that students could submit as well - there would be a prize for best student paper.

ACTION
350.1 DB to investigate the possibility of submitting an abstract to the AHM.
350.2 DC to investigate the possibility of submitting an abstract to the AHM.
350.3 PC to investigate the possibility of submitting an abstract to the AHM.

3. Week's Notes
================

- top-level BDII in the UK

DB noted that one existed at both Glasgow and Manchester, and was not sure why this discussion had happened at the CASTOR meeting? AS advised that this came via Tier-1/experiment liaison, and came out of a general discussion re planned shutdown and WMS access being moved. The conclusion was that it would only be out for 8-12 hours. JC advised that this also related to site-level BDII issues at RAL with reference to load. The dTeam conclusion had been that a lot of infrastructure was required to support BDII. DB re-iterated that there existed a top-level BDII at Glasgow and Manchester but no automatic failover - Mike Kenyon and Sam Skipsey at Glasgow would need to be contacted, but in principle this had been running for a long time now and was available as an alternative. DB suggested that this could be treated as a test. AS confirmed that Matt Hodges could arrange the test.

ACTION
350.4 AS to investigate whether the Glasgow BDII can be tested as a backup UK BDII during the downtime associated with the move to R89.

- 'contact list'

DB noted that we had various contact lists, and in advance of LHC startup we needed to check that they were up-to-date. JC confirmed that the site numbers of site admins were in the GOCDB. DB advised that the 'people and roles' on the GridPP website was not up-to-date.

ACTION
350.5 JC to check and verify that the contact list on the GOCDB is up-to-date - to be done by September.

DB noted that we needed to review the experiment contacts also. He asked where they came from? If there was a problem with ATLAS, who would be called? GP advised that the UB list on the UB pages, plus email lists, would be used.

ACTION
350.6 GP to check and verify that the contact list on the UB pages is up-to-date - to be done by September.

- Royal Society application

It was noted that that time had come around again ... would they turn us down again this time?
DB advised that combined experiment input rather than GridPP input might be preferable. It was considered that the chances of inclusion were not high. TD suggested leaving it until next year. SP agreed, saying we should try again once we have results to discuss. This was agreed; however, experiments were not precluded from submitting, if they wished to do so, with GridPP support.

4. Networking
==============

PC reported that the network forward look was taking a long time, and no-one regarded it as urgent. PC noted there was no problem at the moment and nothing further was required to be written down. Doing this annually was valuable as a focus. PC advised that the document needed to be finished - he had sent it to DC and RJ - could he have a reply, or organise a phone meeting? Friday was suitable for all three, so PC would send a mail to confirm a phone meeting for Friday in order to finalise the document.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS reported as follows:

1) R89 migration planning continues on track. Schedule is on the blog at:
http://www.gridpp.rl.ac.uk/blog/2009/05/14/schedulemovenewbuilding/
Migration of non-Tier-1 equipment started today (includes the CA) and by the end of the week we will know if the logistics plan for migration is achievable.

2) Disk and CPU installations are complete and vendor testing is going well - we plan to start our own tests this week (1 month burn).

3) Robot installation is complete and it is ready for drive installation during the R89 migration.

4) The failure to complete the network upgrade to the C300 core switch had a bigger impact than first realised. It has left us with two 10Gb links (OPN and SJ5) combining down to a single 10Gb link to the Tier-1. This is peaking at 8Gb/s during STEP. We are considering our options but may need to fit a second 10Gb board on the existing UKLIGHT router.
5) The site needs to carry out a major network upgrade (to remedy problems they have had for over a month). It will take over an hour (maybe more). The 23rd June has been proposed but is not definite yet.

6) Bad block scrubbing of non-production CASTOR (and production NFS/xrootd) will start tomorrow. Scrubbing of CASTOR production servers will not start until after STEP.

Staffing:

1) The first experiment support post has been accepted. The second post is shortlisted - interviews on 15th June.

2) The EGEE PPS recruitment failed and we are seeking authorisation from STFC to re-advertise.

3) The YII student (funded by ESC) is expected to start in July.

4) The CASTOR d/b admin post has been shortlisted and interviews will be held on 16th June.

Service:

1) SAM availability last week was 100%.

2) CASTOR - We are still waiting for the BIGID fix from ORACLE.

3) STEP09 - Operation is going very smoothly during STEP. No significant operational problems have occurred other than a 4 hour break in the CASTOR service for ATLAS following an unusual BIGID failure mode. Full details are at:
http://www.gridpp.rl.ac.uk/blog/category/step09/
We are currently trying to get more ATLAS jobs running following a ticket from ATLAS noting that they were failing to reach target share. Our first attempt at 11:00 this morning to overcommit memory failed and we are considering our next move.

SI-2 ATLAS weekly review and plans
===================================

RJ advised that there was a report on the website. Generally they had been doing well: the Tier-1 processed 400,000 files in 6 days and tape was holding up well, but there hadn't been much CMS input, so they would learn more next week with a combined load. DC countered that CMS had been using it all last week with pre-staging. RJ asked if he could have the numbers for that? AS advised that they had an overall rate-to-tape plot, doing 4-5GB per second; he was meeting with the tape team tomorrow. Four drives were available for ATLAS and four for CMS.
RJ reported that over the weekend they got 30% of CERN output - more than usual - and handled it ok; generally they were being over-driven. AS noted that he had been watching the Tier-1 to Tier-2 traffic and saw doubled rates this morning, and spikes out to JANET. RJ advised that they had been doing 283 MB per second from RAL to the Tier-2; they had problems with the number of running jobs maxing out at 400-600 - this was the consequence of a request for 3GB memory per job, which meant that shares were lost elsewhere. AS advised that these issues were being investigated at present - ALICE jobs had been waiting to start in the queue and Matt would let him know.

RJ reported that there was a support plan for the Tier-1 when they hit problems - it was good to keep general user jobs out, which was helping at the moment. Re the Tier-2, with analysis jobs there was a problem re data transfer from RAL to Tier-2 sites - they were throttling back - they need to understand the problem. Site storage tuning was happening; RJ reported contention on storage. RJ advised that RHUL had issues with slow file transfers and this was being investigated. The general conclusion was that once STEP was out of the way, they would do a data mining exercise to sort out analysis issues.

TD asked AS about the highest overall rates so far? AS reported that at the Tier-1, 15GB per second had been the highest rate they'd seen overall; this had spiked at 35GB per second.

SI-3 CMS weekly review & plans
===============================

DC suggested that people look at the Wiki pages for a full report of current experience. Key issues were:

- Prestaging of data (going reasonably well at RAL ~50Mb/s)
- rates writing to tape (100MB/s when streaming - clearly mounting time in addition)
- Real users getting rather annoyed at the STEP fake analysis jobs taking too many analysis slots and slowing down real analysis that is going for approval.
- CMS overall having many STEP-unrelated problems with other T1s (FZK, CNAF and IN2P3).
- Generally UK going quite well.

SI-4 LHCb weekly review & plans
================================

GP reported as follows:

Points relating to RAL:

1. Brief interruption to Castor operations (primarily affected the transfers) when a database fix was put in to better debug the bigID problem.

2. "Over-configuration" of LHCb_MC_M-DST space token caused unnecessary local copies to lhcbRawRdst (a d0t1 class) service class before transfers. Fixed now.

Summary of jobs at RAL since 1 June 2009:

Successful : 7197
Failed     : 112
Stalled    : 42

UK Tier 2 issues (to be raised in dTeam):

1. Cambridge : All pilots with role "lcgadmin" are aborted with "Condor and Maradona" error. GGUS ticket opened - https://gus.fzk.de/ws/ticket_info.php?ticket=49224

2. UCL : Waiting for new CE (what is the status?)

3. Imperial-LeSC : Software installation jobs stalled. We will retry before opening GGUS ticket if needed.

4. Scotgrid-ECDF : Shared area filled up - software removed and reinstallation. Problem during the reinstallation - will retry.

Outlook/Operations:

1. FEST operations will start this afternoon.

2. Started dummy production of minimum bias Monte Carlo to test out the latest version of LHCb application software. Steady pressure of 6K - 7K jobs on the grid since 5 June.

3. TED runs were taken over the weekend by LHCb. All expected tests and configurations tried. Lots of tracks recorded and being analysed.

SI-5 Production Manager's Report
=================================

1) STEP09 is progressing well with most GridPP sites fully engaged - meaning sites reactive to requests for changes, site-admins actively reviewing clusters (spotting disk server loads under rfio reaching maximum for relatively poor throughput in some cases) and digging up usage information to help resolve issues.
There are some issues with general users having to wait for output as CPU utilisation across GridPP has crept up above 95%. The internal networks at sites are being heavily tested - Oxford for example doubled their bandwidth from 1Gbps to 2Gbps and it was all used as soon as made available! Event efficiencies are low as a consequence. Also the previous rfio tuning done for DPM seems not to work so well for new larger files.

2) The May 2009 availability and reliability figures have been published (circulated earlier today). Two results pull the region low overall - UCL-CENTRAL and Manchester. For Manchester the 0% looks wrong because the site has been running normally and our straight-from-SAM figure suggests 98% availability for the last month (http://pprc.qmul.ac.uk/~lloyd/gridpp/samtest.html). Indeed looking at the reliability plot also shows everything normal during May:
http://pprc.qmul.ac.uk/~lloyd/gridpp/plots/SAM_R_Recent_UKI-NORTHGRID-MAN-HEP.png
On further investigation it looks likely that the site-BDII is not publishing correctly. For UCL-CENTRAL there are problems:
http://pprc.qmul.ac.uk/~lloyd/gridpp/plots/SAM_A_Recent_UKI-LT2-UCL-CENTRAL.png
The site instability is caused by outages of the cluster file system being used (Lustre) and problems encountered when trying to put in place permanent fixes. Reliability-wise the main issue is the CE running out of resources due to a limit of 2GB memory (replacement is in progress).

3) From 15th June regular COD will be obsolete and we will be running with regional on-duty teams only. For UKI our team is composed of the 4 EGEE coordinators plus John Walsh from Grid Ireland. We are encountering some problems in setting up the ROD environment and tools but will be ready for the 15th. Coincident with the 15th changeover there is a meeting in Helsinki to firm up inter-region operations.

4) Benchmarking is progressing in each of the Tier-2s.
Most sites have now ordered the benchmarking suite with an intention of completing the task in the coming month. About five sites have finished. There is now a good guide to benchmarks for various hardware on the WLCG pages which sites could potentially use if unable to benchmark themselves.

5) There are more suspicious IPs being checked in relation to the ongoing ssh incident. Balancing these checks against the extra load for STEP09 means that there are some delays in checking logs. Gauging the relative urgency of tasks is likely to become more difficult and therefore harder to manage. Perhaps we need a GridPP-wide mirroring of the priorities as seen in the Tier-1.

SI-6 LCG Management Board Report
=================================

DB noted no Board meeting last week; it had been cancelled.

SI-7 Dissemination Report
==========================

SP reported that Neasan O'Neill had put up a news item about STEP'09. He had spoken to CERN regarding a press release about STEP'09 but there was no info available yet. They will try and push this forward. DB noted that we may get some good numbers out of this that we can use - point to achievements prior to LHC startup, concentrating on resilience etc. There was a discussion of an 'opening' event for R89 but it was decided to leave this till the end of the year.

REVIEW OF ACTIONS
=================

332.1 AS to provide a plan for the tape drives: In progress. ONGOING

341.5 JC to investigate how EGEE VOs request resources (relates to enabling more VOs in the UK). ONGOING

345.1 JG to speak to RT regarding GridMon and GridPP funded network effort. ONGOING

346.3 PC & AS to speak to Robin Tasker regarding receipt of high-level or ticket information from JANET on the service. ONGOING

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job aborts caused by incorrectly set up environments. ONGOING

349.1 AS to report back next week re the MSS metrics plan.
AS reported that the info we used to do manually was almost complete re automation - a live feed will be available, therefore presence on CERN metrics was possible. DB advised that it would be useful in the UK for us to have a simple set of numbers as part of the dashboard, numbers that are easier to monitor and provide feedback to experiments. DONE, item closed.

349.2 JC asked to advise dTeam that it was their last chance to provide comments and feedback on the accounting security policy. DONE, item closed.

ACTIONS AS AT 08.06.09
======================

332.1 AS to provide a plan for the tape drives: this was being finalised this week.

341.5 JC to investigate how EGEE VOs request resources (relates to enabling more VOs in the UK). JC reported that there had been a change of responsibility within EGEE - he was checking on the best person to ask.

345.1 JG to speak to RT regarding GridMon and GridPP funded network effort. DB reported that he had offered that a sysadmin at Glasgow get involved. GridMon had been raised with RT but no response had yet been received from Mark Leese. DB needed to speak to JG.

346.3 PC & AS to speak to Robin Tasker regarding receipt of high-level or ticket information from JANET on the service.

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job aborts caused by incorrectly set up environments. This was still in progress - he needed to extract data but was busy with STEP at the moment.

350.1 DB to investigate the possibility of submitting an abstract to the AHM.

350.2 DC to investigate the possibility of submitting an abstract to the AHM.

350.3 PC to investigate the possibility of submitting an abstract to the AHM.

350.4 AS to investigate whether the Glasgow BDII can be tested as a backup UK BDII during the downtime associated with the move to R89.

350.5 JC to check and verify that the contact list on the GOCDB is up-to-date - to be done by September.
350.6 GP to check and verify that the contact list on the UB pages is up-to-date - to be done by September.

AOB
===

DB reported that the Oversight Committee meeting date was Tuesday 15 September 2009. It might be possible to submit papers following the PMB F2F on the Monday at Cambridge. DB needed to discuss this with SP, and confirm a plan of action. It was noted that this would be a new Committee, therefore some element of the meeting would be to get them up to speed.

JC reported that 13 groups had signed up to the gLite Open Middleware Consortium.

The next PMB would take place on Monday 15 June at 12:55 pm.