GridPP PMB Minutes 349 - 1st June 2009
======================================

Present: David Britton (Chair), Andrew Sansum, Robin Middleton, John Gordon, Steve Lloyd, Jeremy Coles, Dave Colling, David Kelsey, Tony Doyle, Pete Clarke

Apologies: Sarah Pearce, Tony Cass, Neil Geddes, Roger Jones, Glenn Patrick

1. Week's Notes
================

- Mass Storage Metrics (MSS) for wLCG (raised by MB)

This related to the publishing of performance metrics. JG advised that we run the tape servers so that the data collection is in there, and CERN extract from the logs. CNAF have sent us scripts that they use to copy the info. DB noted that we didn't want to be the only Tier-1 not publishing. AS advised that he would be attending a meeting on this shortly.

ACTION 349.1 AS to report-back next week re the MSS metrics plan.

- date of next OC

DB reported that the OC were trying to set up a meeting for the beginning of week commencing 14 Sept (probably the Mon, Tue or Wed), but this was not certain yet. PMB were asked to note these potential dates for their diaries. DB advised that the F2F at Cambridge would probably be too late for a meaningful discussion, so we would need to plan in advance for preparations for the OC meeting.

- updated security policies

DB noted that three security policies had recently been circulated: accounting; VO portals; security incident responses. DK advised that the accounting policy had been around as a draft for a long time but had never been formally approved. It addressed user-level accounting and privilege issues; the deadline was the end of this week. wLCG policy was to turn it on. DK was assuming that all was fine. JC was asked to note at dTeam that this was the last chance to comment.

ACTION 349.2 JC asked to advise dTeam that it was their last chance to provide comments and feedback on the accounting security policy.

Re the VO portals policy, DK advised that it was a new policy that came out of the EGEEII Working Group.
The policy describes the authentication classes against the portal, and applications they can run. It was optional to mandate the robot certificate. The policy had been around as a draft for some time but little feedback had been received. The deadline for feedback was 12th June.

The final policy was an updated security incident response. DK advised that the old policy had been around and adopted for two years now; it mixes up the policy statement and procedure, and this new version updates it and clarifies these separate issues. DK has received some comments. Deadline is 12th June. JC was asked to note this policy at dTeam.

ACTION 349.3 JC asked to advise dTeam that it was their last chance to provide comments and feedback on the updated security incident response policy.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS reported as follows:

Fabric
======

1) R89 migration planning continues on track. Schedule is on the blog at: http://www.gridpp.rl.ac.uk/blog/2009/05/14/schedulemovenewbuilding/
2) Tier-1 team moved to the new R89 building this morning.
3) Disk and CPU installations are complete and vendor testing is underway. Once that completes, our own testing will commence.
4) Robot installation is complete and stress test is well advanced.
5) Two further network interventions were carried out by the network group last week to move our C300 core switch to act as a router. Both interventions failed, the second causing considerable disruption. We will be meeting with the network group to review what went wrong. This change may not now be possible to carry out before data taking. The implications are mainly minor inconvenience to fabric operations (who will need to continue to maintain host based routing tables).
6) The site needs to carry out a major network upgrade (to remedy problems they have had for over a month). It will take over an hour (maybe more) and cannot wait for the Tier-1 migration to be in progress.
We are trying to avoid STEP but this may not be possible. No date yet available from network group.
7) Bad block scrubbing not yet started.

Staffing
========

1) The first experiment support post has been accepted. The second post is shortlisting.
2) The EGEE PPS recruitment failed and we are seeking authorisation from STFC to re-advertise.
3) The YII student (funded by ESC) is expected to start in July.
4) The CASTOR d/b admin is advertised.

Service
=======

1) SAM availability last week was 95%. Degradation caused by network intervention.
2) CASTOR
a) The workaround for BIGID is in place, although cleanup remains manual. A new release of CASTOR and the SRM will be available to avoid this problem in future.
3) We are waiting for STEP to start.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ was absent this week.

SI-3 CMS weekly review & plans
-------------------------------

DC reported that the big issue this week was STEP starting today for CMS. They were doing pre-staging tests; two different approaches were being used. The Tier-2s seemed OK, and no UK ones were below 80%. DC would send round some plots.

SI-4 LHCb weekly review & plans
--------------------------------

GP reported as follows:

Over the last week for LHCb at RAL, user jobs have run mostly fine, though there have been some glitches in production jobs, mainly due to DIRAC problems. We ran a series of small productions to simulate between 100K and 30M events. Some of these have already finished and some are still running. There have been a lot of problems at the dCache Tier-1s which have taken up the attention of the development and operations team over the last week.

The network interventions at RAL did not seem to affect LHCb as the new procedures in place by the Tier-1 were followed to make the intervention almost transparent to us.
We still do not know what the problem was with the disk server gdss163 at RAL, which failed twice (memory corruption) over the last 3 months leading to loss of data, or whether it has now been solved. AS reported that he was trying to find out about this - he would need to talk to GP on his return about which server is involved as AS thought this issue had been dealt with already.

STEP 09 for LHCb is nominally scheduled for the week starting 8 June 2009. The aim is to exercise at nominal rate the real data recording (Tier0), transfer (Tier0->Tier1), reconstruction (Tier0+Tier1 for LHCb, Tier0 for others) as well as re-processing (Tier0+Tier1 from data not present on the disk cache, i.e. recalled from tape).

For the upcoming week, we will have more small productions and the normal user jobs.

SI-5 Production Manager's Report
---------------------------------

JC reported as follows:

An update from the deployment and operations area:

1) Preparations for STEP09 have dominated work over the last 1-2 weeks. UK sites are well informed of what is required and have received good support in setting up. A few sites (QMUL, Liverpool, RAL-PPD and Oxford) have lacked sufficient disk space for the desired ATLAS data imports, but this is being worked around.
2) Sites have been informed of the Tier-1 service migration and the Glasgow and Imperial WMSes have been updated to support additional UK VOs.
3) There is a desire to use a virtual meeting in Skype (as ATLAS and ScotGrid operations have done) to enable region on-duty staff to communicate with sites; however, there are uncertainties about usage at some sites. There was discussion of a JANET based jabber/chat tool but this does not seem to be available to use at the moment. Google chat is a likely alternative.
4) With help from APEL support, Alessandra has managed to recover records for Manchester from November onwards. It was previously reported that Manchester would have a gap as the records could not seemingly be recovered.
5) There has been a new discussion about user banning after an ATLAS user (Kors) was seen to be "attacking" a site LFC. It seems there is no correct way to ban a user on the LFC. This raises questions about how the experiments run their production work and how sites should respond to problems. Note these two comments "you definitely mustn't ban Kors Bos. His certificate is used to run all the production atlas data management activity!" and "If the activity kills the LFC, the admin has every right to block the user, whoever he happens to be!" Do we have a timeline for moving away from user certificates being used in this way?

SI-6 LCG Management Board Report
---------------------------------

JG reported that he had joined remotely at HEPiX. There had been a lot of discussion on tape metrics; also on SL5 - they asked the architects' forum to address issues in relation to SL5; there was discussion on the solution to compiler.

SI-7 Dissemination Report
--------------------------

SP was absent this week. DB advised that a news article had been posted on the website about 20/20 vision.

REVIEW OF ACTIONS
=================

332.1 AS to provide a plan for the tape drives: In progress.

339.8 JC to follow-up VO Registration cards. JC reported that some VOs need to be decommissioned, there were also VOs at institute-level but these do not appear on the CIC operations portal. ONGOING.

339.8 JC to follow-up VO Registration cards. JC reported that some VOs need to be decommissioned, there were also VOs at institute-level but these do not appear on the CIC operations portal. This issue has been addressed in the VO Policy Document in relation to VO decommissioning. DONE, item closed.

341.5 JC to investigate how EGEE VO's request resources (relates to enabling more VO's in the UK). JC was awaiting a response re policy & procedure. ONGOING.

345.1 JG to speak to RT regarding GridMon and GridPP funded network effort. ONGOING.
346.3 PC & AS to speak to Robin Tasker regarding receipt of high-level or ticket information from JANET on the service. ONGOING.

346.4 JC to summarise the responses from Camont for the PMB in relation to security, reliability, commercial use of networks, and suitability/acceptability of content. JC had circulated a response. AS will read and respond. JC to update once the tests have been done at Cambridge. DONE, item closed.

348.1 DB to publish GridPP24 dates on the meetings web page. DONE, item closed.

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly setup environments. ONGOING.

ACTIONS AS AT 01.06.09
======================

332.1 AS to provide a plan for the tape drives: In progress.

339.8 JC to follow-up VO Registration cards. JC reported that some VOs need to be decommissioned, there were also VOs at institute-level but these do not appear on the CIC operations portal.

341.5 JC to investigate how EGEE VO's request resources (relates to enabling more VO's in the UK).

345.1 JG to speak to RT regarding GridMon and GridPP funded network effort.

346.3 PC & AS to speak to Robin Tasker regarding receipt of high-level or ticket information from JANET on the service.

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly setup environments.

349.1 AS to report-back next week re the MSS metrics plan.

349.2 JC asked to advise dTeam that it was their last chance to provide comments and feedback on the accounting security policy.

AOB
===

DK advised that Steve Dallison had died last weekend, aged 35. He had been working on the Tier-2 since Jan 2008. The PMB wished to pass on their condolences to all those who knew him.

The meeting ended at 1:50 pm.

The next PMB will take place on Monday 8th June at 12:55 pm.