GridPP PMB Minutes 336 - 2nd February 2009
==========================================

Present: David Britton (Chair), Tony Doyle, Sarah Pearce, Roger Jones, David Kelsey, Pete Clarke, Dave Colling, Jeremy Coles, Steve Lloyd, Glenn Patrick, John Gordon, Andrew Sansum, Tony Cass (Suzanne Scott - Minutes)

Apologies: Robin Middleton, Neil Geddes

1. GridPP Consumables
======================

DB had circulated an email relating to action 322.4, to clarify the use of Tier-2 hardware grants on items deemed as "consumables". The PMB were asked for comments, and all agreed the statement looked fine. It was noted that it needed to go to Dave Colling for comment (not present at the beginning of the meeting). DB proposed to wait until DC had approved the statement, following which DB would circulate it to PIs of grants to assist with their procurement and audit procedures. DC then joined the meeting and responded directly to DB by email. The final statement agreed with the PMB and STFC was as follows:

"A fraction of the total amount awarded on the GridPP Tier-2 hardware grants may be spent on consumable items required to install and run the hardware. These items should conform to the definitions of section 5.1.8 of the STFC fEC Research Grants Handbook which includes, amongst other things, "items of equipment costing less than £3000 (including VAT) including replacement or upgrades of existing equipment". The total amount spent on consumables is not limited to this sum but spending against this heading should not compromise the final delivery of the computing service that the Tier-2 has contracted to deliver."

The URL is http://www.stfc.ac.uk/rgh/rghDisplay2.aspx?m=s&s=135

Following the meeting, DB circulated this to Grant PIs, noting that: "You may find this statement useful internally or when you eventually close out these grants.
In either case, the only thing that it (possibly) adds to what is in the handbook (URL above) is that 'GridPP expects the money to be used to deliver the service for which it was awarded.'"

2. Quarterly Reports
=====================

SP reported that some reports were beginning to come in. She was awaiting CMS and LHCb; GP would check with Raja. SP had received a draft from JC, and also drafts on networking and storage; she was still awaiting security and middleware. DK noted that he was working on security at the moment. Middleware was awaiting inputs from Steve Fisher and DC; TD would remind them. It was noted that the Tier-1 report was still awaited, and the Tier-2 drafts were not yet signed off. SP was also still awaiting EGEE GOC metrics and NGI metrics. The PMB understood that the Quarterly Reports were urgent and needed to be completed as soon as possible. DB advised that next week there would be a published 'blacklist' in the Minutes.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's report
-----------------------------

AS provided the following report:

Fabric:

1) We have concluded that the machine room is unlikely to be delivered by 9th February as required to meet our 3rd March migration date for the Tier-1. At present we have insufficient information from the builders to reach a conclusion as to the likely completion of work. We will shortly be rescheduling the migration plans with those tasked to move our equipment. Unless the builders provide convincing evidence that they can meet a revised schedule, our planning will be conservative and the migration date will be pushed back at least 4-6 weeks (if not more). Input is requested from the PMB as to the latest date they would consider acceptable for a shutdown/resumption in the Tier-1 service given the revised LHC schedule.

AS asked when stability would be required by the experiments. He needed an experiment view now or by email soon after the meeting.
DK advised that the experiments did not know yet and would not know until after Chamonix. AS noted that there may be a date beyond which they couldn't move to the new machine room this year. RJ advised he wouldn't know anything until Friday evening at the earliest. DC noted that the CMS challenge was happening in March. GP also noted that for LHCb there was another FEST exercise in March, but the smaller experiments would be more affected. DB advised that we needed to be in the new machine room with stable operations before LHC data arrived - readiness would possibly be required by end July. Moving this year was probably a prerequisite, as next year would be practically impossible.

2) Owing to the delays in the R89 schedule (indeed predating that) we have been forced to agree payment of 70-90% of the disk and CPU purchases: in one case on receipt of the equipment into RAL storage, and in others on taking ownership of equipment at suppliers' sites (not yet actually implemented). This has been on the advice of our procurement department (that we had very little room to negotiate) following discussions with suppliers.

3) Delivery of the new robot will now be scheduled to complete as soon as the building becomes available.

4) I have been working on tape drive and capacity planning. We have received a new roadmap from Sun which we will build into our planning. Careful consideration will be needed before new plans are formed. No purchase planned this year.

5) Purchasing of remaining items on the spend plan is progressing.

Staff:

1) Advertisements will be issued shortly for the production team post.

2) Paperwork is prepared for the EGEE funded position (PPS); we are in discussion with an internal candidate.

3) Experiment Support posts have been shortlisted. Interviews 3 February (weather permitting).

Service:

1) SAM availability last week was 100%.

2) CASTOR

a) We continue to chase the big ID problem and have sent some dumps to Oracle (but need to acquire further debug info).
This problem is impacting availability for ATLAS (it is also slightly impacting LHCb).

3) There was a 3h break in connectivity on the OPN over the weekend (traffic continued via production links). Cause not yet known.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ provided the following report:

The tail end of the 10M file test has finished. RAL is in a good state, but SARA, FZK, Lyon, Taipei and Triumf have problems to address. We may return to the test including some Tier 2s. Our next user analysis challenge starts at 9am on Tuesday, when we will use the filestager tool instead of posix IO. Cambridge, Bham and QMUL are added to the tests. The ATLAS-SRM call-out this morning requires further investigation, but so far the system looks healthy after a re-start. The SS/VO box is not very healthy and needs an upgrade; this will happen this afternoon. We may have a separate DDM box for each cloud shortly. This caused a backlog on Sunday, but we currently have no backlog of work other than a 'within tolerances' backlog of AOD transfers. We have requested an increase in our MCDISK space token space, as we filled 40 TB in a month. This is in hand.

SI-3 CMS weekly review & plans
-------------------------------

DC noted that things were fairly quiet; CMS was starting preparation for the end-to-end challenge in March. There was an incident affecting the UK where files were lost. No other issues to report.

SI-4 LHCb weekly review & plans
--------------------------------

GP provided the following report:

1. First FEST week has ended. Generally regarded as a success. Events were injected at 1.6 kHz into the HLT Farm with emulation of some missing (Outer Tracker) banks. Followed by storage and transmission of FULL and EXPRESS streams through the offline chain with reconstruction at centres by DIRAC.

2. Whilst running reconstruction on the files at RAL, the big ID problem in Castor/Oracle was encountered. Database cleaned up Friday afternoon.

3. "Issues" with a disk-server preventing access to data on it. Fixed by Tier-1 fabric team.

4. Problem observed with LHCb jobs being killed after 2+ days at sites (including Imperial and Edinburgh). Not yet understood - under investigation.

Outlook:

1. End of FEST and analysis of results. Problems found in most sites.

2. Simulation of 100 million events with updated versions of LHCb application software - likely to start after the end of next week - for next FEST week.

3. Next FEST week scheduled for 2-6 March.

SI-5 Production Manager's report
---------------------------------

JC provided the following report:

We have seen some issues with site-experiment communication but believe it is in hand - basically ATLAS dropped some sites in Panda while problems were fixed and the sites were not put back into production. This was noticed via UK shifters after a few weeks. Although we have a wiki page for monitoring links we are looking at how to improve it so that sites are more easily able to check their status. An upgrade to the CERN grid dashboard will help since it offers "site views" in addition to "experiment views". Tickets are also issued during the site removal process so these should catch the site status and cause it to be reviewed - unfortunately it did not work for a few sites recently. There has always been some concern about how sites can tell whether they have been excluded or whether the VO simply has no jobs running.

On a positive note, GridPP sites contributed 22% of the LHC VO processing during January. The overall ops SAM test results show a 91% availability.

The question of sites getting back in was down to process, and the site admins could go into ATLAS monitoring and check. It was suggested that this could be added to SL's page, which might be a way of checking site status. JC would follow up these suggestions.

ACTION 336.1 JC to document procedure for ensuring black-listed sites are re-instated.
SI-6 LCG Management Board report
---------------------------------

JG had attended the LCGMB this week and noted that the main issue was a presentation on CPU benchmarking. JG would need to check how we publish capacity etc, so that we know what people are publishing. There were also presentations on VO-specific SAM tests.

SI-7 Dissemination Report
--------------------------

SP reported that Neasan O'Neill had booked GridPP for a space at CHEP; he was also organising a UKI stand at the EGEE User Forum. There had been a call for people interested in helping with LHC@home, and a few replies had been received from groups willing to assist.

REVIEW OF ACTIONS
=================

322.4 DB to follow up email with information on what Universities classify as consumables (in relation to GridPP using a figure of 5% as a guide to grantholders, regarding small level of consumables allowed on the Tier-2 Grants). D O N E, item closed.

332.1 AS to provide a plan for the tape drives, given the new information from IB - a detailed plan was required immediately, showing minimum spend relating to no data etc. O N G O I N G. AS was awaiting the new experiment capacity plan, and would keep SP informed.

332.3 PC to pursue the issue of the network resilient link - providing installation costs and annual costs, and report back to the PMB. O N G O I N G.

333.3 JC and DC to write a paragraph which summarises the Tier-2 and Tier-3 positions. This need not come back to the PMB for discussion, but can be circulated once done. Response from DC awaited. O N G O I N G.

334.1 ALL: to provide early drafts for the Quarterly Reports. Required immediately please. O N G O I N G.

334.2 RJ to confirm by email when the proposed three days for GridPP25 at Ambleside had been booked. O N G O I N G.
ACTIONS AS AT 02.02.09
======================

332.1 AS to provide a plan for the tape drives, given the new information from IB - a detailed plan was required immediately, showing minimum spend relating to no data etc.

332.3 PC to pursue the issue of the network resilient link - providing installation costs and annual costs, and report back to the PMB.

333.3 JC and DC to write a paragraph which summarises the Tier-2 and Tier-3 positions. This need not come back to the PMB for discussion, but can be circulated once done. Response from DC awaited. A summary should be written down so that we have something formal to refer to.

334.1 ALL: to provide early drafts for the Quarterly Reports. Required immediately please. The PMB were warned that individuals would be 'named & shamed' next week.

334.2 RJ to confirm by email when the proposed three days for GridPP25 at Ambleside had been booked.

336.1 JC to document procedure for ensuring black-listed sites are re-instated.

INACTIVE CATEGORY
=================

282.8 RM to monitor how R-GMA and networking issues impact on GridPP as matters progress. RM advised that this item should be moved to the 'inactive' category as it will develop over the coming months. RM discussed the issue with Steve Fisher and advised that support of R-GMA is required whilst APEL is dependent on it. RM reported that he has spoken to SF and there is currently no change to the R-GMA situation - process ongoing. RM advised that a small amount of effort was going into R-GMA on APEL but for the long term he wasn't sure. The item needed to be kept here for review from time to time, and should be revisited around Easter 2009.

The meeting closed at 2:00 pm. The next PMB would take place on Monday 9th February at 12:55 pm.