GridPP PMB Minutes 352 - 6th July 2009
======================================

Present: David Britton (Chair), Tony Doyle, Andrew Sansum, Robin Middleton, Dave Colling, Pete Clarke, Glenn Patrick, David Kelsey, Pete Gronbech (for JC)

Apologies: Sarah Pearce, Roger Jones, Steve Lloyd, Tony Cass, John Gordon, Jeremy Coles, Neil Geddes

1. Summary of R89 Status
=========================

AS reported that since the last PMB, the move of the Tier-1 had gone very well, and to schedule. The CASTOR service was up and was moving data; the batch system would be up today; the aircon was looking good. The aim was to keep the system stable, and if they achieved a clear run this week then they would start making changes after that. DB noted congratulations to the team on getting through this to date.

2. Tier-1 Hardware
===================

Hybrid Tape Plan
----------------

DB reported that the background to this related to himself and AS looking at tape and hardware planning - a decision would need to be made, about April 2010 procurement, by the end of July 09. There was also a need to delay £1m of the hardware spend into the next financial year. DB noted that not all pieces of the plan were in place yet. They had been converging on a 'hybrid' tape plan.

AS reported that the original GridPP3 plan, which had foreseen a move to T10KB drives this year, was out of date and that the required capacity for 2009 and 2010 was considerably less than originally planned. AS had considered three options in the re-planning:

Option 1 was to stick with T10KA drives, which would give 5PB in a 10,000-slot tape robot, but this meant it would be very full with no headroom. Subsequent expansion would involve large costs associated with a step-change to either T10KB (all new drives) or T10KC (new media and drives).

Option 2 was as per the original plan - upgrade to the next density of B-drives and compress the existing tapes. Although this would provide headroom, it would still require the immediate purchase of 18-25 T10KB drives, which was expensive and which did not make good use of the recently purchased T10KA drives.

Based on these options, DB & AS had considered (Option 3) that the most cost-effective and flexible way forward was a hybrid service with CMS moving to B-drives and all other experiments remaining on A-drives. It was noted that CMS were the biggest consumer of media, and this meant compression of roughly half the media to double density and buying only enough B-drives to look after CMS, which would be 8-10 drives. It was noted that they couldn't share drives now that CMS was contained in T10KB drives in one structure. This was the cheapest solution, bearing in mind that beyond 2011 we would probably need to move on again, and C-drives would likely be available.

DB advised that it was difficult to see too far ahead, and with the C-drive we would need to replace both drive and tape. It was proposed to continue using A-drives for all except CMS, and re-evaluate this strategy at the end of GridPP3. AS advised he could pencil in a C-drive purchase into the plan in order to prepare for the future.

DB asked whether DC or GP had any comments. GP noted that it was difficult to comment without seeing concrete numbers, but he thought the philosophy sounded fine. DC noted that we would need to ensure they weren't right up against maximum, as they would need headroom on bandwidth. DB advised that the purchase was flexible if the funds were there; that it was a work-in-progress and details would be circulated.

Delay of £1m
------------

DB advised that the other concern was how to handle the issue of delaying the £1m hardware purchase - he was working on scenarios. DB noted that it should be possible to delay, yet still ramp up the hardware. He would circulate info.

3. EGI-JRU Meeting Summary
===========================

RM reported that there had been an EGI JRU meeting last Thursday in the UK. Status updates were provided, and the Agenda for the Council Meeting on 9th July was previewed. NG and AR were attending the meeting, in Amsterdam. NG had gone through the draft Agenda for this at the EGI JRU meeting. DB noted that he had a pre-meeting with NG. DB reported that EPSRC had endorsed NG's UK position.

Outstanding issues related to Chairman, funding (Nikhef is the common fund administrator), budget, MoU, acting Director. They do have a quorum in order to launch the overall organisation. Although the UK had not yet signed at the time of the JRU meeting, JICS were expecting to do so at the start of this week (in good time before the 9th July Council Meeting), but countries would, in any case, be able to turn up on the 9th with signatures in order to get voting rights. The acquisition of funds would commence in October.

DC noted that he had attended the recent meeting on SSCs (Specialised Support Clusters) at which they also had a presentation on EMI (European Middleware Initiative). The EU would be contributing funds in response to funding calls. RM noted that the gLite Consortium had also been discussed, but the middleware area was still confused and no decisions had been taken.

RM advised that the EGEE early bird registration had been extended to 10th July. The EGEE review had gone well, all deliverables had been accepted, and a report was awaited. EGEE '09 would take place in Barcelona from 21-25 September. During EGEE '09 there would be a federation meeting on the Friday morning.

4. Week's Notes
================

- BDII and WMS failover to GU during R89 move

DB asked how the failover had gone. PG noted no negative reports; it seemed to go quite well. TD also noted it had seemed fine, but there had been no high submission rate of jobs.

- Final call on User-Level accounting and other security policies

DK advised that he had circulated three security policies, with a final call for comments. DK advised that any final objections should be lodged. The Accounting Data one was the one with most changes, in relation to multiple accounting data centres, and the document had been re-worded. The other one with most changes was to do with Publication Policy - some policies should not be published and the document had been re-worded. The closing date for major objections was 14th July.

ACTION 352.1 PG was asked (in lieu of JC) to note to dTeam that the revised security policy documents were on their final call - deadline for major objections 14th July.

- AHM abstracts by Friday (10th): Status?

DB noted that he was preparing an abstract in conjunction with JC. TD advised that Sam Skipsey was preparing one on data management, and Stuart Purdie was also preparing an abstract. DC noted he was preparing one on STEP.

- New Oversight Committee

DB had circulated a note of the new Oversight Committee members.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS reported as follows:

Fabric:
1) R89 migration has completed. Service is already transferring data and as our downtime expires we will naturally become available again.
2) Acceptance tests of the FY08 delivery are underway.
3) The Tier-1 drives and media have been installed in the new robot which will now provide the tape service.
4) The site needs to carry out a major network upgrade (to remedy problems they have had for over a month); this is scheduled for Tuesday 7th July.
5) There will be a number of interventions on the HEP robot this week while the non-HEP robot moves to R89.
6) R89 machine room cooling is working well and the environment is stable.

Staffing:
1) The first experiment support post has been accepted and is making progress. The second post has interviewed and an offer is being prepared.
2) The EGEE PPS recruitment has been re-authorised and advertising is commencing.
3) The YII student (funded by ESC) will start this month.
4) The CASTOR d/b admin has started.

Service:
Other than critical components we were down last week.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ was absent this week.

SI-3 CMS weekly review & plans
-------------------------------

DC noted that nothing was happening on the Tier-1 front; Tier-2 was ok and had passed tests; there had been downtime at Imperial so CMS didn't use them; RALPP had a few minor problems; all quiet at present.

SI-4 LHCb weekly review & plans
--------------------------------

GP reported as follows:

1. Separation of DIRAC services between different machines has begun. The aim is to improve performance and the ability to support a larger number of jobs. Discussion on our proposed model of hardware requirements with CERN/IT this week.
2. Failed jobs at Bham-HEP, possibly related to problems with the software area on some worker nodes (GGUS ticket 50000 submitted).
3. Problem with grid certificates/CA on worker nodes at Cambridge (GGUS ticket 50001 submitted).
4. Possible resolution to problems transferring files from Sheffield to RAL - different worker nodes using different glite versions. Will upgrade all nodes over the next 3 weeks there.

Outlook:
1. Waiting for operations to restart at RAL.
2. Waiting for production operations to restart at LHCb.
3. Chaotic user analysis.

SI-5 Production Manager's Report
---------------------------------

By PG in lieu of JC:

1) Lancaster and Sheffield have also produced reports on STEP09, and many of the other sites reported their performance at last week's Storage Workshop (http://hepwww.rl.ac.uk/sysman/June2009/agenda.html) that followed the HEPSYSMAN meeting. The common theme was the lack of sufficient internal LAN bandwidth to the disk servers, causing inefficient jobs.
Many sites implemented network channel bonding, which helped, but it is likely that solutions based on 10GbE will be required in the future. Concern had been expressed by Glasgow that larger sites may require more than 1Gbps connections to the JANET network.
2) An EGEE broadcast last week pointed out that the GridPP VOMS service running at Manchester is one of 9 still running obsolete glite 3.0 based software. This will stop working shortly when a newer version of the VOMS-core client (1.9.x) is released. (It is now passing through the certification process and will be ready for release in 2-4 weeks).
3) We have run our second week with regional on-duty operations and again all went well.
4) A problem with some data not being exported out of the APEL archiver started on 27th June. As a result many sites are failing the SAM APEL publishing test. Accounting publication has been temporarily disabled. Christiania is on holiday so this may take some time to fix. (https://cic.gridops.org/index.php?section=cod&page=broadcastretrieval&step=2&typeb=C&idbroadcast=41683)
5) The timetable for the WLCG STEP09 post-mortem workshop is here: http://indico.cern.ch/conferenceTimeTable.py?confId=56580. Our current plan is to have some of the Tier-2 coordinators present. We have yet to confirm who will attend from the Tier-1.
6) The Tier-2 coordinators have been reminded that the quarterly reports need to be prepared and will be reviewed at the dTeam meeting on the 14th July.
7) LHCb have two outstanding issues at Birmingham and Cambridge: GGUS tickets 50000 and 50001.

SI-6 LCG Management Board Report
---------------------------------

DB reported as follows:

Nothing problematic in weekly operations report.
Big ID problem potentially solved - MB RAL cream-CE? used by anyone other than ALICE?
ALICE dataflow numbers - sent to AS.
30-June: Quarterly reports from ALICE, ATLAS and LHCb (UK 1/3 of CPU in quarter). Also, SAM test did not show RAL is down.

SI-7 Dissemination Report
--------------------------

SP was absent this week. DB advised that a CERN Press Release had been issued on July 1st. DB would ask Neasan O'Neill about the GridPP one.

REVIEW OF ACTIONS
=================

332.1 AS to provide a plan for the tape drives: this was being finalised this week. AS reported that his documents had disappeared and he would have to rework this. He would hopefully have it by Wednesday. Done, item closed.

341.5 JC to investigate how EGEE VOs request resources (relates to enabling more VOs in the UK). JC reported that there had been a change of responsibility within EGEE - he was checking on the best person to ask. Ongoing.

346.3 PC & AS to speak to Robin Tasker regarding receipt of high-level or ticket information from JANET on the service. Ongoing.

348.2 JC to investigate whether the decrease in the job success rate metric in the last quarter is due to time-outs at busy sites or to job aborts caused by incorrectly set up environments. This was still in progress - he needed to extract data but was busy with STEP at the moment. Ongoing.

350.2 DC to investigate the possibility of submitting an abstract to the AHM. DB would contact him. Done, item closed.

350.5 JC to check and verify that the contact list on the GOCDB is up-to-date - to be done by September. Ongoing.

351.1 DB to submit OPN Backup Link Document to STFC for consideration prior to the next OC. Done, item closed.

351.2 PMB to send comments to PC over the next week regarding the LHC Network Forward Look document. Done, item closed.

351.3 ALL: drafts of all reports for the OC to be sent to DB by 10th August (project status papers) and 17th August (all other papers). Ongoing.

351.4 DB to contact Dave Wallom and Andy Richards regarding a contribution to GridPP23 and report back.
DB reported that he had contacted them - a joint NGS & GridPP session was not possible due to the NGS Summer Schools and NGS/EPSRC meeting taking place at the same time - people were unavailable. The idea had been abandoned temporarily but might be revisited for GridPP24 at RHUL. Done, item closed.

ACTIONS AS AT 06.07.09
======================

341.5 JC to investigate how EGEE VOs request resources (relates to enabling more VOs in the UK). JC reported that there had been a change of responsibility within EGEE - he was checking on the best person to ask.

346.3 PC & AS to speak to Robin Tasker regarding receipt of high-level or ticket information from JANET on the service. AS had sent a reminder.

348.2 JC to investigate whether the decrease in the job success rate metric in the last quarter is due to time-outs at busy sites or to job aborts caused by incorrectly set up environments. This was still in progress - he needed to extract data but was busy with STEP at the moment.

350.5 JC to check and verify that the contact list on the GOCDB is up-to-date - to be done by September.

351.3 ALL: drafts of all reports for the OC to be sent to DB by 10th August (project status papers) and 17th August (all other papers).

352.1 PG was asked (in lieu of JC) to note to dTeam that the revised security policy documents were on their final call - deadline for major objections was 14th July.

352.2 PG to raise at dTeam the issue of HEPSPEC06 benchmarking and corrected accounting records.

AOB
===

TD asked about the Quarterly Reports. He noted this was also the end of Quarter for Tier-2 accounting, and asked about progress relating to updating the accounting records for consistency. PG advised that not much was being done at present. TD noted that this issue needed to be dealt with fairly soon and should be raised.

ACTION 352.2 PG to raise at dTeam the issue of HEPSPEC06 benchmarking and corrected accounting records.

The next PMB would take place next Monday: July 13th at 12:55 pm.

Agreed further dates were as follows:

July 13th - PMB
July 20th - cancelled
July 27th - PMB
Aug 3rd - cancelled
Aug 10th - PMB
Aug 17th - cancelled
Aug 24th - PMB
Aug 31st - cancelled (UK Bank Holiday)
Sep 2nd - special PMB to address issues re OC meeting
Sep 7th - F2F at Cambridge