GridPP PMB Minutes 226 - 21 August 2006
=======================================

Present: Tony Doyle, Sarah Pearce, Roger Jones, Stephen Burke, Dave Newbold, Peter Clarke, Tony Cass, Jeremy Coles, Glenn Patrick, Robin Middleton, John Gordon, Suzanne Scott (Minutes)

Apologies: Steve Lloyd, David Kelsey, David Britton, Andrew Sansum

0. Diary Dates
===============

TD advised the meeting of the following dates to note:

PPRP-GridPP - 6 September
LCG OB - 11 September
AHM - 18-21 September
# UK e-Science All Hands Meeting 2006, Nottingham, UK, 18-21 September 2006
Tier-1/A Board - 22 September
# EGEE 06: Capitalising on e-infrastructures, International Conference Centre, Geneva, Switzerland, 25-29 September 2006
LHCC Review - 25/26 September
PPRP-GridPP - 8 November

The PPRP and LCG meetings were confirmed. On 6th September, an open presentation was required from 10-11 am (for GridPP), and a closed session would take place for 45 minutes in the afternoon. TD reported that the approach would be the same as for previous PPRP meetings - the first session was likely to be a high-level question & answer session. Prepared answers to questions would not be required until the next meeting on 8th November in Swindon. All PMB members should be at that meeting, and a set of prepared answers should be made ready prior to it. TD advised that Referees' Reports might be available before the September meeting in order to aid prepared responses, but questions were likely to be of a general nature only.

It was asked whether someone from RAL would be available on 6th September. Yes - this was pencilled in. It was noted that DN and JG were in the USA that week. Neil Geddes (CCLRC), DB, SL and TD would be available for the closed afternoon session. GP would also be available. DB was happy to present the proposal as a whole on behalf of GridPP, with team back-up.
TD also advised that during this period there will be discussions with LCG regarding requirements, which may need responses to specific questions in relation to the revised LHC schedule.

1. ATLAS deployment summary
============================

RJ summarised the current situation with ATLAS deployment. From June to August they had hoped to scale up the Tier-0 to Tier-1 transfers and initiate Tier-1 to Tier-2 transfers. This was severely constrained by the lack of available dCache disk space. For Tier-0 to Tier-1, they ended up trying to get decent throughput, but the data had a very short lifetime once it was sent to RAL - it didn't stay resident. For Tier-1 to Tier-2, the Tier-2 sites were ready to accept data; in principle, all that was needed (after the throughput test) was for the Tier-1 to register the datasets for both transfers and monitoring. This, however, didn't happen due to communication problems. The Tier-1 people did not do the registration, but a complete dataset on disk was not available anyway. At present they are working on Tier-1 to Tier-2 transfers just to prove that it can be done. The ATLAS worldwide work at the moment is production-based with regard to the upcoming exercises.

Disk space continues to be a problem. Ten days ago, 2.75 terabytes was liberated - half a terabyte was used for ALICE, but 2.25 terabytes are not available yet as an emergency recovery is being carried out. RJ added that AS was trying throughput tests with 'odd' results.

TD noted that this situation was not satisfactory - one deployment problem leads to the next and these are not being sorted out over the longer term. It is necessary to face the situation of not meeting the ATLAS service challenge requirements. Lack of disk space is also the underlying problem with regard to CMS and LHCb - TD will raise this at the Tier-1 Board meeting. DN reported that there had been a full review meeting at the last Tier-1 Board.
The PMB agreed that this situation bodes ill for the future and that the current Tier-1 disk problem will not go away - it was recognised that new disk would barely meet the requirements, and this situation is likely to continue for the next 18 months (and has already lasted over a year). RJ felt that the issue needed to be raised as a priority. JG will pass this issue back to AS, and the Tier-1 Board will need to review the experiment allocations. The PMB recognised the various failures for the LHC experiments, and noted that difficult decisions will have to be made should disk resources continue to be limited.

2. AOCB
========

None.

STANDING ITEMS
==============

SI-1 News Items and Meeting Dates
----------------------------------

SP reported that information from DN regarding CMS transfer of data was posted last week. GP had sent round an email regarding the User Board and had contacted Stuart Paterson, but there was no progress as yet. SP only needs 3-4 paragraphs plus an image from him, hopefully this week - this was agreed. Over the next week, SP will require something from DN about his experience of being the UB chair - he agreed to provide this.

SP advised that a KITE Club Research Facilities Workshop was taking place at RAL on 11th September, and she asked if anyone from RAL was going. She wished to negotiate to run a real-time monitor somewhere but needs to know if the main area in the marquee has power. RM hasn't decided whether or not he is attending, but will let SP know.

Regarding the talks for Supercomputing, SP asked if RJ had any suggestions for improvements from last year. He will let SP know.

SI-2 Production Manager's Report
---------------------------------

JC provided the following report:

1) T1 CASTOR to T2 disk tests ran to various sites last week. A good rate was seen to RAL PPD but other sites saw poor (~20Mb/s or 60Mb/s [DPM] to ~150Mb/s [dCache]) rates.
There is a suspicion that the RAL firewall (which may be "too" stateful) may be limiting rates, but there are also tuning issues being uncovered which, once resolved, will improve throughput rates. While initial transfers may succeed, the tests indicate that they frequently get stuck.

2) At the end of last week 2 disk arrays were swapped with new disks. One array initialised okay and has run tests over the weekend without issue so far. The second array failed due to a card issue (not the disks), and its disks were installed into another array, which initialised okay but is not yet tested. The full set of tests will not complete until the end of the week, at which stage we should have a better idea if the new disks resolve the problem.

3) The air conditioning equipment which has caused a few problems this summer has had its soft cutoff raised, which should mean that the machine room can now cope with an extra 4 degrees outside.

4) We are encouraging all who need to do so to reapply for dteam membership via VOMRS. It is likely that the "voms.cern.ch service, which contains old LDAP VO entries" will cease on Monday October 16th at 12 noon. This impacts ALICE, ATLAS, CMS, LHCb, DTEAM.

5) CMS T1-T2 testing (David Colling reported at UKI ROC meeting last Wednesday http://agenda.cern.ch/fullAgenda.php?ida=a063172): Main problem is not installing experiment software but keeping sites up - sites are failing frequently (this is seen at all sites around the world). We have not yet succeeded in getting T1 to T2 transfers underway for ATLAS (see discussion elsewhere).

6) The Footprints problem mentioned last Monday was due to a disk failure on the Footprints server.

7) Last week it was noted that T2 disk was rather low as seen under the monitoring page: http://www.gridpp.ac.uk/storage/status/gridppDiscStatus.html. Explanations were given. Most issues have been resolved and the disk total is back at previous levels. Edinburgh is continuing with its dCache reinstallation.
SI-3 LCG Management Board Report
---------------------------------

There was no report because there was no Management Board meeting last week.

SI-4 Documentation Officer's Report
------------------------------------

SB reported that a webpage with Use Cases on it was in the planning stage - this is the main activity at present.

REVIEW OF ACTIONS
=================

220.3 SB reported that the Workload Management System Documentation was complicated as there were four or five different documents which appear contradictory. These may be tractable at some level. TD advised that detail was not required, only a high-level overview. SB noted that the work was ongoing but no conclusions had been reached at present.

224.1 RJ reported back regarding the broadcast connection within ATLAS. Contacts were now on an operations mailing list which will have relevant personnel on it, but he felt that someone from each area should be a designated contact. Item closed.

224.2 JC reported on the EGEE operations meeting - they were revising the broadcast mechanism. Item closed.

225.1 TD will email the Tier-1 Board this week regarding DN's extension. He will also have a discussion with DK regarding the meeting agenda for September. This item is ongoing.

225.2 JC reported that deployment testing had been discussed at the last meeting and will be a regular feature from now on within the D-Team meetings. Item closed.

225.3 This item related to resolving the overall resources at Tier-1 and Tier-2 - an ongoing item as a reminder until the Tier-1/A Board meeting on 22 September. DN reported that he needed to estimate the GridPP3 planning assumptions from the experiments and figures. It was asked whether a spreadsheet was available for GridPP3. DN said the information could be summarised in a spreadsheet. DN to do. It was noted that this needs to be split by site, disk-CPU, and bandwidth. Ongoing.

225.4 GP had spoken to Stuart Paterson (see SI-1 above). Item closed.
ACTIONS AS AT 21.08.06
======================

220.3 SB to report on the Workload Management System Documentation.

225.1 TD to email the Tier-1 Board regarding DN's extension; also to talk with DK regarding Sept meeting agenda.

225.3 DN to provide summary spreadsheet of T1-T2 networking based on GridPP3 figures; the Tier-1 and Tier-2 future resources planning remains an issue until the Tier-1/A Board meeting on 22 September.

226.1 TD to raise the problems of LHC experiments' disk space at the Tier-1 Board.

Prior to the meeting closure, it was reported that Steve Traylen was moving to CERN soon, and this will affect deployment. The PMB agreed that they would all like to thank Steve, as a key person, for all the work he has done. Everyone owes a great debt of gratitude to him in what has been a 'lynchpin' role affecting every area of deployment across the whole of LCG. The PMB unanimously offered Steve their thanks for the major contribution he has made to the project.

Apologies were given for forthcoming meetings. There would be no PMB meeting on 28th August. The next PMB would take place on Monday 4th September at 1.00 pm, primarily to review PPRP preparations.