GridPP Deployment Board Minutes 04 - 3rd April 2009
=======================================================

Present: Steve Lloyd (Chair), Jeremy Coles, Phil Clark, Roger Jones, Alessandra Forti, Pete Gronbech, Duncan Rand, Andrew Sansum, Stuart Wakefield, Tony Doyle, Dave Britton, Graeme Stewart, Raja Nandakumar, Glenn Patrick, Dave Kelsey, John Walsh, (Suzanne Scott, Minutes)

In attendance: Sarah Pearce

Apologies: Pete Watkins, Dave Colling, Derek Ross, James Catmore, Andy Richards

1. Minutes of Previous Meeting
===============================

The previous Minutes were accepted, with the agreement of a wording change in (8) AOB: changing: "JG wanted experiments to push out software in the same way that middleware was being distributed" TO "JG reported on the plan to push-out middleware". It was noted, for the record, that nothing was emerging from this anyway.

2. Actions & Matters Arising
=============================

01.9 TD as Technical Director to highlight to Ganga that they should document issues in relation to error messages to users. PW noted that they are trying to improve user support with Ganga. PW will follow this up with the help of RJ. [This action was in relation to the general thrust to get error messages onto a webpage.] TD reported that he had got in touch with Ulrik, who reported that improving the quality of error messages was ongoing as part of general improvements - things were better on the proxy side, and had also improved on the ATLAS side. For LHCb there was increased co-ordination between teams. DONE, item closed.

01.10 JCatmore to begin to devise a comprehensive list of error messages that would assist users. JCatmore reported that he had been in touch with Johannes over this and a wiki page had been agreed. Meanwhile Ganga version 5 had been released, which had solved some problems but generated more. JCat will be working on this over the next few weeks and will provide links. DONE, item closed.

01.13 GP to report-back on the status of MICE. GP has not received any recent report from them, but they have started using resources. AS reported that he had had contact from them - he had requested a clear statement of what was required at the Tier-1 in terms of architecture, but had received no definite response. GP noted that he had asked them to write a brief summary of their intentions. It was noted that we needed better contact with them. GP reported that there was better communication now, through Paul Kyberd. GP will speak to DC to check. AS noted he was aware of requirements. DONE, item closed.

01.15 DC to provide updated LondonGrid MoU. It was reported that a meeting would be held soon, and it would be agreed then. DR reported that there was a draft MoU coming soon - ONGOING.

01.16 RJ to provide updated NorthGrid MoU. It was reported that the final copy was not yet signed. They had changed the dates last time, on the DB pages, the change had been accepted by NorthGrid. RJ advised that this did not need to be re-signed - it is valid. Updates are done to wording generally as required. DONE, item closed.

01.17 PC to provide updated ScotGrid MoU. This had been updated and circulated, it was being sent to Finance Departments. TD advised that the GridPP MoU could be signed off. ScotGrid could sign the GridPP MoU. It was agreed to check the right version was on the website - this was to be downloaded and signed. It should be scanned and sent to SL.

01.18 PW to provide updated SouthGrid MoU. PG reported that they were changing the figures and this needed to be followed-up. ONGOING.

03.01 AS to investigate LFC recovery for ATLAS. AS reported that they had tested the recovery of LFC backups to the database, and it had worked OK. They were thinking about onsite distribution. SL noted that this should be part of the resilience and disaster planning. DONE, item closed.

03.02 JC to follow-up on the LHCb calendar view of downtime, to see if it would be useful elsewhere. JC reported that this had been done and ATLAS had found it useful. The GOCDB had improved the interface. GS noted that WLCG would take this on as useful for VOs but that there were still GOCDB issues. The issue itself was ongoing, but the action was done. DONE, item closed.

03.03 JC to take the issue of a UK repository to the dTeam to discuss balancing the convenience of having it, with the effort required to set it up. JC reported that he had raised this at dTeam on 9th September - there was a repository at Manchester. AF advised that there were 3 versions of SL and 3 of gLite plus external RPMs that the dTeam can use if necessary - was there wider need? JC suggested that there was no overwhelming support for a repository. TD noted that we wanted to prevent people getting bad versions of releases. It was understood that dTeam were dealing with this. DONE, item closed.

3. Accounting Issues
=====================

This issue arose out of an email discussion triggered by GS re QMUL & Glasgow numbers. GS presented on the APEL figures vs the Production Database:

- it was difficult to get figures out of the Production Database
- there is a broken CPU scaling applied to CPU seconds
- GS had tried to convert to HEPSPEC 2006
- he had looked at the fraction of KSI2K delivered

There was a discussion of the results and discrepancies. Some sites were under and some were over-reporting. The APEL figures were wrong and the results were very different from one another. GS proposed to fix the problem:

- all sites should benchmark their CPUs using HEPSPEC06
- sites should use the batch system logs to calculate their HEPSPEC06 hours
- this could then be followed-up by a cross-check with VO accounting

It was agreed that we need to get the numbers right for the next distribution of hardware.

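As a rough illustration of the calculation GS proposed (benchmark each CPU type with HEPSPEC06, then derive HEPSPEC06 hours from the batch system logs and cross-check against another source), a minimal Python sketch follows. The record fields, node type names, site names and benchmark scores are all hypothetical and are not taken from any GridPP site or accounting tool; real batch logs would need their own parser.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """One finished job as it might be parsed from a batch log (hypothetical fields)."""
    site: str
    node_type: str      # label of the CPU type the job ran on
    cpu_seconds: float  # unscaled CPU time taken from the log


def hepspec06_hours(jobs, hs06_per_core):
    """Sum HEPSPEC06-hours per site: CPU hours times the per-core score of the node type."""
    totals = {}
    for job in jobs:
        # A KeyError here means a node type has not been benchmarked yet.
        score = hs06_per_core[job.node_type]
        totals[job.site] = totals.get(job.site, 0.0) + (job.cpu_seconds / 3600.0) * score
    return totals


def relative_discrepancy(reported, computed):
    """Fractional difference between a reported figure and the log-derived one."""
    return (reported - computed) / computed


# Made-up example data, for illustration only.
scores = {"old": 8.0, "new": 10.0}  # assumed HEPSPEC06 per core
jobs = [
    JobRecord("SiteA", "old", 7200.0),
    JobRecord("SiteA", "new", 3600.0),
    JobRecord("SiteB", "new", 1800.0),
]
totals = hepspec06_hours(jobs, scores)
print(totals)
```

A positive value from relative_discrepancy would indicate over-reporting relative to the batch logs and a negative value under-reporting, which is the kind of cross-check against VO accounting that was proposed.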
PG noted that we should also go to the sites that are doing well, are well off, and check what's going wrong there. SL asked if dTeam could co-ordinate this? Could we do it for other experiments besides ATLAS? DB suggested the following strategy:

1. we do this tracking again for another month
2. we check another experiment
3. examine the sites in detail

SL commented that we couldn't debug this now. RJ reminded that this was the DB, and this was a policy problem - how it is sorted out is a dTeam issue. GS suggested that we should be using HEPSPEC06. AS advised that the way sites have been asked to deliver against the benchmark can't be applied retrospectively. TD agreed, noting that we have to work with the new benchmark, and it would help a lot. SL suggested we understand a couple of months' figures for all experiments, then apply it retrospectively if everybody agrees. The following actions were agreed:

ACTION 04.01 GS to do the same analysis for another month
04.02 GS to check Manchester and Glasgow figures
04.03 RN to look into LHCb figures
04.04 dTeam to get everyone benchmarked

4. UK Priority Resources
=========================

SL reported that we had set aside a nominal 20%, and noted there is some headroom over what we've pledged. RJ advised that space tokens were set up for disk, for CPU we could ask for queues. TD asked whether we could associate a UK attribute to individual ATLAS users? AF advised that the user has to generate a proxy specifying ATLAS UK. RJ asked how the Netherlands do this, or the Germans? Access to DGrid must have been solved? SW advised that it is set up in VOMS. It was agreed that we want to implement this but don't know the best way to do so - the question is about CPU shares.

ACTION 04.05 JC (therefore dTeam) to investigate what other countries do (re CPU shares and priority resources) and come back to the Deployment Board with a proposal.

5. Regional Operations
=======================

JC provided a brief report for info: JC reported that this meant extra work in terms of running the Grid - and taking on extra EGEE operators. JC noted changes with regionalisation (eg: GOCDB, the accounting, dashboard etc).

6. Security Policy Update
==========================

DK reported on JSPG and security policy for EGEE and WLCG. DK presented as follows:

- aim was for general common policies usable by many Grids
- tackle security problems of sites
- there were 2008 approved policies for EGEE and WLCG
- there were two draft VO policies: VO registration security policy; and VO membership management policy
- feedback on VO registration - a good response had been received from the UK
- feedback on VO membership: several VOs had concerns; there was the issue of data privacy
- user level job accounting: there was a draft Grid policy on the handling of user-level job accounting data; could sites turn-on their user level accounting? JC advised that this was being discussed next week.

ACTION 04.06 JC to report-back on the discussion of sites turning-on their user level accounting

- there was a VO Portal Policy - a new draft policy document
- other issues: EU Grid PMA AuthZ working group; identity management via federations
- future JSPG plans: meeting, policies, and revisions

ACTION 04.07 DK to check that all is up-to-date in terms of GridPP Security Policies - email DB. If there are any issues, DK to let DB know.

DK advised that all of the documents discussed were still being worked on, so now was the time for comments & feedback.

7. Deployment Issues from Experiments
======================================

ATLAS
-----

RJ reported that glexec was delayed. DB reported that SCAS and glexec were both late - there were missing elements in functionality, memory leaks were being fixed.
RJ advised that GS, Peter etc had been doing HammerCloud tests re analysis at sites, showing configuration issues - there were still things to be done but sites were responding. GS noted that space tokens had been fixed. RJ noted resources were OK. GS reported that batch systems were being worked on via changes at RAL and comparisons.

CMS
---

SW reported that core service nodes had stagnated due to Mona being on maternity leave. He cited several issues:

- there was general concern related to storage
- once per month 1000 user jobs were hitting storage at the same time, which was causing the system to fall over
- Imperial were low on storage as opposed to CPU
- CMS at the Tier-2s was considering deploying clusters, and whether DPM and dCache could scale for the Tier-2
- maintenance of systems represented a high workload

DB advised that it was GridPP policy to reduce storage options, yet more had since been added (e.g. Storm). There was clearly a need to rationalise but the solution was as yet unknown. SW commented that no-one had agreed on dCache or CASTOR. TD advised that there had been a comparative performance study reported at CHEP. DB asked whether there was something at policy level we could do? The conclusion was that we do need a storage discussion 'day' - what would be favourable timing? SL suggested that we needed the new Storm to be deployed first.

ACTION 04.08 TD to discuss organising a storage meeting with Jens Jensen in order to ensure a storage meeting does take place ~June '09.

LHCb
----

RN reported that the major issues related to known problems with SQLite, which locks on opening. This was being dealt with on a site-by-site basis. Other things to note were problems within LHCb software (they were working on it); there was a large production run soon and a new release of LHCb software (they will run a limited test beforehand).

Other Experiments
-----------------

GP reported that for ALICE the main issue was the CASTOR upgrade - they would prefer 2.1.8 at RAL. There was a large production coming up fairly soon; there were some security issues. For MINOS - they had asked for an increase in the Tier-1 allocation for 2-3 months - this was not a big issue. For T2K the issue related to tests at ISIS at RAL, and moving data to Lancaster - this had ended up being a dodgy switch at RAL. There was also the issue of the computing model for T2K. For NA62 - they were awaiting the proposal result from the PPRP, and were doing GANGA work. DB considered that we could give them what we could without compromising. GP also reported that there was old data from old experiments - around 5000 files - in the ATLAS data storage that needed to be deleted.

ACTION 04.09 DB to digest the ILC info and circulate, in order to disseminate, and to address problems identified.

The meeting broke for lunch.

8. Deployment Issues from the Tiers
====================================

LondonGrid
----------

DR reported that the biggest issue was manpower, although this was improving. Brunel, QMUL & RHUL had employed new SysAdmins, who were going through a training period at present. Other issues related to the performance of the SE; there had been a successful transition to new rooms; they were getting Storm to work OK. JC asked why the Grid had lost RHUL for five weeks? DR explained that the cluster had been down and it took a long time to work out what the problem was - ClusterVision would be assisting them soon. JC observed that the move from London to RHUL may be an issue soon. DR advised that RHUL had bought a new cluster, there had been no room for it, and space meantime was being provided by Imperial. (They were planning to attend to this after the end of the accounting period).

NorthGrid
---------

RJ reported problems with gLite; there was a management issue re the end of the Liverpool support post when the funding runs out - RJ was dealing with this. AF advised of accounting issues at Manchester. JC asked if the cooling problems at Liverpool were ongoing? RJ couldn't confirm the details. AF reported that the cooling was old and the University had agreed to replace the water cooling system; Liverpool were doing well anyway at the moment. RJ observed that the water cooled racks had been retro-fitted.

ScotGrid
--------

PC reported that hardware had been procured at Glasgow & Durham and was now in place; there was full manpower at Glasgow; there was a new position at Edinburgh, funding someone to focus on HEP - they were shortlisting at present. Durham was working well. There had been recent power issues at Glasgow, outside the control of the University. TD advised that the pre-existing water cooling infrastructure was a problem - MK and DM had been dealing with this, and there was better monitoring than we ever had before - there was a variety of modes of failure. PC reported that the setup had changed at Edinburgh and they were going back to the RAL BDII. Re ECDF - fairshare was a limiting factor. GS noted that they were investigating storage issues at Glasgow.

SouthGrid
---------

PG reported that all sites had bought hardware; there had been staffing changes at Birmingham; there was a new Deputy Co-ordinator in post. JC asked whether Birmingham had had network problems? PG noted yes, a bad network switch had brought down the cluster. There was a discussion of SRIF3 money and its successor funding body CIF - it was advised that all should check this.

Tier-1
------

AS reported the following current issues:

- closing old gLite and stopping experiments from using it
- moving to production pool accounts
- WMS02 becoming LHC only
- migration of FTS & LFC to Oracle rack planned
- move software distribution to AFS
- SL5 was up and running - they were working on testing and coming up with a schedule
- Tier-1 migration was due
- STEP was happening

AS reported the following current problems:

- there were problems with the WMS
- there was an FTS delegation problem
- 3rd party repository changes
- DNS problems
- the scale of the Tier-1 itself: the number of disk servers (350+) made configuration changes difficult - they will re-engineer fabric management over the next 6 months

Overall, AS reported that service stability was good; migration to the new machine room was now planned as the building had been 'accepted' (it met the tender spec). Two sets of consultants were currently doing machine room measurements. A decision would be made by 1st May; migration by 22nd June. AS noted that a CASTOR upgrade was pending, a test of CASTOR 2.1.8 was required - this would be a PMB decision.

TD advised at this point that we needed to review the functions of the Tier-1 Board, which had been devolved to the Deployment Board for a period of 6 months initially. We needed to ensure that the Tier-1 delivered its objectives, and this was being handled in various different ways at present. DB noted that it was better to handle issues in an ad-hoc way. For example, if we needed to decide to upgrade to CASTOR 2.1.8 then a meeting would be called to discuss this. TD agreed, noting that the Deployment Board, supplemented by ad-hoc meetings as required, was a good model for going forward. DB confirmed that it was the role of the Deployment Board to receive reports on a regular basis. It was agreed to proceed as is at present - this works well in conjunction with ad-hoc arrangements. It was noted that international views could be sought or invited as appropriate.

9. Site Performance Issues
===========================

SL noted that we had already discussed many issues in terms of storage - were there any other issues? None at present.

10. Future Network Requirements
================================

RJ asked whether this was a site issue? Or should we just upgrade the network? LAN to WAN connections? TD advised that it was usually left up to the site to decide. SL commented that we need real data from real users in order to decide this. GS noted that because of upgrades at Glasgow the LAN connection could be compromised. RJ noted also that the Tier-1 to Tier-1 traffic could be a problem. SL saw no reason to go above 1gig at the moment - and that we should wait to see what happens. TD asked if there had been any progress with GridMon?

ACTION 04.10 SP to contact Robin Tasker re GridMon update (it was noted that this was also a PMB action).
04.11 Re future network requirements, dTeam to look into iperf as an alternative backup, and report-back.

ACTIONS AS AT 03.04.09
======================

01.15 DC to provide updated LondonGrid MoU. It was reported that a meeting would be held soon, and it would be agreed then. DR reported that there was a draft MoU coming soon - ONGOING.

01.17 PC to provide updated ScotGrid MoU. This had been updated and circulated, it was being sent to Finance Departments. TD advised that the GridPP MoU could be signed off. ScotGrid could sign the GridPP MoU. It was agreed to check the right version was on the website - this was to be downloaded and signed. It should be scanned and sent to SL.

01.18 PW to provide updated SouthGrid MoU. PG reported that they were changing the figures and this needed to be followed-up. ONGOING.

04.01 GS to do the same accounting analysis for another month

04.02 GS to check Manchester and Glasgow accounting figures

04.03 RN to look into LHCb accounting figures

04.04 dTeam to get everyone benchmarked

04.05 JC (therefore dTeam) to investigate what other countries do (re CPU shares and priority resources) and come back to the Deployment Board with a proposal.

04.06 JC to report-back on the discussion of sites turning-on their user level accounting.

04.07 DK to check that all is up-to-date in terms of GridPP Security Policies - email DB. If there are any issues, DK to let DB know.

04.08 TD to discuss organising a storage meeting with Jens Jensen in order to ensure a storage meeting does take place ~June '09.

04.09 DB to digest the ILC info and circulate, in order to disseminate and to address problems identified.

04.10 SP to contact Robin Tasker re GridMon update (it was noted that this was also a PMB action).

04.11 Re future network requirements, dTeam to look into iperf as an alternative backup, and report-back.

Next Meetings
=============

It was noted that a hybrid meeting was expected to deal with the issue of CASTOR 2.1.8. A storage meeting was also expected. The next PMB F2F and DB meetings would take place at Clare College, Cambridge: 7-10 September 2009.