GridPP Deployment Board Minutes 06 - 16th April 2010 RHUL
=========================================================

Present: Steve Lloyd (Chair), Jeremy Coles, Phil Clark, Roger Jones, Alessandra Forti, Pete Gronbech, Duncan Rand, Tony Doyle, Dave Britton, Raja Nandakumar, Glenn Patrick, Dave Kelsey, Ian McArthur (pp Pete Watkins), Dave Colling, Derek Ross (Suzanne Scott, Minutes)

In attendance: Sarah Pearce

Apologies: Graeme Stewart, Andrew Sansum, James Catmore, Stuart Wakefield, John Walsh, Andy Richards

1. Minutes of Previous Meeting
===============================

The Minutes of the last DB meeting, which took place on 10th September 2009 at Cambridge, were accepted.

2. Actions & Matters Arising
=============================

ACTIONS AS AT 10.09.09
======================

01.15 DC to provide updated LondonGrid MoU. It was reported that a meeting would be held soon, and it would be agreed then. DR reported that there was a draft MoU coming soon. It was reported that another iteration was happening - a new version should go out shortly. ONGOING.

Re: accounting issues at sites:

04.03 RN to look into LHCb figures. Done, action closed.

04.07 DK to check that all is up-to-date in terms of GridPP Security Policies - email DB. If there are any issues, DK to let DB know. DK reported that the GridPP Security Policy phase was ongoing at present, however other policies had been approved by LCG. ONGOING.

04.10 SP to contact Robin Tasker re GridMon update (it was noted that this was also a PMB action). SP reported that discussions were ongoing - they had effort at Glasgow to work with Mark Leese on GridMon, but this had been difficult as no response had been received from Mark Leese for some time. ML was currently working on a database. ONGOING.

04.11 Re future network requirements, dTeam to look into iperf as an alternative backup, and report-back. It was felt that this was re-inventing the wheel, but it was ongoing. ONGOING.
05.01 RJ to provide an updated NorthGrid MoU (requires to be modified in relation to EGEE/EGI). ONGOING.

05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda). ONGOING.

05.03 The Accounting & Benchmarking sub-group, comprising SL, JC, and the four Tier-2 Coordinators, would meet on 2nd October to discuss the figures and follow the action plan as outlined above (see points 1-7 reproduced below, with the note following).

-----------------
DC suggested that setting up a sub-group to handle this specific issue would be useful. DB agreed, proposing the four Tier-2 Coordinators plus a couple of senior people (including SL) to moderate - there should be a proposal to go to the Deployment Board. DC proposed the following course of action:

1. specint them all
2. calculate what we can
3. adjust the ones we can't
4. compare the adjustment with those who haven't done it properly
5. if within 10% then ok
6. set-up a sub-group comprising JC, SL and the four Tier-2 Coordinators
7. agree timescale

Figure was £400k this financial year, from STFC. One month only could be allowed for convergence, as time was short - proposed date was Friday 16 October. SL advised that it should not advantage sites who can't do it. Decisions should be referred to the PMB. This was all agreed.
------------------

Done, action closed.

05.04 dTeam to publicise the 1st October as the changeover date to HEPSPEC06. Done, action closed.

05.05 DB to co-ordinate the pledge report to wLCG once all is agreed. Done, action closed.

05.06 RJ, DC, and RN to provide to Neasan O'Neill a link to each experiment's information on UK user help. These links to be added to the GridPP website by Neasan O'Neill. Done, action closed.

3. GridPP4 Analysis Performance Accounting
===========================================

SL advised that we need to discuss how to count things in GridPP4 - there will be two hardware tranches and a manpower review after year 2. Metrics were being driven by the experiments. DB advised that parameters had to be calculated. PG noted that a 'tweaking' factor would be required for small sites. DB advised we would consider 'value for money' - we would look at the efficiency and see if it was an issue or not - efficiency would be taken into account.

PC noted there would be bigger sites next time. DC added that bigger sites have bigger energy and more cooling requirements etc - they put in more money therefore we couldn't penalise them. TD noted that in 2007, Manchester weren't contributing to LHCb, but now that was no longer the case. SL suggested we shouldn't revisit the past - what would we do for the future?

PC noted that he liked the numerator (delivered resource / GridPP funding) argument. There was a general consensus about this. SL asked however what you could do with that number? IM suggested a bonus system. SL thought that the difficult bit was analysis performance.

ATLAS
-----

RJ noted that the process agreed should be defined by 2011 as the accounting year was 2012. PG agreed that we need to define the rules before the accounting period. RJ advised that sites should deliver for the client - how you measure that is another issue. PG thought however, that if you don't know how you're being measured then it isn't fair for the site.

RJ advised that there wasn't a well-defined solution yet - delivered performance should be measured, but if sites don't do well, then jobs won't go to them. We need to look at site performance and delivered capacity. For ATLAS, if it's in a space token they can see it, then they can ask whether it is performing in the right way.
The HammerCloud tests see if a site can deliver the capacity required, which relates to rate sent and fraction etc. What sites have to follow is what goes on in the Tier 1, 2, and 3 jamborees. ATLAS will be collating in both passive and active mode, and will define its own metrics. ATLAS ADC will define the relative metrics.

SL asked if this would be measured on ability? RJ advised it should be an equilibrium situation, but there needs to be measuring of availability, performance, throughput etc. RJ noted that what was in HammerCloud at present was a good indication of what will be used in the future, but the metrics don't change - the same quantities will be looked at in a different way with a different probe.

PC asked if this was wallclock I/O time etc? RJ advised it related to analysis rates and throughput: efficiency (job failure rate), event rate, fraction within the cloud (available capacity measurement) - all combined into a single score. PC asked how frequently they would run HammerCloud? RJ noted he didn't know yet; a hard HammerCloud run could be intrusive.

TD advised that we need to know the survey period in advance. SL commented that sites should be ready from now on, at any time. The period in question should be the entire year before the next allocation. DK noted it had to be seen to be fair.

DC asked if this meant there would be no additional hardware money until 2013? TD confirmed that the experiments had said to leave it till later - the hardware allocation would be made in two years' time, April 2012 - we would need to organise this in 2011 at the latest. Now that we know the LHC schedule, this has driven it to the calendar year 2011, but we continue to be constrained by the LHC schedule. SL also noted that we can't test for too long, as the metric will change.

It was agreed that the accounting period would be calendar year 2011 with the first allocation in April 2012.
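As an illustration only of RJ's description above (efficiency, event rate and fraction within the cloud combined into a single score), a minimal sketch follows. ATLAS had not yet defined its metrics at the time of the meeting, so the equal weights, the reference event rate and every name in the example are assumptions made for illustration, not the ATLAS ADC definition.

```python
from dataclasses import dataclass


@dataclass
class SiteSample:
    """One hypothetical test result for a site (all fields are assumptions)."""
    jobs_total: int
    jobs_failed: int
    event_rate_hz: float   # events processed per second during the test
    cloud_fraction: float  # site's share of the cloud's analysis capacity, 0..1


def efficiency(sample: SiteSample) -> float:
    """Fraction of jobs that succeeded, i.e. 1 - job failure rate."""
    if sample.jobs_total == 0:
        return 0.0
    return 1.0 - sample.jobs_failed / sample.jobs_total


def single_score(sample: SiteSample,
                 reference_rate_hz: float = 10.0,
                 weights: tuple = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Combine the three quantities into one number between 0 and 1.

    The event rate is normalised against an assumed reference rate and
    capped at 1; the weighted sum with equal weights is an arbitrary choice.
    """
    rate_term = min(sample.event_rate_hz / reference_rate_hz, 1.0)
    terms = (efficiency(sample), rate_term, sample.cloud_fraction)
    return sum(w * t for w, t in zip(weights, terms))


if __name__ == "__main__":
    sample = SiteSample(jobs_total=100, jobs_failed=10,
                        event_rate_hz=5.0, cloud_fraction=0.2)
    print(f"efficiency = {efficiency(sample):.2f}")
    print(f"score      = {single_score(sample):.3f}")
```

Whatever form the real score takes, the point made in the discussion still applies: the inputs and the survey period need to be known to sites before the accounting period starts.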
CMS
---

DC advised that they had talked about this - one of the metrics under consideration was that each Tier-2 site gets a certain amount of credit for what it does. Disk space would be allocated appropriately as they support different analysis groups. Over time the groups will migrate to good sites, and away from bad ones. Another metric under consideration was throughput, however this was complicated and they needed to be careful about it. Throughput and CPU efficiency should be calculated in a way that can be agreed on in the UK - there was no equivalent in CMS of the HammerCloud tests.

DC advised that credits were being done by the CCRB according to a formula - there were 42 FTE credits it could give out - good sites would get the analysis groups. DB asked about the possibility of Bristol getting back in? DC confirmed yes, however they have asked not to at present.

LHCb
----

RN advised that analysis didn't take place at the Tier-2s significantly, however if a site wants to, they have a set of requirements to satisfy, but it is the site that is expected to provide additional resources. The primary issue was manpower - the work had to be done at the site. TD noted we could hold back the fraction. SL asked if any more needed to be discussed re LHCb? There were no other issues.

JC asked when sites would get feedback? SL advised that the performance was being monitored now. JC asked if there would be a quarterly review of site performance? TD confirmed we did need to check performance in relation to metrics, this was also part of the information to the experiments. SL noted that we needed to report things that were relevant. SP considered it good to work with the experiments in order to improve metrics.

SL asked what the metrics were for? Not all sites were in CMS or ATLAS? The HammerCloud tests also didn't show up on the Project Map. SP asked if we wanted to introduce a site summary for each site? TD thought it might be better if the sites did this.
SL noted that ATLAS had to do it - ATLAS give the performance, then the sites have to argue it, if they disagree. RJ commented that sites should be able to self-monitor. SP suggested we start with this quarter, Q210. PG asked if we could have the URLs of tests being done? This would give info on what was being measured.

ACTION 06.01 RJ/Graeme Stewart to provide URLs of the place(s) where info is located re ATLAS site tests and measurements (so that sites understand what they're being measured on).

4. Framework Policy
====================

JC had circulated a suggested policy. The main aim was not to fall foul of a bad release. This needed to involve many of the sites, and we had to agree the timetable for how this would work. DC noted that this was less of an issue now that we had fewer big bang releases; JC noted that there were obviously different levels of importance for releases. The key was to be neither the first nor the last to release. The variability of site releases provided a protective effect if a release caused problems - 'Resilience through apathy'.

It was noted that there were several aspects to the policy - for example, dealing with the middleware when it came out, defining when sites should upgrade etc. Effectively the staged process was happening already. DC considered that upgrading was generally well managed already by the dTeam. DB noted the flowchart to be followed and asked whether agreement was required? There was no conflict between what was being proposed and what usually happens.

JC asked if people were happy with what was written? It was management also, not just the release process that counted. He also asked about the certification process? There ensued a discussion on DPM and SL5. DB commented that we don't want to be dragged into a certification process with which we are not involved.

SL asked if there was any objection to this Policy? It didn't seem to be in conflict with what is already being done? It was agreed.

5. Regional Monitoring
=======================

SL reported that since the last DB meeting, we had moved to this - were there any issues? JC advised that they were still fixing bugs. PG noted that the timescale was difficult, however it was running at Oxford at present. JC advised that they thought it would stay in Oxford but it might not due to NGI. DB noted that we are an NGI and we are funded, however this wasn't an institute issue.

SL noted there weren't really any issues to discuss re Regional Monitoring. PG asked if everything was going ok at present? JC reported that one site had said that email clarity wasn't good; other problems had been implementation of the Nagios service itself.

6. Storage
===========

SL asked if there were any issues generally that needed to be discussed here? TD thought we should consider procurement: if we improve on analysis efficiency, we should go to RAID 0 on the WN disks in order to allow throughput at reasonable cost. The question was, to what extent within analysis at the Tier-2s over the next few years will there be random access mode? RJ noted he couldn't reply to this - he thought it was the default mode.

DC reported CMS were moving to many more cores, possibly 24 cores, therefore accessing a couple of disks makes a difference. PG thought the more cost-effective solution was to go to two disks and RAID them; this didn't need any work on the part of the experiments. DR commented that it was all up in the air at present, with no firm conclusions.

TD also noted memory requirement per core was nominally 2 GB per core - should we increase this? DC advised that moving from 4 cores to 6 cores was a big increase in cost. TD commented that on the storage nodes, there was the potential to use large storage nodes with 2 x TB drives, giving good throughput - were there any conclusions about this? PG advised that people who had put 10 Gigabit in weren't using all that bandwidth.
The other option was multiple channel bonding, which was more cost-effective. DC noted this depended on disk efficiencies and how you configure the RAID controllers. DR thought we should wait for a few months and see how it all works out.

7. Deployment Issues from the Experiments
==========================================

For ATLAS, RJ noted nothing to report - disk was being put in place at present.

For CMS, DC noted the only thing to report was as discussed at the PMB.

For LHCb, RN advised that the main issue was that of uploads failing generally, and it would be good to follow this up. He requested JC to follow-up - 50% failure rates were unacceptable. JC confirmed that Glasgow were following this up at present, it was happening at Glasgow, also at Sheffield, Brunel and Lancaster. TD noted that this was the argument against having all different SRMs - it was difficult to pinpoint the problem in this configuration.

For 'Other' experiments, GP reported that nothing much was happening; GP would need to check with them re their future activity, but there were no known problems at the moment.

8. Deployment Issues from the Tiers
====================================

LondonGrid: DR advised that the staff issue had been resolved. At Imperial the SysAdmin post was going well - work had been done on dCache and tuning. At Brunel the new machine room looked good, there were no real issues there. QMUL were looking good - an extra SysAdmin should be on the way. RHUL had moved their cluster from Imperial to the RHUL campus, all was ok, and they were looking to bring it back online. There was a networking issue, as it was shared with the campus. UCL Central was currently being phased-out - SL should not include UCL in his plots now.

NorthGrid: AF reported that apart from what was discussed yesterday, there were no further issues to discuss. SL asked about management meetings? RJ would check with the other members.
ScotGrid: PC reported that all was running well - ScotGrid Durham was delivering a lot of CPU, but there was loss of staff - Durham had lost one member of staff and was about to lose another in July. ECDF were using tranche 2 funds now and were regularly exceeding their share - they had managed to get £30k from the University to purchase extra storage, and they would get billed by share now. At Glasgow things were going well, they were delivering a lot of CPU, new kit this September will add a Petabyte which would triple the cluster. In summary, PC reported that there were only 3 sites but they were regularly delivering 10 million hepspec hours per month. JC reported that biomed had been dropped as they don't respond to shares.

SouthGrid: PG reported that Birmingham, Bristol and Cambridge had upgrades of disk and CPU. Oxford was stable, they were going to tender for new kit and were doing production work. RAL had lots of kit but the air-conditioning issues were ongoing.

Tier-1: DR reported that CPU tenders were going out; they were setting up testbeds for services; they were working on Quattor; there was access to the Farm for non-LHC VOs; they were setting up a queue for ATLAS; ALICE was going to be using the CREAM CE; there were top-level BDII issues and they were looking at deploying gLite 3.2.

9. AOB
=======

TD asked about the continuation of the Deployment Board itself? It was agreed to carry on until after the PPRP meeting outcome. DB suggested that a meeting of the PMB and dTeam might work instead, as issues overlapped. PG thought it would be useful for passing on info. DB advised that we should discuss this further at GridPP25 at Ambleside, before changing to another type of meeting. There were no other issues.

The next Deployment Board meeting would take place at Ambleside (GridPP25: 23-26 August 2010).

ACTIONS AS AT 16 APRIL 2010
===========================

01.15 DC to provide updated LondonGrid MoU.
It was reported that a meeting would be held soon, and it would be agreed then. DR reported that there was a draft MoU coming soon. It was reported that another iteration was happening - a new version should go out shortly.

04.07 DK to check that all is up-to-date in terms of GridPP Security Policies - email DB. If there are any issues, DK to let DB know. DK reported that the GridPP Security Policy phase was ongoing at present, however other policies had been approved by LCG.

04.10 SP to contact Robin Tasker re GridMon update (it was noted that this was also a PMB action). SP reported that discussions were ongoing - they had effort at Glasgow to work with Mark Leese on GridMon, but this had been difficult as no response had been received from Mark Leese for some time. ML was currently working on a database.

04.11 Re future network requirements, dTeam to look into iperf as an alternative backup, and report-back. It was felt that this was re-inventing the wheel, but it was ongoing.

05.01 RJ to provide an updated NorthGrid MoU (requires to be modified in relation to EGEE/EGI).

05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda).

06.01 RJ/Graeme Stewart to provide urls of the place(s) where info is located re ATLAS site tests and measurements (so that sites understand what they're being measured on).