GridPP Deployment Board Minutes 05 - 10th September 2009 Cambridge
==================================================================

Present: Steve Lloyd (Chair), Jeremy Coles, Phil Clark, Roger Jones, Alessandra Forti, Pete Gronbech, Duncan Rand, Andrew Sansum, Stuart Wakefield, Tony Doyle, Dave Britton, Graeme Stewart, Raja Nandakumar (remote), Glenn Patrick (remote), Dave Kelsey, John Walsh, Pete Watkins, Dave Colling, Derek Ross, James Catmore, (Suzanne Scott, Minutes)

In attendance: Sarah Pearce

Apologies: Andy Richards, Andrew Sansum

1. Minutes of Previous Meeting
===============================

These were approved with no amendments.

2. Actions and Matters Arising
===============================

01.15 DC to provide updated LondonGrid MoU. It was reported that a meeting would be held soon, and it would be agreed then. DR reported that there was a draft MoU coming soon. It was reported that another iteration was happening - a new version should go out shortly. ONGOING.

01.17 PC to provide updated ScotGrid MoU. This had been updated and circulated, and it was being sent to Finance Departments. TD advised that the GridPP MoU could be signed off. ScotGrid could sign the GridPP MoU. It was agreed to check the right version was on the website - this was to be downloaded and signed. It should be scanned and sent to SL. Done, item closed.

01.18 PW to provide updated SouthGrid MoU. PG reported that they were changing the figures and this needed to be followed-up. Done, item closed.

It was reported that NorthGrid was done but needed to be modified in relation to EGEE/EGI. New action on RJ to do.

ACTION 05.01 RJ to provide an updated NorthGrid MoU (requires to be modified in relation to EGEE/EGI).

Re: accounting issues at sites:

04.01 GS to do the same analysis for another month. It was noted that this had been overtaken by events, and had not been done.

04.02 GS to check Manchester and Glasgow figures. GS was unsure what this action related to.

04.03 RN to look into LHCb figures. This was unknown. RN was not present for the moment.

04.04 dTeam to get everyone benchmarked. This had almost been done; 70% was completed. The results would be published on the wiki.

04.05 JC (therefore dTeam) to investigate what other countries do (re CPU shares and priority resources) and come back to the Deployment Board with a proposal. JC reported he did not have much information. DC advised that every country handles this differently, there were no hard and fast rules. SL noted that we had held back 20% of our resources and we had some Tier-3 for local analysis. RD suggested that we prioritise ATLAS UK VOMS roles. GS noted that there was a country flag in Panda - anyone in the ATLAS UK VOMS group could be flagged. SL wasn't sure that this needed to be done here. GS suggested a request to sites to map ATLAS UK users to a different group, they would then get a different share. RJ suggested mapping to 2 groups? GS noted no, not for fairshares - if the priority share was full then they could run a normal share, but jobs must run as a different primary group. DC noted they hadn't discussed it for CMS. RN noted for LHCb, for UK users there were no specific priorities. GS advised they can raise the job priority in Panda. dTeam to try and sort this out, at Glasgow first.

ACTION 05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda).

04.06 JC to report-back on the discussion of sites turning-on their user level accounting. Done - sites had been switched on.

04.07 DK to check that all is up-to-date in terms of GridPP Security Policies - email DB. If there are any issues, DK to let DB know. DK reported that the GridPP Security Policy phase was ongoing at present, however other policies had been approved by LCG. ONGOING.

04.08 TD to discuss organising a storage meeting with Jens Jensen in order to ensure a storage meeting does take place ~June '09.
It was reported that this had been organised, and held in July. It had been useful to do. Done, item closed.

04.09 DB to digest the ILC info and circulate, in order to disseminate and to address problems identified. Done, item closed.

04.10 SP to contact Robin Tasker re GridMon update (it was noted that this was also a PMB action). SP reported that discussions were ongoing - they had effort at Glasgow to work with Mark Leese on GridMon, but this had been difficult as no response had been received from Mark Leese for some time. ML was currently working on a database. ONGOING.

04.11 Re future network requirements, dTeam to look into iperf as an alternative backup, and report-back. It was felt that this was re-inventing the wheel, but it was ongoing.

There were no other matters arising.

3. Accounting and Benchmarking Issues
======================================

SL advised that we needed to divide up the next tranche of hardware money, but parameters were difficult. Although formulas were defined, the HEPSPEC issue had to be taken into account. It was agreed that everyone should get benchmarked. If the log still existed, it was possible to do accounting in the new units, for the accounting period they wanted to use. PG noted that for Durham, they used Q308. Bristol wanted to use a different one, and not all sites wanted to use the last 3 quarters. SL asked whether logs existed everywhere? PG noted no, for the last three quarters probably, but not for Q208, which was better. DC commented that people should not be disadvantaged. SL reminded that at the PMB, decisions had been made regarding this - the following is an extract from the PMB Minutes:

-----------------
DC suggested that setting up a sub-group to handle this specific issue would be useful. DB agreed, proposing the four tier-2 co-ordinators plus a couple of senior people (including SL) to moderate - there should be a proposal to go to the Deployment Board. DC proposed the following course of action:

1. specint them all
2. calculate what we can
3. adjust the ones we can't
4. compare the adjustment with those who haven't done it properly
5. if within 10% then ok
6. set-up a sub-group comprising JC, SL and the four Tier-2 Co-ordinators
7. agree timescale

Figure was £400k this financial year, from STFC. One month only could be allowed for convergence, as time was short - proposed date was Friday 16 October. SL advised that it should not advantage sites who can't do it. Decisions should be referred to the PMB. This was all agreed.
------------------

It was agreed that SL and JC should go through the process with the sub-group of 4 x Tier-2 Co-ords. GS advised that for Durham, they had estimated the HEPSPEC06 value for the old cluster, they had taken the raw figures from APEL, and using a HEPSPEC estimate they got the figure for Q3. In general, if there were lost logs, people could take the raw CPU from APEL. PG advised that it was simpler to use the measured HEPSPEC06 value, transfer to specint and compare to what was published, and use the new value x hours. SL suggested taking a site that had logs, and applying the method of Durham to the SAM numbers, and see what the difference was. GS compared Glasgow to Durham and noted that there was consistency. They had already done this. DC affirmed that they could trust the nominated group to do the fairest job possible. SL commented that the Co-ordinators needed to be happy about it. CPU was under control however they needed to carry out this exercise over the next few weeks. DK asked whether sites knew to keep logs? PG noted yes, but if a cluster was changed and a new CE installed, you lose the data.

SL asked about disk? This was more difficult. The proposal was to use declared disk. DC advised that declared disk was the best that we could do. SL noted that 'available' disk was what was required. PG suggested using the same spreadsheet as last time. There was an estimated figure of cost and disk.
DB noted that hardware numbers were maintained for planning purposes by him - but it was still a guess, they will have changed from SL's spreadsheet. SL commented that we needed to know a) the money we had, b) the kit we wanted, c) the price per kit estimate - these were the three figures required. TD noted that we had stated on the MoU that we would do a review at this point.

SL advised that last time the numbers were multiplied by the SAM test result - we wouldn't be doing that this time. TD noted that there should be an obligation on sites to be available and deliver - it was the best global metric. ATLAS and CMS could give a different metric - it was up to DB to determine what that should be, and if it involved a change from the MoU then we needed to declare it. SL proposed not to multiply by the SAM tests.

SL also noted the matrix of sites vs experiments - last time the experiments were marked for Institutes. There was a discussion as to whether experiments should populate this matrix, rather than SL. RJ noted that, in the same spirit as SL's tests, the issue of what we have at the moment captures experiment needs - but there was need to get approval. SL suggested that this would give a moral obligation for sites to support experiments they weren't on. PG cited political implications of this for funding of posts. SL noted it was up to the experiments - it would be an overlay of 0s and 1s. TD suggested that SL give the current matrix to DC, RJ, and RN, and they say 'yes' or 'no' - this was a sensible default. DC noted however that it might not be restricted to 0s and 1s, eg: if they bought a fraction of RHUL and if they used 25%, does that equate to a share of 'x' on the resources? PG advised that resources at sites were bigger than GridPP will have paid for - they need to be careful about giving a percentage of CMS funds and then wanting a 25% share.
TD advised that the rate was the going rate for CMS - it was the allocation that was made - there was a sum total that comprised the CMS requirement. DC suggested that SL send the spreadsheet round with the matrix in - they could all see what the numbers were, and what was reasonable. RJ suggested that this was not that different in principle to what they did last time with the Tier-2s. SL noted however that this was a redistribution across the Tiers. DB advised that both experiments and sites needed to be happy, and agree, but the issue should be experiment-driven. It was proposed that SL provide the numbers and get them ratified by the experiments - this was the way to go forward.

DB noted that the right number of sites was fewer than we currently had. SL agreed that for GridPP4 we needed to do things differently. DC commented that there were sites that had done well, and had received very little money - they would receive money this time round. DC suggested that we see the figures first, then negotiate.

SL advised that he would need the numbers from sites over the next few weeks. DB reported that wLCG needed the information by 28th September, so DB needed to attend to this now - he had to finalise things within the next week. SL suggested 2nd October for a meeting of the sub-group? A deadline date had to be set for provision of the numbers. SP wanted to check that what we were doing was giving experiments 'carte blanche' on how to spend money, with the agreement at sites? DB noted that no, this did not mean carte blanche - we present the numbers and they will moderate/adjust as required, but without major changes.

It was agreed that the sub-group, comprising SL, JC, and the four x Tier-2 Co-ordinators, would meet on 2nd October to discuss the figures and follow the action plan as agreed (see points 1-7 above, with the note following).

ACTION 05.03 The Accounting & Benchmarking sub-group, comprising SL, JC, and the four x Tier-2 Co-ordinators, would meet on 2nd October to discuss the figures and follow the action plan as outlined above (see points 1-7 reproduced below, with the note following).

-----------------
DC suggested that setting up a sub-group to handle this specific issue would be useful. DB agreed, proposing the four tier-2 co-ordinators plus a couple of senior people (including SL) to moderate - there should be a proposal to go to the Deployment Board. DC proposed the following course of action:

1. specint them all
2. calculate what we can
3. adjust the ones we can't
4. compare the adjustment with those who haven't done it properly
5. if within 10% then ok
6. set-up a sub-group comprising JC, SL and the four Tier-2 Co-ordinators
7. agree timescale

Figure was £400k this financial year, from STFC. One month only could be allowed for convergence, as time was short - proposed date was Friday 16 October. SL advised that it should not advantage sites who can't do it. Decisions should be referred to the PMB. This was all agreed.
------------------

4. SL5 Deployment
==================

SL asked if there was a summary of the current situation? GS reported that ATLAS were happy to move to SL5 with gLExec disabled. TD noted that they wished to retain a fraction of SL4. GS confirmed they would retain 100 cores on SL4 at Glasgow. PG advised that most sites were now changing. DR noted that most sites will not leave SL4, but would test SL5 first, then move eventually. GS reported that there were problems with the older releases, 12 didn't work, 13 could work on SL5. James Catmore noted that most users used lxplus to start with and maintained a small number of cores - standard lxplus would point at 5 from October. [Subsequently this was cancelled]. GS reported that at Glasgow next week they would swap over, and Durham would move mid-October.
In the current plan, ATLAS would move to SL5 build kits by the New Year. SW advised that CMS would lose some of their build machines - they may have to move over to SL5 build machines as soon as possible. DC advised that they didn't want to support SL4 builds for a long period. SW noted that in any case it didn't work with DPM. DB read out the 'official' CMS position statement on SL4 up to the end of 2009. There was a general discussion about the technical issues involved in releases, and the use of SL5, also timescales involved.

5. Deployment Issues from the Experiments
==========================================

ATLAS
-----
- RJ reported that in relation to calibration data for ATLAS, they were talking to the Tier-1 about this, and were asking ATLAS sites to deploy a squid cache.
- SL5 was also an issue at present.
- they were hoping to exercise data movement before first data.

RJ noted that these were all important issues, but were not showstoppers. JC reported that the collaboration expected basic plots to be ready quickly at the Tier-2s.

CMS
---
- SW reported that high priority for them was the role for the Tier-2, at Brunel, Imperial etc.
- they were reserving Tier-3 capacity.
- reliability was an issue.

LHCb
----
- RN reported that nothing major was occurring - they were waiting for deployment of new disk, many other sites had run out of disk space.
- storage at the Tier-1 was a concern.
- re SL5, they plan to move by mid-next week and be fully certified.

GS noted that ATLAS wanted a clean software area for SL5, but he knew this was not the case for CMS - what about LHCb? RN noted not yet for LHCb - SL4 was running in compatibility mode, they would leave the software as it was at present.

Other
-----
- GP reported no major issues, they still had continued usage from MINOS; ILC was active again; MICE were on the horizon - but there were no big deployment issues at present.

SL commented that 'other' experiments were now seriously using the resources - things had become easier for them. GP agreed, noting much progress from a year ago. DC reported that he had attended a MICE meeting with Janusz Martyniak to discuss the computing model. GP noted no SL5 issues generally.

6. Deployment Issues from the Tiers
====================================

London
------
- DR reported that for Brunel, they were moving into a new data centre, hopefully in around 6 weeks' time.
- LeSC had now closed down.
- IC-Hep had upgraded their network switches to 10GB ram; they were advertising for a new SysAdmin.
- QMUL had moved to Storm and Lustre, there was the possibility of a WAN upgrade.
- RHUL were looking to move their cluster but there were networking issues, the new date was the New Year.
- At UCL they were discussing rationalising the sites there.

TD asked whether, once the allocations were done, we could put forward recommendations about how sites deploy SSDs? Oxford were keen to do SSD in different configurations on behalf of GridPP - they could be given money in advance of the allocation. SSDs were worthy technology but did we want to deploy them on Worker Nodes? PG said no, it was too expensive. There was a discussion about the possibility of this, and testing. TD noted that the proposal at the end was to produce a proposal document for different strategies. SW asked whether GridPP would get together with a vendor to test? TD advised that at the level of ~£2k it was not an issue; and they could also slow things down. SL noted that it was the wrong time to do this. SP reported that an estimate of £400k was asked for this FY out of the Tier-2 money. TD observed that we could make an allocation of £25k for such a purchase - a document would come from Jens Jensen and the Storage Team by December, giving recommendations. PG stated that it would be good to do the test, but a specialised Worker Node with SSD increased the cost.
DC asked if the cost were roughly £100 per slot? TD asked whether it was a sufficiently large cache? PG advised that if the cost and complexity were both increased, it might not work very well. DR advised that we needed to think about what we were trying to optimise - was it analysis efficiency? Job reliability? SL asked whether a proposal would come? TD noted yes - the request would come from the Storage Team to the PMB.

NorthGrid
---------
- AF noted that there weren't many issues to report. Liverpool were doing well, but the kit was getting older and there were problems with the water cooling, however it was in check at present.
- Sheffield were doing well despite their size, they were involved with ATLAS.
- Lancaster were awaiting new kit/new machine room; they were doing reconfiguration with the new switch to improve throughput.
- Manchester had reduced storage but were hoping to tender shortly.
- All sites were moving to SL5 in October.

RJ reported on a longstanding issue of support at Liverpool - they extend support to ATLAS from Lancaster. They were starting a tender for kit next Spring, £1.2 million was going into the new machine room. SP noted that Management meetings were problematic - they had failed the metric most quarters. There was a discussion about shared clusters and the pros and cons.

TD asked about the memory requirement for the next procurement? We needed to give a recommendation. DC advised that 1.5 GB per core was the recommendation - they reviewed it within the last 6 weeks - 2 GB per core was still the CMS value, they specified memory per job slot rather than per core. TD noted that it was a good general recommendation, then, memory per job slot rather than per core.

ScotGrid
--------
- GS reported that he needed to think about what they wanted from Durham and how to use it.

TD noted that Phenogrid was a substantial use of Durham, did we acknowledge this? SL noted yes, it was worth 90% in the formula.

- GS reported that ECDF was now working well, it was the most successful of shared clusters to date, until the money ran out.

PC advised that after the next hardware funding round they could determine the fairshare. GS noted that GridPP had to pay for what it used. DC asked how it compared to, say, the Glasgow cost? PC noted that it was about twice the cost, but next time there will be more CPU power. SL advised that it was the real cost, all others were subsidised. PC noted that we would need to come up with some other metric. ECDF was more heavily used than RAL at the moment.

GS advised that the Storage Group were thinking about trial deployment of ARC in Scotland, possibly using Glasgow as a primary repository of data - they had contacts in NorduGrid who were keen to help. They could do 'proof of concept' at Glasgow, then if this was ok, possibly at Durham. DR asked how many sites in the UK would it be applicable to? GS answered he didn't know yet, he would need to do the Technical Report and Recommendations.

PC advised that we would need the benchmarking numbers by the 28th (all the Tier-2 have to do this). TD noted we also had to synchronise the HEPSPEC publishing. PG noted that we need to suggest a date when we publish the new values - from 1st October, for example, everyone has to use HEPSPEC06.

** It was agreed that 1 core was equivalent to 8 HEPSPEC06 **

ACTION 05.04 dTeam to publicise the 1st October as the changeover date to HEPSPEC06.

ACTION 05.05 DB to co-ordinate the pledge report to wLCG once all is agreed. It was noted that for 2010 the date would revert to April again.

SouthGrid
---------
- PG reported that sites were relatively stable during the period; Oxford now has dual links.
- for Bristol, PW noted that Nick comes to management meetings, there has been slow progress, they have increased their fraction but they are a small contributor only.

Tier-1
------
- DR reported that they were deploying Quattor incrementally.
- there had been a funding call to port the Aquamon database.
- SL5 migration would start on Monday.
- there were disk hardware issues which were ongoing.
- the PQQ for the next round had been issued; they were testing streaming.
- CASTOR was stable, the BIG ID problem was now fixed, 2.1.8 issue was ongoing.

TD asked about recommendations for hardware - was that channel to the Tier-2 still open? PG reported that things do trickle down via Martin Bly. JC commented that they were blogging more now. TD asked if there were any issues? DC confirmed no. PG noted that you can tell from the Hepix talks too - information was available.

TD asked if there was any advice about hyperthreading from the Tier-1? DR noted no, not as far as he knew. There was a discussion on hyperthreading issues. It was understood that sometimes the Tier-1 was caught out with issues and this also affected those who followed Tier-1 advice. TD asked about stacked network recommendations? PG reported that a lot of people follow the Tier-1 and go with Nortel, but no particular issues had arisen.

7. AOB
=======

PW asked about Tier-2 and local usage issues? RJ noted this was a Tier-3 issue. TD noted that 80% of what was in the MoU was committed to LCG. 20% was held behind in principle. PW was talking about UK analysis. RJ noted that they prioritised UK access. AF advised that at Manchester, users ask for storage (for Tier-3). RJ noted that in ATLAS and CMS, stuff for the UK was Tier-3. SL advised that we needed to define this on the GridPP website, for space tokens etc (UK usage of resources). SL noted that a one-page explanation would be useful. TD added, or pointer to the experiment pages.

ACTION 05.06 RJ, DC, and RN to provide to Neasan O'Neill a link to each of the experiment's information on UK usage of resources. These links to be added to the GridPP website by Neasan O'Neill.

** It was agreed to note VALUES: 12 HEPSPEC equated to 1TB

1 ksint2000 to one-third TB was the OLD BALANCE

NOW:
1 ksint2000 equated to 4 HEPSPEC
4 HEPSPEC equated to one-third TB
12 HEPSPEC equated to 1TB

There was no other business.

ACTIONS AS AT 10.09.09
======================

01.15 DC to provide updated LondonGrid MoU. It was reported that a meeting would be held soon, and it would be agreed then. DR reported that there was a draft MoU coming soon. It was reported that another iteration was happening - a new version should go out shortly.

Re: accounting issues at sites:

04.03 RN to look into LHCb figures.

04.07 DK to check that all is up-to-date in terms of GridPP Security Policies - email DB. If there are any issues, DK to let DB know. DK reported that the GridPP Security Policy phase was ongoing at present, however other policies had been approved by LCG.

04.10 SP to contact Robin Tasker re GridMon update (it was noted that this was also a PMB action). SP reported that discussions were ongoing - they had effort at Glasgow to work with Mark Leese on GridMon, but this had been difficult as no response had been received from Mark Leese for some time. ML was currently working on a database.

04.11 Re future network requirements, dTeam to look into iperf as an alternative backup, and report-back. It was felt that this was re-inventing the wheel, but it was ongoing.

05.01 RJ to provide an updated NorthGrid MoU (requires to be modified in relation to EGEE/EGI).

05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda).

05.03 The Accounting & Benchmarking sub-group, comprising SL, JC, and the four x Tier-2 Co-ordinators, would meet on 2nd October to discuss the figures and follow the action plan as outlined above (see points 1-7 reproduced below, with the note following).

-----------------
DC suggested that setting up a sub-group to handle this specific issue would be useful.
DB agreed, proposing the four tier-2 co-ordinators plus a couple of senior people (including SL) to moderate - there should be a proposal to go to the Deployment Board. DC proposed the following course of action:

1. specint them all
2. calculate what we can
3. adjust the ones we can't
4. compare the adjustment with those who haven't done it properly
5. if within 10% then ok
6. set-up a sub-group comprising JC, SL and the four Tier-2 Co-ordinators
7. agree timescale

Figure was £400k this financial year, from STFC. One month only could be allowed for convergence, as time was short - proposed date was Friday 16 October. SL advised that it should not advantage sites who can't do it. Decisions should be referred to the PMB. This was all agreed.
------------------

05.04 dTeam to publicise the 1st October as the changeover date to HEPSPEC06.

05.05 DB to co-ordinate the pledge report to wLCG once all is agreed.

05.06 RJ, DC, and RN to provide to Neasan O'Neill a link to each of the experiment's information on UK user help. These links to be added to the GridPP website by Neasan O'Neill.
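
Note on the agreed conversion values: the figures recorded in these minutes (1 ksint2000 equated to 4 HEPSPEC, 1 core equivalent to 8 HEPSPEC06, 12 HEPSPEC equated to 1TB) can be sketched as a short Python calculation. This is an illustration only, not a tool used by the Board. The function names are hypothetical, and the reading of "if within 10% then ok" as a relative difference against the measured value is an assumption, since the minutes do not define how the comparison is made.

```python
# Conversion factors as recorded in these minutes.
HEPSPEC06_PER_KSI2K = 4.0   # "1 ksint2000 equated to 4 HEPSPEC"
HEPSPEC06_PER_CORE = 8.0    # "1 core was equivalent to 8 HEPSPEC06"
HEPSPEC06_PER_TB = 12.0     # "12 HEPSPEC equated to 1TB"


def ksi2k_to_hepspec06(ksi2k: float) -> float:
    """Convert a CPU figure in ksint2000 to HEPSPEC06."""
    return ksi2k * HEPSPEC06_PER_KSI2K


def cores_to_hepspec06(cores: int) -> float:
    """Estimate HEPSPEC06 from a core count using the agreed per-core value."""
    return cores * HEPSPEC06_PER_CORE


def balanced_disk_tb(hepspec06: float) -> float:
    """Disk (TB) that balances a given CPU capacity at 12 HEPSPEC per TB."""
    return hepspec06 / HEPSPEC06_PER_TB


def within_tolerance(estimated: float, measured: float,
                     tolerance: float = 0.10) -> bool:
    """Assumed reading of 'if within 10% then ok': relative to measured."""
    return abs(estimated - measured) <= tolerance * measured


if __name__ == "__main__":
    # 1 ksint2000 is 4 HEPSPEC06, which balances one-third of a TB.
    print(ksi2k_to_hepspec06(1))
    print(balanced_disk_tb(ksi2k_to_hepspec06(1)))
    # A hypothetical 100-core cluster at the agreed per-core value.
    print(cores_to_hepspec06(100))
```

The per-core figure is an agreed average, so a measured HEPSPEC06 benchmark for a specific cluster would take precedence over `cores_to_hepspec06` where one exists.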