GridPP Deployment Board Minutes 02 - 27th June 2008
===================================================

Present: Steve Lloyd (Chair), Jeremy Coles, Phil Clark, Graeme Stewart, Roger Jones, Alessandra Forti, Pete Watkins, Pete Gronbech, Dave Colling, Andrew Sansum, Derek Ross, James Catmore, Raja Nandakumar, Glenn Patrick, Dave Kelsey, Tony Doyle, Dave Britton

Apologies: Duncan Rand, John Walsh, Andy Richards, Stuart Wakefield

1. Minutes of Previous Meeting
===============================

The Minutes were accepted.

2. Actions & Matters Arising
=============================

01.1 AS to nominate a Tier-1 Technical representative for the Deployment Board. Done, item closed.
01.2 SL to devise a new ATLAS test. Done, item closed.
01.3 DC to devise a new CMS test. Ongoing.
01.4 RN to devise a new LHCb test. To be advised if feasible.
01.5 SP/SL to look at Metric 4 x 7 regarding CPU delivered and determine a sensible level that will flag what is obviously wrong. Done, item closed.
01.6 Re the Tier-2 delivering to the LCG MoU (Metric 4 x 11), JC to look into this issue and provide recommendations. This is now Metric 4 x 12 and recommendations have been provided. Done, item closed.
01.7 Re network monitoring and quarterly reporting, JC to speak to Mark Leese. Ongoing.
01.8 Draft Policy on killing jobs now adopted as formal - to be uploaded, and reviewed annually by DTeam. TD to send on the information to SL, who will create a webpage. Done, item closed.
01.9 TD as Technical Director to highlight to Ganga that they should document issues in relation to error messages to users. Ongoing.
01.10 JCatmore to begin to devise a comprehensive list of error messages that would assist users. This takes over from the former PMB action noted below. JCatmore reported that he had been in touch with Johannes over this and a wiki page had been agreed; meantime Ganga version 5 has been released, which has solved some problems but generated more.
JCat will be working on this over the next few weeks and will provide links. Ongoing.
[277.8 User Experience 'Team C': SB, SP, SL, with input from JC to deal with the issue of user experience and design of an easily-found lookup facility for grid error messages. SL reported that he had started the ATLAS wiki page and would circulate the url. SB was leading this with inputs from SP, SL and JC where needed. A new simple summary was required of all areas available plus a lookup/links facility, for the OC to review. This would include a list of most recent types of problems (possibly a 'top 12' for users - what the error means and the course of action to follow). SB to progress this. It was noted that James Catmore (via the DB) had volunteered to do this. This action is therefore transferred to SL for progression via the Deployment Board. Done, item closed.]
01.11 DC to send a list of urls to Stephen Burke relating to CMS analysis on the Grid. It was noted that SB was no longer the contact point for this. Would SB still maintain an error messages list? JC was not sure. DC to email SB to check what he wanted and did he get it? SL noted that this whole area has no owner now. DB advised that it would be discussed at the PMB. Ongoing.
01.12 JC to report-back from UKQCD in relation to what was technically feasible. JC had reported to UKQCD but had received no word back from them until two weeks ago. They are currently trying to implement them on a cluster. Ongoing.
01.13 GP to report-back on the status of MICE. GP has not received any recent report from them, but they have started using resources. AS reported that he had had contact from them - he had requested a clear statement of what was required at the Tier-1 in terms of architecture, but had received no definite response. GP noted that he had asked them to write a brief summary of their intentions. It was noted that we needed better contact with them. Ongoing.
01.14 TD to update the Tier-1 hardware status dates to July (rather than April) to highlight the damage that the funding shortfall is causing. Done, item closed.
01.15 DC to provide updated LondonGrid MoU. Nothing had been provided to SL. This was metric-dependent now. Ongoing. Ditto for 1.16-1.18.
01.16 RJ to provide updated NorthGrid MoU. Nothing had been provided to SL.
01.17 PC to provide updated ScotGrid MoU. Nothing had been provided to SL.
01.18 PW to provide updated SouthGrid MoU. Nothing had been provided to SL.
01.19 TD to send documents relating to the MoU and Tier-1 hardware to SL for uploading to the website. Done, item closed.
01.20 DK to email the DB with all draft security policy documents. DK noted that there had been no new documents generated since the last meeting - some documents would be available shortly. Done, item closed.
01.21 JC to upload talks to the GridPP20 website. The meeting was unsure as to the relevance of this. Item removed.

3. Hardware Resources
======================

3.1 Accounting Period
----------------------

SL asked about the accounting period - was it 2Q08 to 1Q09, or 3Q08 to 2Q09? One proposal was to start at the original date of 2Q08 and extend it to 2Q09 - this would involve compromise, as 2Q08-2Q09 was 5 quarters.

RJ noted that this was unhelpful as we had not yet signed the MoU and it had not been discussed at the NorthGrid Management Board. AF agreed. RJ suggested that once they had the meeting, he would report back and we could start monitoring from when the kit was in place. DB advised that this might cause complications at other places in relation to resources and shares, phasing-out of equipment etc - we would need to know these issues. RJ noted that if we take note of delivery in Q2 then we should not be penalised for delivery before Q3 or Q4, but if there was good performance in Q2 and a 'disaster' later, we should take account of the prior 'good' period. DB suggested taking the best 4 out of 5 quarters?
GS noted that this was a bad idea, as we were aiming for 95% availability and only measuring 60% - if we measure 4 quarters from Q4 this year to Q6 next, then if a site does badly, this takes account of previous success. SL noted that a formula was required.

JCat noted that it would be preferable if the end date were sooner rather than later. TD agreed, as the end date might become too difficult regarding the next allocation. SL noted that, historically, at the last DB, 3Q08-2Q09 had been changed at the PMB - grants being issued were beyond our control. SL noted that 4Q08 was also difficult. JCat raised another issue, that of site diversity - it was difficult for sites if they were sharing a resource and did not need to buy large amounts of kit.

TD advised that we should stick with the previous DB decision, despite STFC difficulties - change was difficult in terms of convincing people, unless there were a long-term good reason. SL noted that most of the equipment being counted will not be new equipment - this discussion was being driven by monies added now.

DB suggested using 4 out of 5 quarters, which was 80%, and using earlier performance if required. GS noted a difficulty if a site was working ok and all five quarters were good - they would lose a quarter. DB noted that if there were scheduled downtime, that quarter did not have to be included. AF noted agreement with 4 out of 5; weighting the quarters was too complicated.

TD suggested taking the period 08Q3 to 09Q2, and counting 3 quarters. SL advised that we should count 1Q09 to 2Q09 regardless. TD suggested taking the best of either quarter this year, 08Q3 or 08Q4 - not both - then including 09Q1 and 09Q2 - allocations would be made this time next year for the next period. This was in response to STFC delayed funding and we have to compromise in order to accommodate their difficulties. RJ agreed, noting that this solution tries to take into account sites that would argue in opposite directions.
DB advised that Liverpool wanted 08Q2. TD suggested that we could add this. It was agreed that we would take the best single quarter of 2, 3, and 4 in 2008, and the next two - Quarters 1 & 2 - in 2009. DB summarised that this was a good idea, it addressed written objections and was a good compromise. RJ agreed, noting that if it was best out of 3 then the best quarter would be selected by the experiments.

3.2 Metrics to be used
-----------------------

JC had circulated an email regarding DTeam meeting conclusions. The points in JC's email were discussed:

1. Wallclock time: TD suggested that we should do what LCG do - what is used to say what is delivered? TD noted that used CPU is what we will be measured by. SL noted that this was in the context of the funding model. DC advised that there was an issue with Monte Carlo vs data analysis. GS pointed out that CPU time is the valid measure for better infrastructure - we need to measure the difference between sites. DC noted that there is no site with poor hardware - we get 99% efficiency with Monte Carlo, 60% efficiency with data analysis. DK asked whether 2 jobs per core would get double wallclock time? DC noted that it would be wallclock time x CPU power. TD advised that we needed to monitor the efficiency separately and see discrepancies - the bigger issue was how we handle data. JC advised that efficiency was by site by VO and there was huge variation. SL suggested that we should go with wallclock and monitor efficiency. TD noted that the allocations are broken-down by experiment; we want the relative values between sites supporting that experiment. SL asked if CPU was being agreed? TD noted that on balance it was ok as is - there was an advantage to local (UK) analysis. There was a discussion regarding disk being made available, and disk actually being used. RJ asked whether sites were actually delivering the stuff that can be used? They can't be penalised for it not being used properly.
The discussion ended with (reluctant) agreement to use CPU.

2 & 3 Storage/memory: TD noted that 'usable' disk was more of an issue - in priority terms CPU was first, disk second, I/O third - how do we account for disk and disk used, as opposed to disk available? If a site switches off its disk servers, they are not 'available'. TD suggested that if they were in place, and connected, tested, and could simply be switched on, then they were available. JC should calculate what a non-running disk server accumulates in terms of power? SL summarised by saying that disk 'deployed' and in the system was agreed by the meeting. TD asked the meeting to note that procurement should be on the basis of 2GB per core.

4 Site availability/reliability: TD noted that the procedure was that requests should go to JC for approval. If it has been approved by DTeam, the site is taken out of accounting for that period. SL suggested that if a site is being taken out of the accounting period, it should come to the DB for ratification. It was agreed that JC should report to the DB as part of its normal meetings - a specific DB meeting would not be required.

GS noted that the last time, we counted VOs equally, and we should be encouraging sites to provide CPU to LHC via a weighting system. RJ noted that it was the PMB who should decide the weights. DB advised that SL has a matrix of site vs VO in relation to allocating Tier-2 hardware resources; this would enable monitoring of delivery - sites should not be penalised for supporting other VOs. TD suggested 100% credit for complying with the allocation, 50% elsewhere, 0% otherwise. SL noted that it would be 100% for ATLAS, CMS etc, less for Biomed. DB noted that we didn't want sites disabling VOs. GS advised that they get credit for running anything - if CMS can't use it and they fill-in with other jobs, they shouldn't get weighting for that. PC noted that this was more about the management of fairshares at sites, and it was dangerous to penalise diversity.
GS noted we should avoid sites running other VOs in preference to LHC VOs. It was agreed that the weighting should be 1.0 for LHC; 0.9 for HEP; non-HEP (eg Biomed) would be weighted 0.75.

3.3 Tier-2 Purchasing
----------------------

It was noted that there was an action from the PMB to set up a website to assist with purchasing, giving pointers to spec etc (eg: sites have to have 2GB per core). DC noted that this was not an official CMS or LHCb recommendation. DB noted that this issue had come up at the PMB in relation to disk being 40TB, and there was a need for general guidelines. RJ asked if this was not discussed at HEPSYSMAN? TD noted there had been no generic answers - the Tier-1 is 10TB, Tier-2 was different re Monte Carlo; there was a mixed system of disk and how to manage it. Compromise may be needed as there were embedded issues. It was understood that the figure was 40TB nominally, but sites should be encouraged towards 20TB.

SL asked what the forum was for this? Solutions could be collected meantime? TD noted that people were making decisions now and information may not be being shared around - 40TB or 20TB per server depended on I/O rates. SL noted that the people who knew this information were in DTeam. JC noted issues relating to disk servers, memory per core, network links, benchmarking, bandwidth, and suggested a wiki page for recommendations and requirements, to which all could contribute. DB noted that the Deployment Board needed to action DTeam to set this up and advise others to check it before procuring. This was agreed.

ACTION 02.1 DTeam to set up a wiki page for recommendations and requirements, to which all could contribute, in relation to such issues as disk servers, memory per core, network links, benchmarking, bandwidth etc in order to assist sites with a more standard procurement - the wiki should be checked before procuring. JC to take the issue to DTeam on behalf of the DB.

4. Deployment Metrics
======================

SL noted that this issue had been discussed at length at Dublin. It was reported that SP, JC and SL had iterated, and metrics were now available. SL asked for comments. Comments should be sent to SP, JC and SL please. JC noted that the issue of the network being ok at sites was still outstanding but he could follow this up with Mark and Robin Tasker. SL would set-up a test.

ACTION 02.2 JC to progress the issue of the network at sites, and follow-up with Mark and Robin Tasker.
ACTION 02.3 SL to set-up a network test.

5. Deployment Issues from Experiments
======================================

ATLAS: RJ had discussed space tokens at Manchester - the main strategic issue was user space. An agreement had been reached yesterday re space tokens, and a distinction made between 'user space' and 'local space' (20% share for the UK 'local' community). Complaints were expected. RJ would ask sites to clean-up user areas.

CMS: DC noted problems at the Tier-1 migration with users, and allocation to the Tier-2s. The grant had come in just in time to place orders for space. DC noted that the UK will be supporting Exotics, Higgs, Susy, E-gamma etc - basically all groups with UK interest.

LHCb: RN noted that one major issue was what to do when one user/small group brought a site down, or blocked it (ie: the situation at RAL 10 days or so ago, which is solved now, but could probably happen again). The question was how to deal with such an issue in the longer term. SL noted that there should be some way of ensuring a site can't be brought down. RN noted that potential problems still exist, and they were struggling to have a foolproof system that was not susceptible to this issue. RN noted that there had been an action relating to LHCb SAM tests - which would give an indication of whether a site was up ok - this was now done. SL asked which of the tests it was? RN said he would email him. SL noted that LHCb didn't seem to be using FCR?
RN said that it would be used in DIRAC.

Others: GP noted that we had already covered UKQCD and MICE; 'NA48' want to be active at the end of this year - 5TB and 50 CPU; then in 2009 they wanted Tier-1 and Tier-2 Monte Carlo/data processing mix. ALICE have problems with the VO box at Birmingham in relation to security policy and access to single users - they will speak to CERN. BaBar were reducing their Tier-2 requirements of 1200 ksint to 600 ksint and 40TB to 30TB storage at Tier-2.

6. Deployment Issues from Tiers
================================

ScotGrid: GS noted strategic issues: success with ECDF, and they were peaking at 500 job slots for ATLAS production; they were also working with LHCb. Durham had a storage issue but they had a new storage element back online and the site should be back up soon. Glasgow had finalised their tender this week for delivery end Oct '08. JC advised that he had noted an issue re fairshares in the Quarterly Report - what was the fairshare? Was there stability of the job manager at ECDF? GS reported that it was stable at the moment but the SAM tests had failed and version SL4 needed to be tested. There was a move to a more standard job manager. GS further noted an issue with the middleware team - they have set up a share affecting VOs to ensure they get discrimination for Grid jobs.

LondonGrid: SL reported that QMUL was back up now. DB noted that QMUL was up and stable for two weeks now. (Duncan, Alex et al had put in a lot of work to achieve this). Re UCL, Jon Butterworth and the Computing Centre had been contacted directly - there was a problem of deployment of new purchases, timescale unknown, but not delivering at present. DC reported that this was in relation to the acceptance of the cluster: they can't have regular access to the test unless they accept the cluster. The issue was converging but was very slow. They have a loss of 30-40% of the cluster. DB asked who had imposed this condition?
DC advised that it had been imposed by UCL as 'use' implies 'acceptance'; there were also issues with air conditioning, therefore it started badly and got worse. It was noted that they can do the GridPP part of the acceptance test but there were shared memory machine issues etc, and it needs to pass a whole set of procurement tests. Timescale was unknown. DC further noted that the big issue at LondonGrid was effort - there were four open posts: RHUL, QMUL, Brunel, and the priority was to recruit - there were good candidates so far.

NorthGrid: AF reported that the main problem at Manchester was dCache not working; the replica manager was working, and they were installing a testbed to try on the worker nodes. All other sites seem ok. JC asked about the other issue of Sheffield manpower? AF advised that someone there was now sharing an office and in another building - this was going well so far.

SouthGrid: PG noted that all issues from last time had now been sorted out, and Cambridge was successful. Jet have had problems, issues with the CE which were solved by re-installation. Oxford did reconfigure to SL4-based CEs, therefore the accounting was broken - this would need to be fixed by the end of the Quarter. Bristol had been a success, running jobs on the HPC cluster. SL asked about ATLAS code? PG advised not yet - there were political issues. Birmingham have set up the old e-Science cluster (Midlands eScience Cluster or MESC) to run LCG jobs and the method used is a prototype for sending jobs to the new large HPC cluster (Blue Bear). JC asked about the BaBar cluster at Birmingham? PG noted it was gone, decommissioned.

Tier-1: DR noted a major issue with the machine room move - possibly at the beginning of December '08, too risky after that. Random user jobs at Tier-1 were noted as an issue; and the on call system was up and running - they were trying to reduce the number of callouts. Disk server deployment was ongoing. PG asked about storage at Bristol?
100TB was purchased - will they get half of this? Confirmation would need to be sought from Dave Newbold.

7. AOB
=======

- JC advised that they needed to take the CIC? on duty and rotate round - there was a shadow Co-ordinator CIC-on-duty distributed to the Tier-2.
- It was reported that Will Bell was leaving and Glasgow would need to recruit.

ACTIONS AS AT 27.06.08
======================

01.3 DC to devise a new CMS test.
01.4 RN to devise a new LHCb test.
01.7 Re network monitoring and quarterly reporting, JC to speak to Mark Leese. JC emailed him but no response. JC will phone (will also cc Pete and Robin Tasker).
01.9 TD as Technical Director to highlight to Ganga that they should document issues in relation to error messages to users. PW noted that they are trying to improve user support with Ganga. PW will follow this up with the help of RJ. [This action was in relation to the general thrust to get error messages onto a webpage.]
01.10 JCatmore to begin to devise a comprehensive list of error messages that would assist users. JCatmore reported that he had been in touch with Johannes over this, a wiki page had been agreed, meantime Ganga version 5 has been released, which has solved some problems but generated more. JCat will be working on this over the next few weeks and will provide links.
01.11 DC to send a list of urls to Stephen Burke relating to CMS analysis on the Grid. It was noted that SB was no longer the contact point for this. Would SB still maintain an error messages list? JC was not sure. DC to email SB to check what he wanted and did he get it? SL noted that this whole area has no owner now. DB advised that it would be discussed at the PMB.
01.12 JC to report-back from UKQCD in relation to what was technically feasible. JC had reported to UKQCD but had received no word back from them until two weeks ago. They are currently trying to implement them on a cluster.
01.13 GP to report-back on the status of MICE.
GP has not received any recent report from them, but they have started using resources. AS reported that he had had contact from them - he had requested a clear statement of what was required at the Tier-1 in terms of architecture, but had received no definite response. GP noted that he had asked them to write a brief summary of their intentions. It was noted that we needed better contact with them.
01.15 DC to provide updated LondonGrid MoU.
01.16 RJ to provide updated NorthGrid MoU.
01.17 PC to provide updated ScotGrid MoU.
01.18 PW to provide updated SouthGrid MoU.
02.1 DTeam to set up a wiki page for recommendations and requirements, to which all could contribute, in relation to such issues as disk servers, memory per core, network links, benchmarking, bandwidth etc in order to assist sites with a more standard procurement - the wiki should be checked before procuring. JC to take the issue to DTeam on behalf of the DB.
02.2 JC to progress the issue of the network at sites, and follow-up with Mark and Robin Tasker.
02.3 SL to set-up a network test.

There was no other business.

The next DB meeting would take place face-to-face at Swansea on Friday 5th September 2008.