GridPP Deployment Board Minutes 03 - 5th September 2008
=======================================================

Present: Steve Lloyd (Chair), Jeremy Coles, Phil Clark, Roger Jones, Alessandra Forti, Pete Watkins, Pete Gronbech, Dave Colling, Duncan Rand, Andrew Sansum, Derek Ross, James Catmore, Stuart Wakefield, Tony Doyle, Dave Britton, Suzanne Scott (Minutes)

In attendance: Sarah Pearce

Apologies: Graeme Stewart, Raja Nandakumar, Glenn Patrick, Dave Kelsey, John Walsh, Andy Richards

1. Minutes of Previous Meeting
===============================

The Minutes were accepted.

2. Actions & Matters Arising
=============================

01.3 DC to devise a new CMS test. Done - Chris Brew sent SAM test links to SL.
01.4 RN to devise a new LHCb test. Done - RN sent SAM test links to SL.
01.7 Re network monitoring and quarterly reporting, JC to speak to Mark Leese. JC emailed him but no response. JC will phone (will also cc Pete and Robin Tasker). JC phoned, but no response. Done, item closed.
01.9 TD as Technical Director to highlight to Ganga that they should document issues in relation to error messages to users. PW noted that they are trying to improve user support with Ganga. PW will follow this up with the help of RJ. [This action was in relation to the general thrust to get error messages onto a webpage.] This is now covered by 01.10.
01.10 JCatmore to begin to devise a comprehensive list of error messages that would assist users. JCatmore reported that he had been in touch with Johannes over this and a wiki page had been agreed; in the meantime Ganga version 5 had been released, which had solved some problems but generated more. JCatmore will be working on this over the next few weeks and will provide links. This was happening, but slowly. Ongoing.
01.11 DC to send a list of URLs to Stephen Burke relating to CMS analysis on the Grid. It was noted that SB was no longer the contact point for this. Would SB still maintain an error messages list? JC was not sure.
DC to email SB to check what he wanted and whether he had received it. SL noted that this whole area has no owner now. DB advised that it would be discussed at the PMB. GP had spoken to SB. This was to be followed up by the PMB. Done, item closed.
01.12 JC to report back from UKQCD in relation to what was technically feasible. JC had reported to UKQCD but had received no word back from them until two weeks ago. They are currently trying to implement them on a cluster. The current issue was that they wished to do something with ILDG VOMS but there was no further info. Done, item closed.
01.13 GP to report back on the status of MICE. GP has not received any recent report from them, but they have started using resources. AS reported that he had had contact from them - he had requested a clear statement of what was required at the Tier-1 in terms of architecture, but had received no definite response. GP noted that he had asked them to write a brief summary of their intentions. It was noted that we needed better contact with them. DC noted that he had contact with them, but the situation wasn't good at present. Ongoing?
01.15 DC to provide updated LondonGrid MoU. It was reported that a meeting would be held soon, and it would be agreed then.
01.16 RJ to provide updated NorthGrid MoU. It was reported that the final copy was not yet signed. They had changed the dates last time on the DB pages, and the change had been accepted by NorthGrid.
01.17 PC to provide updated ScotGrid MoU. This had been updated and circulated; it was being sent to Finance Departments. TD advised that the GridPP MoU could be signed off. ScotGrid could sign the GridPP MoU. It was agreed to check the right version was on the website - this was to be downloaded and signed. It should be scanned and sent to SL.
01.18 PW to provide updated SouthGrid MoU. Ongoing.
02.1 DTeam to set up a wiki page for recommendations and requirements, to which all could contribute, in relation to such issues as disk servers, memory per core, network links, benchmarking, bandwidth etc. in order to assist sites with a more standard procurement - the wiki should be checked before procuring. JC to take the issue to DTeam on behalf of the DB. JC reported that this was now set up OK. Done, item closed.
02.2 JC to progress the issue of the network at sites, and follow up with Mark and Robin Tasker. Done, item closed.
02.3 SL to set up a network test. Done, item closed.

3. UK Infrastructure
=====================

SL advised that we need to look at the UK infrastructure and check status to see if we need anything more, or if anything needs to be addressed.

a) CA
------

It was noted that this had been discussed at the PMB. DB noted problems with the CA recently; he had contacted NGS and had received a detailed reply giving timeline info. They stated that the root causes were not their fault but it was understood that issues could have been handled better. They were setting up a Technical Advisory Group (TAG) with 'ambassadors' and technical advisers. After discussion at the PMB it had been agreed that GP would be the users' ambassador, and JC would be the technical representative on the group (dTeam will receive info from this contact). If this was agreed here, DB would contact Andy Richards. DB suggested that this was a positive move on the part of NGS - the service does answer to clients. SL asked if the meeting was happy about this? The DB agreed yes.

b) RB/WMS
----------

SL asked if this was now under control? DC advised that we can replicate it OK if another were needed. He would strongly encourage people to move to WMS. AS noted there needed to be a countdown to closure of the RBs plus instructions for users. DC advised writing an email and letting users know. JC noted that it was down to the dTeam to sort out the configurations.

c) LFC
-------

Re the LFC, SL asked whether there was much to do? AS reported that the problem with the LHCb one was now solved; did ATLAS need a second replica? RJ noted that regarding disaster recovery they needed to be able to restore to tapes. AS advised the new machine room might help, and there would be a cost relating to the Oracle licences. He would get solutions and options, and also look at prices - the end of the year would be realistic. He will speak to JG.

ACTION: 03.01 AS to investigate LFC recovery for ATLAS.

d) BDII
--------

SL noted that we had two - one at ScotGrid and one at RAL. Was that enough? TD observed that these were fairly trivial to set up - would AF be willing to set one up? TD noted that this needed to be done at Regional Tier-2 level. SL suggested that we should set up 2 or 3, possibly more if required. This was agreed.

e) UIs
-------

JC reported that every site had a UI but which one was the user using? TD advised that what we want is a test to look up the UI instances and ask whether it was the latest version - was that possible with Ganga? JCatmore noted that the UI at Lancaster was on a cluster, which had to be authenticated, so how could you test that? SL advised that most users use LXplus - once the beam starts LXplus will grind to a halt. AF advised that the problem was not the UI; experiment software was better at CERN. That's why people use LXplus to compile/test and submit. JCatmore noted that it wasn't really a site thing - people work on a local cluster, therefore software needed to be installed on a cluster but needed to be kept up-to-date. SL reported that you couldn't probe these from the outside. PC noted that Institutes will have different copies. SL advised that experiment reps should contact the responsible people at Institutes, warn them what was likely to happen, and advise them to be prepared. SL further noted that when LXplus slows down, it is then harder to set anything else up.
UIs were probably OK for experiment software but suitable arrangements needed to be in place.

f) COD
-------

It was noted that JC should ensure we know what to do regarding the regional CIC-on-duty work - this was an EGEE responsibility and we need to get people trained on how to use the CIC dashboards, tracking tickets, dealing with alarms etc. AS advised that this required half effort of each of two people, once every six weeks. TD asked if the Tier-2 Co-ordinators could contribute? JC advised that they take a turn as regional ops centre. SL asked if we needed to do it full-time for our own regions? JC also mentioned Nagios at sites, with alarms being raised if there was a problem. PG noted that the Tier-2 Co-ords object to doing this. SL asked what the load was for one person? DR advised two people's time, 50% out of a week, one week in six, doing the whole Grid. TD noted that we need to do it on a local basis all the time. SL said this equated to full-time coverage then? TD advised one week in five; this was part of SA1 and Federation commitment tasks. PG noted that the background task needs to be looked at every half hour. SL noted that this would take out one tech co-ord. JC advised that he would prefer to automate it via Nagios; if on call, people have to raise tickets. DR advised that the dashboard shows failures, which raises tickets, and tickets were based on SAM test failures. AS noted that it was labour-intensive to check the dashboard. SL suggested that it was down to the dTeam to sort this out. TD advised that it was supposed to have started in May.

SL asked if there were any other common infrastructure things missing? Meeting advised no, not that they were aware of.

4. Metrics
===========

SP reported that there were two metrics which she doesn't have measures for yet, and this needed to be discussed.

4x12 - MoU commitments - they have 3 components: 1. provide resources - CPU, storage; 2. service required; 3. manpower (effort).
SP thought we should measure this somehow. JC noted GGUS helpdesk changes - assignment of tickets to time of first response, this was not yet implemented. DB noted we could look at blacklisting, uncertification? SP noted that this doesn't tell us the same thing. SL observed that no information was given on tickets, no history, there used to be an appended message and it would be good to have what the ticket referred to - they have less and less useful information on them - relevant info should be on the email on the page. TD agreed that this would help with first level response. TD noted that this was in the MoU - required response times - but so long as we're reliable and available overall then surely it was OK? AS advised that responding to tickets also related to communication, it wasn't just about fixing a problem. SL noted that we couldn't have it as an operational measure if we couldn't measure it - we should leave it as unmeasurable until they come up with a measure and then we can see how we respond. This was agreed.

There was a discussion about 'freedom of choice' tool and sites being blacklisted. SL observed that we can measure how long sites stay red. AS noted that this did not help if someone responds to a ticket within half-an-hour and hasn't been able to fix the problem. RJ advised that experiments will generally inform sites that they're not getting responses from them.

4x14 - upgrades - SP noted that the conclusion last time was '100% of sites in Tier-2s upgrade to the timetable agreed by the DB', but we have no timetable. TD advised that this had happened recently re the upgrade from SL3 to SL4. SL asked if there was a list of what people should be on that could be checked against? In the Quarterly Report? TD advised that storage would be best to monitor, e.g. the DPM plugin. DC noted that storage upgrades had happened well during CCRC'08. DR advised that for SEs, Greig Cowan had a page with versions.
JC observed that it would be a simple yes or no answer. SL suggested that if we got 15 out of 17 would we be able to give that kind of number to SP? SP advised yes, but she didn't know how often the list was changed - we could try it for this quarter and see how it goes. This was agreed.

SP advised that the only other issue was % of disk used. TD noted this was measurable but it was the interpretation that was difficult - the measure at the top end was to do with VO ability to manage the space, which was different to underutilisation of a site. PG noted that depended on whether the site was providing what it was supposed to. SP suggested that we could measure the % over which the site is meant to be full. SL noted that this would therefore go red over 20%? SL further commented that if the site had 100TB and was broken, and only 1TB was used, what then? TD advised 'mean' utilisation of the sum of Tier-2 disk, which would give an integrated sum. SL noted therefore a lower limit of 20 and no upper limit? RJ advised that 96% occupancy of disk was something to be worried about. SL noted that this was an operational issue - therefore the metric was only looking at the bottom level. DB observed that we hadn't been anywhere near this problem, and suggested that we not worry too much about it until then - fullness of disk was monitored in a VO context, not a site context. TD also noted that it was a question of inconsistency of the computing model vs practice. SL suggested a lower limit of 20% and no upper limit. This was agreed.

5. Deployment Issues from Experiments
======================================

ATLAS
-----

RJ advised that the critical issue at the Tier-1 was CASTOR. Regarding the Tier-2, they need to get co-operation from sites re the use of storage by users, they need help to clear up dark data, and they need to clean up the scratch and group space. SL asked if there was a mechanism to do this? RJ advised communication through Operations and Jamborees.
SL asked whether these methods reached those ATLAS needed to reach? RJ confirmed yes, there was no problem with communication, effort, and response. PG and AF noted that Glasgow were using a different cloud, and this affected both accounting and money. SL advised that to some extent the current algorithm counteracts that since sites are able to choose their best quarter from three this year. There was a discussion of testing, clouds, and accounting. JCatmore noted that for first data, users would be able to submit jobs; it would be nice to know how to delete files or datasets if they were not required, from both catalogue and disk. It was agreed that overall we won't know if this is a big problem until startup, but it was felt it would be generally OK.

CMS
---

SW wished to remind sites of Tier-2 importance - there were unique copies of data only at two or three of the Tier-2s, and when they go into downtime and don't tell everyone, it causes problems. SW understood that sites had to go down, but noted that users needed adequate notice of impending downtime; asking CMS in particular whether any particular time would be convenient would be good for CMS operations. DC noted that CMS were deciding what the Tier-2 actually was: 3 sites: 2 of LondonGrid and 1 of SouthGrid; however, Bristol also had a Tier-2 label but no functionality. Brunel, Imperial and RAL PPD going down with no notice was not good. SW advised that data will be with a US site, also Asia, but it would be nice if the site gave advance notice to enable them to move data if they had to. SL asked if this could be done at dTeam? DR commented that WLCG stipulates downtime notice. DB asked how experiments should expect to get that warning? SW noted that it could be mailed to anyone in CMS UK. DB asked how downtime advance notice was currently announced? JC advised via the GOCDB, and a notice was generated. DR observed that sites like NorthGrid, ScotGrid, nominally support CMS but are not used by CMS.
CMS tend to ignore such messages as 'noise'. CMS need to target their sites with broadcasts. JC noted that the method in the broadcast system was to subscribe to specific sites. DR noted that they wanted improved communication. AS advised that the broadcasts generally didn't go to those who wanted them. DR advised that if this was highlighted to SysAdmins at sites they can forward broadcasts to CMS lists and inform users what they are doing. AS suggested that it wasn't as simple as that - the Tier-1 would probably need a broadcast tool. SL advised that information has to be put somewhere and those who want it should subscribe to it. JC advised that the broadcast tool didn't work. SW noted that he just wanted to try and get info - the Tier-2 needed to move to a service level, but it was important to let CMS know as it affected a large community, especially if there was likely to be a long downtime. JC advised that LHCb have a calendar view of downtime, which they found useful. DC noted that they were generally as ready as they could be, and were reasonably optimistic.

ACTION: 03.02 JC to follow up on the LHCb calendar view of downtime, to see if it would be useful elsewhere.

LHCb
----

PC noted that production was ramping up, DIRAC3 was being commissioned, and they were hopeful for improvement, although Monte Carlo was some way off. They were targeting UK sites re user analysis; Edinburgh was hosting a lot of LHCb data.

Others
------

JC reported that T2K wanted to be enabled on an LFC somewhere. DC reported that at Imperial they had an LFC set up for small experiments - it was a reasonably good service. SuperNemo and MICE had asked, and something had been set up for them. JC noted that he would advise T2K to use Imperial. PC asked if the LHCb model was not right at that time, what do they do? RJ advised that they should rebalance. TD noted there should be redistribution within LHCb. DB advised that we should be flexible enough to accommodate needs.
There was a discussion on fairshares and allocation to small experiments. GP noted that he allocates according to requirements and constraints, and there were allocations per quarter at Tier-1 and Tier-2. JC asked whether we were happy that the fairshares at sites were being set up OK? PG noted that this had not been an issue at sites to date. SL advised that the mechanism for addressing this would be through the Technical Co-ordinators.

6. Deployment Issues from Tiers
================================

SL asked whether there were any issues to be raised/actioned?

LondonGrid
----------

DC noted that UCL was pending, QM had manpower issues. DC didn't know what UCL were publishing. DR advised that they had the same problem as ECDF - one queue for Grid jobs but local jobs as well - the issue was about fairshares. AS asked why they weren't publishing what was available? DR advised that they were in downtime due to commissioning - they were doing Grid-based acceptance tests. SL noted that the issue was the large red GridMap box. DR suggested that it was what the local BDII was publishing. SL noted they needed to know how much resource should be available. PG advised there were two issues - one was GridMap, and the other was the Quarterly Reports. DR said he would try and sort something out.

DB asked about RHUL? DR reported that they were waiting on the sysadmin to come back from holiday, but were about ready. DB advised that Janet Seed had emailed and informed that if they don't use the funding very soon, they will lose it. If they can't get the recruitment agreed, they will need to take the funding as hardware. DC advised that there were estates charging issues relating to someone being employed at less than 100% FTE - DC was not in a position to influence the discussions. DR advised that at QM an advert was due to go out.

NorthGrid
---------

RJ reported that there was a big issue at the end of two years re posts; other than that, not much to raise - things were going fairly well in general. RJ noted that storage was moving to DPM. AF advised that dCache would remain for minor VOs.

ScotGrid
--------

PC reported that the Glasgow procurement had gone well; the Durham cluster was installed; at Edinburgh new storage had been allocated. The main issue was reliability, around 80%, which was OK; to get 90% they would need 7 days' coverage. PC reported that manpower was OK but Glasgow was down to 1 person, and recruitment for 2 posts was in hand; they had moved someone to full-time for 9 months to cover the gap meantime.

SouthGrid
---------

PG reported that the RALPPD kit was OK and good staff were in place. Re Oxford, new kit was coming; they were having difficulty recruiting a Tier-2 Co-ordinator. At Birmingham new upgrades for hardware were coming online. At Bristol a small fraction of the HPC cluster was available, and they were in a new phase of running SL5. SL asked about ATLAS? PG noted hopefully yes; at the moment it was CMS and GridPP VOs, and they were hoping to expand to ATLAS. JET have run jobs for ATLAS in the past, but when they went down, it dragged the availability figures down. SL asked whether they should be included or not? PG noted not for funding on the spreadsheets, but they do contribute to SouthGrid generally.

There was a discussion of collaboration between Institutions and within Institutions re University clusters and SRIF3 funding. TD advised that if they could devolve MC production on to a central cluster then that would be better, as it would provide separation of different classes of jobs - but this had implications for funding and devolving to central computing. JC pointed out the metric about disk promised and disk used; due to slow funding this should be raised with STFC next time as to why the percentages are down. TD noted that the metric was 'disk available'.
SL noted that we've never met the MoU requirement for disk, but it was going slowly green.

Tier-1
------

AS reported that:

Deployment/staffing levels was a big issue - they were six staff down at the beginning of GridPP3. Recruitments were underway which affected the fabric team particularly.

CASTOR downtime was another big issue - hopefully this was close to being solved.

Communication was still an issue - we need to think about back-out strategies more in relation to disaster management, i.e. the Tier-1 inoperable for ~n days, which would need a managed escalation via levels of management. This was about managing changes better, and managing incidents better.

Disk deployment was difficult at present due to the number of disk servers and the fabric team situation - they had a consultant in to look at managing the multi-CPU processors in order to help track issues.

Downtimes at the Tier-1 have different frequencies; we need to think about handling this in future - it is perceived to have too many planned downtimes and we can't have the frequency of downtimes that we have at present. SL noted this should be scheduled around data-taking. AS asked between runs? RJ advised maybe a weekend would be available now and again, but for the next few months, no.

AS noted that the last delivery of hardware was operating well. The next procurements were underway. They needed to make the LFC and FTS more resilient.

Migration was an issue - the building, originally scheduled for the end of August, has now drifted to 1st December hand-over - AS advised that JG was project supervisor and there was a 'Building Projects Group' in operation, plus the machine room manager was also dealing with things. DB asked who was co-ordinating the network etc? AS noted the machine room manager. AS reported that the machine room and ops group meets monthly, which AS attends. The detailed layout plan, cooling estimates and power plans were in hand.
DB asked whether personnel were booked in at the due time for installation? Would they be available at short notice if it was later than envisaged? AS advised that the power, cooling, and network were the main issues - the network cabling would go in two weeks after completion, which took us up to delivery point. The feedback he had at the moment was that delivery was on time. AS advised that there was a contingency plan in the ATLAS machine room; he was working on a plan to hand off to GridPP to look at - to move all the kit from 2005 onwards (to include all staff in the new building). The tender run for this was being started at the end of this month, to move around 20 racks. The Tier-1 move would take place during the two weeks when the disk is moved; the CASTOR core will be down. They were being forced to move before April '09 which was fiscal cut-off. There is a date beyond which they don't start at all. RJ advised that raw data needed to be moved. He noted that experiments could not give a hard and fast schedule - they would respond to requirements weekly. TD advised that ATLAS could decide not to send the data to RAL and not do the reprocessing there. SL summarised that we await the final document from AS, to be circulated to the PMB.

PW asked a question about on call? AS noted numbers over 10 now - some people were doing primary and secondary functions.

7. Security Issues
===================

SL asked if there was anything to discuss or bring up for attention? It was noted that AS line-manages the Security Officer - how was that working with sites? SL asked if the Co-ordinators had any comments? DR noted that he had done a good job - there were two mail lists on the GOCDB and the Security Officer used one of them. PG advised that the recent incident was handled better than the last one. SL observed that it was not always easy to understand who was meant by 'site security officer'. PG noted that we need also to send to University Security people and Site Admins.
JC added that this was also a weekend cover issue - we needed to review what was in the GOCDB. TD advised that a proposal could be endorsed to let the Security Officer go in and try and 'hack' the system to check internal security? AS advised that this was not the time to do this just now - he would prefer the Security Officer to raise consciousness generally through his work for good security practice overall. DC noted that Kostas would be happy to help - he could review procedures, do a security audit, or 'hack' in? DB asked what would happen if he did that and decided to pull the plug and close Imperial down? DB felt it wasn't the best time to do this. AS asked when was a good time? Distraction at present wasn't good. TD suggested the drafting of a Questionnaire. SL noted that this then required follow-up. TD noted that it could be discussed at the dTeam first.

SL asked about members of the Vulnerabilities Group? TD advised that they were engaged in assessing vulnerabilities. JC confirmed that they co-ordinated vulnerabilities raised within the project. AF reported that the Grid Vulnerability Group was quiet at present. AS asked if there were any areas that could be improved? What were the five key issues? A Security Service Challenge would show up deficiencies and give an assessment - this could be a planned challenge plus questionnaire. There was a discussion on SSH keys and it was noted that documentation generally needed to be improved.

8. AOB
=======

It was noted that JG had sent an email providing reasons for the proposed central distribution to worker nodes. Two issues were involved: 1) two VOs may require different WN releases; 2) when problems arise sites would get ticketed. TD advised that JG had reported on the plan to push out middleware. AF noted that this was another level of responsibility for VOs and was not helpful.
AS suggested that this would not be a helpful model - a third party would be pushing updates onto sites - a better idea would be a repository like the AFS area and sites could choose from there. TD noted it was an issue of whether you allow receipt, or go and get the software yourself. DB noted that JG didn't want experiments putting in uncertified middleware. TD responded that EGEE were proposing that this was the basis of managing this issue in the future. DB suggested that a response to JG should be that a certified version could be imposed but tickets would not be given, however the experiment won't be able to use the site - it would be better to let them use uncertified versions - at least then some work would get done. DB noted this was a 'half-way house' - it would be ideal if the worker node were at CERN and it could be used. SL advised that that's what we use at the moment. AS noted concern that sites weren't updating. DB suggested that we don't understand what problem in the UK this addresses. TD noted that the EGEE point of view was that using older versions was preventing forward momentum. DC noted that if you want a WMS you have to take the CERN repository. DB summarised that JC should respond to JG by informing him that this issue was carefully considered at the DB and JC should iterate with him and provide JG with the main issues/pros and cons. [Done following the meeting]

JC noted another issue of a UK repository, to be used as a buffer. AF reported that Manchester had one. JC noted an update recently - increasing the probability of untested software. SL noted that this might stop people putting untested software into the repository. DB advised that it was trying to make things better for the UK - a central and UK repository - however it took one person to do this and it was a difficult thing to take something from central to the UK - it would help to have a repository of stuff that was known to be good, especially for the smaller sites.
It was agreed that JC should take this issue to dTeam and balance convenience with someone having to do it.

ACTION: 03.03 JC to take the issue of a UK repository to the dTeam to discuss balancing the convenience of having it, with the effort required to set it up.

ACTIONS AS AT 05.09.08
======================

01.9 TD as Technical Director to highlight to Ganga that they should document issues in relation to error messages to users.
01.10 JCatmore to begin to devise a comprehensive list of error messages that would assist users.
01.13 GP to report-back on the status of MICE.
01.15 DC to provide updated LondonGrid MoU.
01.16 RJ to provide updated NorthGrid MoU.
01.17 PC to provide updated ScotGrid MoU.
01.18 PW to provide updated SouthGrid MoU.
03.01 AS to investigate LFC recovery for ATLAS.
03.02 JC to follow-up on the LHCb calendar view of downtime, to see if it would be useful elsewhere.
03.03 JC to take the issue of a UK repository to the dTeam to discuss balancing the convenience of having it, with the effort required to set it up.

SL noted that the next meeting would be by phone in three months' time, probably early December, in the afternoon. A date would be circulated in due course.

The DB meeting closed at 2:50 pm.