GridPP PMB Minutes 351 - 22nd June 2009
======================================

Present: David Britton (Chair), Sarah Pearce, Tony Doyle, Andrew Sansum, Robin Middleton, Steve Lloyd, Jeremy Coles, Dave Colling, Pete Clarke, Roger Jones, Glenn Patrick, John Gordon

Apologies: Neil Geddes, David Kelsey, Tony Cass

1. OPN Backup Link
===================

DB reminded the PMB that the OC had asked for an update on this, and that consideration of it should include STEP'09 breakages and any Tier-1 inputs. DB advised that a document had been devised originally by Robin Tasker and updated by PC, and had been circulated to the PMB last week. DB suggested submitting this to STFC prior to the OC. Were there any comments? AS, PC, JC and JG all confirmed the document was fine. It was agreed to go ahead and submit this directly to STFC.

ACTION 351.1 DB to submit OPN Backup Link Document to STFC for consideration prior to the next OC.

2. LHC Network Forward Look
============================

PC had circulated a document - RJ, DC, AS were to finalise this. PC had decided to finalise the document and was giving the last call for CMS & ATLAS to add anything. The document was circulated for comment and would be deposited as this year's version. DC noted that he would send a presentation with up-to-date info not yet public, but it was in line with the document anyway so no changes were required. DB noted a few typos and would send these to PC offline. JG asked whether it would be useful to extend the period to include STEP'09? PC advised that the document already contained the STEP'09 plots. DB noted that the document looked good. PC asked everyone to look at the conclusions; he noted that they should be valid for 09/10, and that this was a separate issue from the backup link for resilience. TD noted that 2GB/second might be required at Tier-2s (this was discussed at the end of the document). DB asked that the PMB send comments to PC over the next week or so.
PC would amend the document with new figures from RJ, if provided. PC noted that he felt the document should be finalised soon in any case, with or without new inputs, as the issue was always a moving target. This was agreed.

ACTION 351.2 PMB to send comments to PC over the next week regarding the LHC Network Forward Look document.

3. OC Preparation
==================

DB reminded the PMB that the next OC would take place on 15th September. DB had circulated a preparation list on 11th June showing documents and reports that were required for this meeting. DB noted that this was difficult over the summer period, but the documents required would need to be completed by end of August.

- The main document should be max 20 pages if possible.
- The Project Map and Resource Reports were required.
- The OPN Backup Link document should be included for info.
- A CASTOR report was required - separate from the general Tier-1 report. (This CASTOR report would be used to close-off CASTOR as a 'special' issue - from now on it would be included within the Tier-1 Report).
- Something on EGI/NGI/NGS convergence was required - JG agreed to co-ordinate this.

DB asked for comments? Was anything missing? TD asked about LHC schedule? DB advised that we would get an update on the LHC schedule on Sept 8th during GridPP23 - and we would take that information in to the OC meeting.

DB noted that the timeline was AUGUST - drafts of the PROJECT STATUS PAPERS should be sent to DB by 10th August (he is travelling the following week); OTHER PAPERS were due by 17th August.

ACTION 351.3 ALL: drafts of all reports for the OC to be sent to DB by 10th August (project status papers) and 17th August (all other papers).

4. PMB Dates
=============

DB advised that we should take the opportunity to reduce meetings whilst possible. DB proposed a new list of PMB dates, as follows:

June 22nd - PMB (today!)
June 29th - canceled
July 6th - PMB
July 13th - PMB
July 20th - canceled
July 27th - PMB
Aug 3rd - canceled
Aug 10th - PMB
Aug 17th - canceled
Aug 24th - PMB
Aug 31st - canceled (UK Bank Holiday)
Sep 2nd - special PMB to address issues re OC meeting
Sep 7th - F2F at Cambridge

DB asked for comments on this strategy, and the dates proposed. SP asked about a discussion regarding the OC specifically? DB noted that Tuesday 1st September was possible? RM advised that RAL was closed on that date. DB proposed the Wednesday instead? This was agreed: a special PMB meeting would take place on Wednesday 2nd September at the usual time, in order to address issues relating to the OC meeting.

5. GridPP23 Agenda
===================

DB advised that the PMB at GridPP23 would commence with lunch at 1:00 pm (to allow for travel difficulties). The meeting proper would begin at 1:45 pm and could continue on to the next day if required.

Regarding the main meeting on Tuesday - this was not focussed on users, but on final steps to LHC data, and Roger Bailey (Chairman of LHC Commissioning Working Group) had agreed to come and give a final update on LHC status. DB then proposed sessions of the four experiments (CMS, ATLAS, LHCb, ALICE) giving reports on their performance during STEP'09, with conclusions. This would be useful for sites to get co-ordinated feedback - speakers were required.

DB noted that there were two options following this: either a discussion re users/VOs, or re the future EGI/NGI/NGS/GridPP/EGEE situation - but this should not be a set of statements, rather, a discussion was preferred. Were there any comments? RM noted that as this was a meeting for GridPP users, he favoured the first idea - the discussion re users & VOs. The impending change should be seamless to end users and the background organisation of this didn't really affect them directly. JC noted that the EGI/NGI/NGS situation could be mentioned elsewhere during the day.
DB noted that we had indeed done this last time. DC commented that users don't really attend the GridPP Collaboration Meetings. It was understood, however, that it was an opportunity for users to give feedback - both good and bad. SP noted that a user session would be useful, but a discussion re NGI/NGS etc would be preferred, rather than a presentation. PC thought that the people attending wouldn't really care about this issue. SP noted that people didn't know how things would affect them over the next few years, especially at the Tier-2. DB noted that we could consider an EGEE/EGEE VO view, and get Steve Newhouse to speak. JG advised that by September, things would be well along the line regarding proposal and organisation etc. RM suggested, therefore, a 10-15 minute update about what was going into the proposal.

It was noted that a joint meeting with NGS was an option: the last two sessions could be combined, with the aim of coming to a common view. Dave Wallom and Andy Richards could be invited. It was noted that all users could be incorporated, not just particle physics users. SP thought this was a good idea. It would be possible to have a discussion on user possibilities/discussion & presentation/accommodating other practical convergence issues. DB noted that we would need to approach Dave Wallom and Andy Richards and ask if they were free on the Wednesday (9th), and see what they thought.

ACTION 351.4 DB to contact Dave Wallom and Andy Richards regarding a contribution to GridPP23 and report back.

It was noted that we should perhaps also include a talk from the host relating to local developments (home of Camont). DB noted there could be a discussion session before dinner (ref Pegasus last year). DB suggested VO day on the Wednesday - VOs using EGEE & NGS infrastructures - he would discuss this with Andy & Dave.

6. Week's Notes
================

LHCC computing review July 6th.
Andrew Pickford at Glasgow to help with GridMon.
STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS reported as follows:

Fabric
------

1) R89 migration has commenced.
   - WMS is draining
   - Batch farm is not accepting jobs
   - Suppliers arrive this morning to move CPU hardware (3 days)
   - Main service shutdown on Thursday
   Full schedule is on the blog at:
   http://www.gridpp.rl.ac.uk/blog/2009/05/14/schedulemovenewbuilding/

2) The start of our own acceptance tests has been delayed owing to the need to prioritise R89 migration preparation. This week is a difficult week and although the team plan to start the testing they may find they have no time. Load testing takes 1 month in theory, although typically 2-3 weeks more in practice. This should still leave us able to deliver the capacity before the end of August.

3) Robot installation is complete and is ready for drive installation during the R89 migration.

4) The site needs to carry out a major network upgrade (to remedy problems they have had for over a month). Proposed date is now 7th July 2009.

Staffing
--------

1) The first experiment support post has been accepted. Interviews for the second post have taken place.

2) The EGEE PPS recruitment failed and we are seeking authorisation from STFC to re-advertise.

3) The YII student (funded by ESC) is expected to start in July.

4) Interviews for the CASTOR d/b admin post have taken place.

5) A decision has been made to move ahead with Quattor as our installation service.

Service
-------

1) SAM availability last week was 100%.

2) CASTOR
   a) We are still waiting for the BIGID fix from ORACLE.
   b) A problem was identified on the LHCb instance where too few slots were dedicated to rootd.

3) STEP09 has now ended. Full details are at:
   http://www.gridpp.rl.ac.uk/blog/category/step09/

4) A fix was put into the CE to handle the 32K directory limit encountered during STEP. Unfortunately our first attempt didn't resolve the problem.
Tier-1 STEP'09 Report
---------------------

AS reported as follows:

I haven't a summary of STEP09 at the Tier-1 other than the multiple blog posts at:
http://www.gridpp.rl.ac.uk/blog/category/step09/

Key issues summarised as a very rapid memory dump.

Operations were very smooth and reliable. No callouts during STEP (other than right at the start and end). Daytime operations very calm.

- Main operational problems
  * Twice failed to accept ATLAS jobs owing to 32K directory limit problem.
  * CMS migration stalled owing to incorrect migration policy. Promptly fixed by James Jackson.
  * ATLAS tape migration crashed (cause unknown) and then jammed in an inconsistent state, preventing further migration. Migration queue built up to the extent that mighunter crashed - hard to clean up queue. Scripts now in place to automate cleanup.
  * One bigid crash took out ATLAS for several hours despite session killer.
  * A robotics problem took CASTOR tape drives offline.

- Load related problems
  * CMS misconfiguration caused 30Gb network rates.
  * Network uplink from Tier-1 to RAL limited to 10Gb (ran at 8Gb sometimes).
  * Very few end server load effects. Most servers lightly loaded.
  * Peak rates to SJ5 of 6Gb/s were seen.

- Other issues
  * Could not fill farm owing to 3GB ATLAS jobs. Situation improved after increasing memory overcommit to 50% (MAUI).
  * ATLAS job submission not sufficiently aggressive. ALICE took slots, even when at low target share/priority, before ATLAS could take up slack.
  * Tape drive performance nearer 35MB/s than 45MB/s (long range planning). CMS improved on this late in STEP but have not completed the analysis. Overall drive capacity was just about adequate but lacks sufficient margin.
  * Job efficiencies generally good (> 90%) with some exceptions.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ reported that they were in recovery after STEP'09 and there was not a lot of activity at present.
They were checking tests and had picked up London as not being healthy; this could be due to a variety of things coinciding. On the dashboard monitoring, Lancaster was low in job efficiency because of one outage due to a power cut. Glasgow was standing-in for the WMS but this seemed to be strained. Re STEP'09, more info was available, and there was a plot available to show reprocessing. During this period of reflection, a Post Mortem was planned for 1st July - a couple of reports have been done, for Glasgow and Lancaster. Many sites in the Tier-2s had to cap analysis jobs - datasets with a lot of files caused problems, and there were subscription issues. Glasgow was currently being tested as an alternative distribution point if the Tier-1 was down - it seemed to be going well. The report was on the website.

SI-3 CMS weekly review & plans
-------------------------------

DC reported that AS was also covering things that had been affecting CMS at the Tier-1 during STEP. RAL had been the second best Tier-1 and the CMS experience had been generally positive. Planning was taking place at the moment. DC would send AS the new planning figures on tape rates - which were more in line with what was generally expected. DC noted that the best performing Tier-1 had been Fermilab.

SI-4 LHCb weekly review & plans
--------------------------------

GP reported as follows:

Issues of the week:
------------------

1. Problem with CASTOR configuration at RAL - the limit on the number of connections allowed by rootd was 100, and many (>300) jobs were running off the d0t1 servers (3 of them). This led to intermittent failures of jobs. Problem seen on 16 June, identified and solved on 17 June by Shaun by increasing the limit on the number of rootd connections to 200 per diskserver.

2. Load related problem on Sheffield Tier 2 site. A large number of jobs have failed over the weekend due to possible network problems.
Site seems to have problems as they have a large number of ATLAS jobs which also are seeing problems. Banned for now pending joint investigations with the site.

3. DIRAC server found to run at its hardware limit when serving ~8K jobs. This is now because of more services coming online and more jobs (including user jobs) with shorter lifetimes and more heartbeats. Work ongoing within LHCb to solve this.

Outlook for the week:
---------------------

1. RAL down for moving hardware to new building.
2. Billion event production continues at Tier-2s.
3. User analysis jobs.

Experiment weekly reports
=========================

Experiment weekly reports can be found here:
http://www.gridpp.ac.uk/pmb/WeeklyReports/

SI-5 Production Manager's Report
---------------------------------

JC reported as follows:

1) The Glasgow team have now circulated a report on their performance and issues during STEP09 (with a focus on ATLAS). This is being used as a template for other sites to follow to review their own performance. Are there any specific questions from the PMB that each site should (attempt to) answer? One area of current concern for example is internal site bandwidth. Larger sites may have to provide additional bonded links.

DB noted that local bandwidth issues should be asked of sites, and commonality should be available.

2) The Camont tests suggest that the connection rates for the n-gram work are not prohibitive even at the desired level, so the core issue now is getting a better understanding of the compliance and legal liability issues (i.e. what happens if the web spider touches the wrong sort of website). For the latter it is really only a matter of being clear on the VO/our policy - there are many other academic projects using (often more intensive and invasive) spidering. The Camont version does not download anything and, provided it respects robot files, there should be no issues with regards to fee paying sites.
DB noted that we should recognise that this was a test and should not get mired down with lawyers and legalities - we should take experience from other people who have allowed similar academic exercises involving web-spiders.

3) We have run our first week with regional on-duty operations in place. There were a few minor glitches with the portal (run from IN2P3) such as tickets not being associated with the dashboard alarm correctly and delays in alarm states being changed (for example from New to Off). Other than these minor things (and a problem with the mail list) the new mode of operation is fine.

4) Over half the GridPP sites now have DN publishing enabled. We are gently reminding other sites that we would like them to progress this in the coming weeks. Are any of the LHC VOs already looking at user data (i.e. how urgent is this from an LHC VO perspective)?

5) The timetable for the WLCG STEP09 post-mortem workshop is here:
http://indico.cern.ch/conferenceTimeTable.py?confId=56580
Our current plan is to have the Tier-2 coordinators present (Robin has asked for costs to be managed) plus a few people who took a leading role in specific STEP09 tests. We have yet to confirm who will attend from the Tier-1.

DB confirmed that people who have made contributions in this area should be funded to go to the WLCG Post Mortem meeting - e.g. Sam Skipsey and Peter Love.

6) Once again, as new people join our operations effort there are questions about whether there is a document detailing the "lifecycle of a grid job". Unfortunately we never got that much requested architecture document and I do not believe we are in the position to write it ourselves. Would anyone in the PMB like to comment? There may be documents that have escaped our notice.

SI-6 LCG Management Board Report
---------------------------------

It was reported that Jamie's Ops Report was effectively the summary of STEP'09 - nothing bad was highlighted about the UK.
There had been a report on CMS SAM tests and a review of the last quarter - there were two views of these: one was the GridView view, which isolates things where problems are attributed to the site; the other was the CMS dashboard, which includes CMS's own issues. DB could forward the URLs.

JG gave a summary from the GDB about SL5, gLexec and CREAM - we need to take a decision on migration; experiments and sites need to test the meta-package. Oxford had volunteered, also Karlsruhe.

SI-7 Dissemination Report
--------------------------

SP reported that Neasan O'Neill was chasing-up the CERN press release - a quote had been requested (and provided) but no further info was forthcoming. If CERN could not do this, it was agreed that we would release a UK version with data - SP would circulate a draft. DB asked if the press release would fit-in with the Post Mortem workshop? Was the global picture available, or should we wait? SP noted that we could wait but the press release didn't have to contain a lot of detail. DB advised that there should also be a news item on the website. TD noted that we could easily make a UK statement, quoting general satisfaction with outcome. DB noted this had already been done.

SP reported that Neasan O'Neill was at the British Science Association Conference relating to communication via the web.

REVIEW OF ACTIONS
=================

332.1 AS to provide a plan for the tape drives: this was being finalised this week. Ongoing.

341.5 JC to investigate how EGEE VOs request resources (relates to enabling more VOs in the UK). JC reported that there had been a change of responsibility within EGEE - he was checking on the best person to ask. Ongoing.

345.1 JG to speak to RT regarding GridMon and GridPP funded network effort. DB reported that he had offered that a sysadmin at Glasgow get involved. GridMon had been raised with RT but no response had yet been received from Mark Leese. DB needed to speak to JG.
DB had spoken to both Robin Tasker & Mark Leese - Andrew Pickford was now dealing with this. Done, item closed.

346.3 PC & AS to speak to Robin Tasker regarding receipt of high-level or ticket information from JANET on the service. Ongoing.

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly setup environments. This was still in progress - he needed to extract data but was busy with STEP at the moment. Ongoing.

350.1 DB to investigate the possibility of submitting an abstract to the AHM. DB had contacted JC, AS, and Martin Bly - this would be possible, but DB had to do it - three pages were required, which might be done. Item closed.

350.2 DC to investigate the possibility of submitting an abstract to the AHM. Ongoing.

350.3 PC to investigate the possibility of submitting an abstract to the AHM. PC had pursued this in relation to data management with Greig Cowan. Done, item closed.

350.4 AS to investigate whether the Glasgow BDII can be tested as a backup UK BDII during the downtime associated with the move to R89. AS noted that this had been discussed at the UK II and agreed yes, but there had been no plan to progress it and no Tier-2 agreement gained. AS would forward the email to DB and he would progress it with the ScotGrid team next door. Done, item closed.

350.5 JC to check and verify that the contact list on the GOCDB is up-to-date - to be done by September. Ongoing.

350.6 GP to check and verify that the contact list on the UB pages is up-to-date - to be done by September. Done, item closed.

ACTIONS AS AT 22.06.09
======================

332.1 AS to provide a plan for the tape drives: this was being finalised this week. AS reported that his documents had disappeared and he would have to rework this. He would hopefully have it by Wednesday.

341.5 JC to investigate how EGEE VOs request resources (relates to enabling more VOs in the UK).
JC reported that there had been a change of responsibility within EGEE - he was checking on the best person to ask.

346.3 PC & AS to speak to Robin Tasker regarding receipt of high-level or ticket information from JANET on the service.

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly setup environments. This was still in progress - he needed to extract data but was busy with STEP at the moment.

350.2 DC to investigate the possibility of submitting an abstract to the AHM. DB would contact him.

350.5 JC to check and verify that the contact list on the GOCDB is up-to-date - to be done by September.

351.1 DB to submit OPN Backup Link Document to STFC for consideration prior to the next OC.

351.2 PMB to send comments to PC over the next week regarding the LHC Network Forward Look document.

351.3 ALL: drafts of all reports for the OC to be sent to DB by 10th August (project status papers) and 17th August (all other papers).

351.4 DB to contact Dave Wallom and Andy Richards regarding a contribution to GridPP23 and report back.

The next PMB would take place in a fortnight's time: July 6th at 12:55 pm.

Agreed further dates were as follows:

June 29th - canceled
July 6th - PMB
July 13th - PMB
July 20th - canceled
July 27th - PMB
Aug 3rd - canceled
Aug 10th - PMB
Aug 17th - canceled
Aug 24th - PMB
Aug 31st - canceled (UK Bank Holiday)
Sep 2nd - special PMB to address issues re OC meeting
Sep 7th - F2F at Cambridge