GridPP PMB Minutes 356 (24.08.09)
=================================

Present: David Britton (Chair), Roger Jones, Sarah Pearce, John Gordon, Andrew Sansum, Glenn Patrick, David Kelsey, Andy Richards, Robin Middleton, Tony Cass, Pete Gronbech (for JC)

Apologies: Tony Doyle, Steve Lloyd, Jeremy Coles, Pete Clarke, Dave Colling, Neil Geddes

1. Status of OC Documents
==========================

PMB-138-Project Status
----------------------

DB reported that he had circulated the first version today. The intro was in point form only. TC advised that he would provide his info soon. DB noted that there was dubiety about the EGI/EGEE status - a section was required on the EGEE Design Study and also EGI/NGI/SSC - or should all of these issues be in the external document? RM advised that it was worthwhile saying that the future is EGI, within a separate document. RM would cover the EGEE Design Study and link this into JG's document.

ACTION 356.1 JG to deal with EGI issues within that section of the OC document.

DB advised that the Tier-1 status looked fine. AS noted that he had been overtaken by events at RAL, and wanted to amend the machine room status info. DB advised that he had not amended the document yet, but could drop a new draft in without difficulty.

DB asked about UK Deployment Status - he did have some concerns about this section - the table of numbers looked wrong re disk MoU, and DB had asked JC to address this.

TD had reported that the Technical Director's report might not be required. DB had forwarded issues that he might cover but they had not managed to converge yet to discuss. SP reported that TD had sent her some notes but not all were relevant to the Project Map - was a technical overview required? DB said he would deal with this offline.

DB reported that the User Reports looked fine - they needed to harmonise in terms of length, and also some of the content needed to be checked for consistency. GP had provided the User Board Report.
Neasan O'Neill had provided info on dissemination and KE/EI. DB advised that he would continue to work on the draft tomorrow.

PMB-139-Project Map
-------------------

SP reported that she had circulated the first draft of the Project Map Report today. This was almost complete, and included updated figures for the Risk Register.

PMB-140-Resource Report
-----------------------

SP advised that the Resource Report was not yet complete - she was working on hardware figures and was awaiting info from AS. AS noted that he would provide the figures today, but did not advise publishing all information in minute detail, as this would be unnecessary and counter-productive. DB noted that he needed the document by Thursday a.m. There was a new regime with the new OC and they were being ultra-cautious. We needed to ensure consistency within the info provided.

DB noted that re the Risk Register, he was happy with SP's suggestions - SP should go ahead and finalise it. SP advised that any other updates were welcome - she would need to review the OPN wording.

PMB-141-OPN
-----------

DB suggested taking the OPN document that had been written several months ago, and adding to it with: 1) pre-history, assessment & timeline; 2) questions the OC had raised (which we have already answered) - all the information should be incorporated into one document. DB advised that he would do a presentation to the OC, explaining that this was not a technical issue, and not a new commitment, also we were not asking them for new funding, and were not moving between budget lines - this was a Management decision and they should not be judging it technically, but it will be on the OC Meeting Agenda. RJ asked if DB had conversed with Tony Medland. DB noted not yet - he needed to go through the process, as Trish had made their position clear.

PMB-142-CASTOR
--------------

DB asked about the CASTOR document. AS advised that he had minor updates to do. DB needed this soon.
PMB-143-EGI/NGI
---------------

Re EGI/NGI, JG had circulated a draft today which included SSCs. The document was a snapshot of an evolving situation - there will be an SSC for particle physics.

DB noted that he needed to receive comments on the documents and get things finalised this week. The F2F was on the 7th; the special PMB was scheduled for Sept 2nd in order to sign off the document. The whole thing needed to be read as a set, and finalised on the 2nd.

2. e-Science Review
====================

SP had circulated the draft document. DB had some comments and questions, but the document read well overall. Re the note on STEP'09 and draft distribution rates, the question to RJ was whether he could put these numbers into perspective. Do we have targets or rates? RJ noted that he could provide info re rates to the Tier-1. DB advised that he needed wording like: 'these figures exceed the requirements of the Tier-1 for initial running' - or something similar - whatever RJ felt was appropriate. DB asked how many Tier-2s there were. RJ noted 35 or 36. DB noted that in relation to the Glasgow situation, the numbers might be more in the context of 50 or 60 sites (as opposed to Tier-2s). RJ would provide info on the Tier-2s.

ACTION 356.2 RJ to provide DB with targets/rates context for STEP'09 and draft distribution rates; RJ to provide text on figures meeting the requirements for Tier-1 running; RJ to provide DB with info on Tier-2 numbers.

DB noted that he would do another draft and circulate this today re the review. Comments were needed a.m. tomorrow, prior to the deadline. It was noted that we did not have citations or papers to add. SP would hold the token for the proforma and DB would work on the document - they would exchange tomorrow.

3. Disaster Management
=======================

1. Aircon - AS advised that this had been a level 3 disaster; they were dealing with it via a subset of GridPP PMB members.
It had been down for a number of days and was re-started on Monday of last week; the situation had now dropped to level 2. The cause of the system being turned off was now disabled - it had given a signal that it was over-pressure, so the system was automatically taken down. The level set on the sensor had been too low. The pumps will not be shut down now - the system indicates that an engineer call-out is required. The incident was unlikely to recur as the sensor now can't initiate a shutdown.

2. Water leak - AS reported that the system on the 2nd floor cools the offices on the 1st - condensation had been dripping off the chiller system into the drip tray, this overflowed, and the water tracked across the floor and onto the robot, which took out the electrics. This has since been rectified.

3. Swine Flu - It was noted that a procedure was being worked out - RAL have had one member of staff struck down with this so far, but the person is back at work now. The outstanding question was, if we see a recurrence, what would they do about attending GridPP23? DB noted that it had been a single case so far and was therefore not a cause for concern. However, if other cases within the team were reported then it would be prudent not to attend - this was a decision for the Tier-1.

4. Procurement - AS reported that a tranche of servers had failed the acceptance tests, and they had also failed the supplier's own test. This was disappointing; meetings were taking place and the firmware was being tested. The disk units had been supplied by another vendor, and we would know about them on Tuesday. A process was being followed but there was no solution as yet - AS advised that deployment before the end of September was unlikely.

5. Linux vulnerability - AS noted that he had not managed to get this up and running for the experiments; work was ongoing. They were trying to firm up a plan for returning the servers.
PG noted that they had turned the UI off at present, however most sites usually blocked the modules and access to the kernel - was that an option for the Tier-1? AS advised that they weren't running any servers at all, so it hadn't made any difference to turn off the UI as well. They had wanted to turn back on, on Thursday, but had run out of time. New kernels had arrived from RedHat today. AS noted that the issue had only affected a small number of Grid users. DB advised that as they were currently running 4 other Disaster Management procedures, it would be better not to use the DM system formally for this additional issue.

4. Dissemination in EGI and SSC
================================

SP reported that they have two dissemination 'prongs' at EGI - one for SSC in relation to joint dissemination and training (KE); Neasan O'Neill was leading that. The other was for dissemination as a Global Task within EGI - two people would possibly be funded within EGI and based at Amsterdam. An Expression of Interest was being drafted at present in order to host the Global Task as a back-up in case Amsterdam were not interested. Then, SSC for HEP would have a dissemination element, and this was a definite area of expertise for us. DB noted that this had arisen due to Jamie asking for work package inputs in SSC, which contained HEP - we could envisage running the dissemination tasks.

SP asked about other EGI Global Tasks. JG noted the list as follows:

- security policy (DK) and an Operational Security Post
- Accounting
- APEL
- GocDB
- Training (Edinburgh)

JG asked if there was anything else we wanted to bid for. DB thought that we were already bidding for enough, and there didn't seem to be anything else on the list that was suitable. JG said he would get NG to submit these as the UK EGI representative. JG noted that there would be a process to split up the money in EGI relating to international tasks, but this was not known yet. JG asked if there was scope for NA4 at Glasgow.
DB would look at that.

5. Weekly Notes
================

It was noted that GP had not yet received the ATLAS or CMS RAM requirements. RJ would provide these (based on a two-year forward look). LHCb hardware numbers had still not been received - RN was working on them and they should be available soon.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS reported as follows:

Fabric:
------

1) Cooling problems in the machine room w/b 10th August caused severe disruption to the service. Two breaks in service, one short-lived (<12 hours) and one of 6 days. Fault is partially understood and relates to pressure trigger levels in the coolant system and how they might trigger a shutdown of cooling by the building management system. Changes have been made and the exact failure mode cannot happen again; however, work continues to gain a full understanding.

2) A water leak developed above the robot and caused some damage, however this appeared to be mainly cosmetic splash damage. Cause was condensation overflowing a drip tray upstairs.

3) Acceptance tests of the CPU have been impacted by cooling shutdown and unexpected staff absence. It is likely that they will still be available by the end of August as planned.

4) 35 disk servers of one lot are ready for deployment. Ten more will be out of testing by the end of August and ready for deployment. The five remaining nodes are further behind due to various hardware faults.

5) Acceptance tests of the second lot of disk have encountered a problem. We are following a range of investigative avenues. We have new disk and RAID controller firmware under test and drives are being investigated by the drive manufacturer. We have had two meetings with the supplier and are satisfied with their response so far; we are scheduled to meet again on Tuesday. We have no estimated completion date at the moment.

6) New procurements have started.
- Disk PQQ is being evaluated (scheduled for completion on 28th August)
- CPU PQQ is running

7) Following the recently announced security vulnerability, access to the UIs was terminated and has not yet been re-instated.

Staffing:
--------

1) The first experiment support post has started (Andrew Lahiff). The second post has interviewed and negotiation is underway.

2) The EGEE PPS post has started.

Service:
-------

1) SAM availability for the OPS VO was 80%. Downtime mainly due to the late start following the air-conditioning problems.

2) CASTOR - Unscheduled downtime Sunday night caused by CRLs not correctly updated.

3) The Tier-1 team continues to plan and prepare for Swine Flu. The Tier-1 team had a work-at-home day last Wednesday as part of our preparations for Swine Flu. Although a few staff remained at work (for a variety of reasons) most worked from home. The day was largely successful and will be repeated in a few weeks' time.

4) The FTS and LFC have been moved to a new resilient ORACLE RAC and the ATLAS LFC has been separated from the general LFC.

5) Work is continuing to move the 3D service to the same hardware.

6) Final details are being agreed for the SL5 migration. The main bulk of this work is scheduled for mid September.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ reported that there had been disruption at the Tier-1 last week. Things were quiet generally. There had been miscommunication regarding 3D and the Oracle servers - the issue had not been brought to the Tier-1 meeting. RJ would forward the information to AS.

SI-3 CMS weekly review & plans
-------------------------------

DC was not present.

SI-4 LHCb weekly review & plans
--------------------------------

GP noted that they were currently validating previous monte carlo low-level production. They were cleaning up disk space at the Tier-1.
SI-5 Production Manager's Report
---------------------------------

Pete Gronbech provided the following report (pp JC):

1) The Tier-2 testing using ATLAS Hammercloud submissions is continuing. At last week's storage meeting RHUL had carried out tests and got better results with RFIO and small buffer sizes than the current file staging method. UK sites would like ATLAS to submit a Hammercloud test using the RFIO method again. This time all sites would be configured to use 4k buffer sizes. We need to wait for Graeme and Peter Love to request this from ATLAS.

2) Many sites have upgraded to DPM 1.7 (1.7.0-6); those that remain will update to 1.7.2. The SL5 version for the server node has been released and we hope pool node software will be available shortly. Many sites would like to run their storage servers under SL5. Concern was expressed that the dpm-drain command was still extremely slow, taking weeks to drain 8TB.

3) ROD work was hampered by the problems at RAL. The GOCDB was running in failover mode, but the CIC portal did not cope well with this. Daniela reported: “ROD duty was uneventful this week, even UCL-CENTRAL responded. COD is still too enthusiastic reminding us of 'problems' that aren't any: The problem is that alarms age even if the site is in downtime, so when RAL came out of downtime last week and the state of the site turned green I wasn't quick enough to get rid of the old but now OK alarm triggering the usual micromanagement.....”

4) SL5 migration plans at sites have been developing. T1 has a few WNs behind a CE and will convert the main cluster mid Sept. Some London sites have set up test nodes, and others such as UCL CC have external constraints that will mean delaying till January. Glasgow now have 112 job slots running SL5, and plan to complete migration by 14th Sept. Durham by 18th Oct. Edinburgh plan not known. SouthGrid, Oxford have 16 job slots and plan to migrate gradually with the majority moving in September.
Other SouthGrid sites are in the process of setting up test nodes.

5) Security update: An announcement was made on 14th August about the privilege escalation vulnerability in the kernel (CVE-2009-2692). A workaround recommended by RedHat https://bugzilla.redhat.com/show_bug.cgi?id=CVE-2009-2692#c10 was applied rapidly at most UK sites on the same day or early the next week. RedHat have released a fixed kernel this morning (24th August), so the SL fix should be available shortly.

There was a discussion on HEPSPEC06 benchmarking - PG reported that most sites were doing this now, and it was reported on the GridPP wiki. There was an outstanding issue of reprocessing data - have all sites been asked to do this? And, for the next hardware tranche, what would the figures be based on? JG asked if we had enough data now. PG noted that some sites, yes, they had the equivalent HEPSPEC06 value, but a different way of calculating hours used. JG asked how the HEPSPEC numbers compare with the prior SPECINT numbers. PG noted that they were lower, generally, than previously but the values depended on different types of hardware. DB advised that we needed to defer this issue until SL and JC were present - this issue had arisen originally due to discrepancies found on the ATLAS code which had been raised by Graeme Stewart.

ACTION 356.3 DB to discuss the issue of HEPSPEC06 benchmarking with SL and JC offline, and raise an appropriate action following discussion.

SI-6 LCG Management Board Report
---------------------------------

DB noted that two issues had been raised: 1) discussion of ALICE requirement re the CREAM CE and SL5; 2) SSC developments ongoing.

SI-7 Dissemination Report
--------------------------

SP reported that Neasan O'Neill was currently preparing for the Festival of Science.

REVIEW OF ACTIONS
=================

346.3 PC & AS to speak to Robin Tasker regarding receipt of high-level or ticket information from JANET on the service. AS had sent a reminder - no reply as yet.
Done, action closed.

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly setup environments. This was still in progress - DB noted that the next Quarterly Reports will help and possibly render the action redundant. SP asked that this remain open until the next Quarterly Reports. Ongoing.

350.5 JC to check and verify that the contact list on the GOCDB is up-to-date - to be done by September. Ongoing.

351.3 ALL: drafts of all reports for the OC to be sent to DB by 10th August (project status papers) and 17th August (all other papers). Done, item closed.

354.1 JC to get more info on e-NMR status and report-back; JC to also raise this issue of GridPP support for them at dTeam. Ongoing.

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring. Ongoing.

354.4 DB to co-ordinate 16-20 page Project Status report for the OC and ensure it is submitted on time. Ongoing.

354.5 DB to write a 1-page Introduction for the OC Project Status Report.

354.6 TC to write a 1-page report on LCG Status for OC Project Status Report.

354.7 RM, with input from NG, to write a 1-2 page report on EGEE/EGI Status for the OC Project Status Report.

354.9 JC, with input from SL, to write a 2-4 page report on UK Deployment, for the OC Project Status Report.

354.10 TD to write a 1-page Technical Director's report, for the OC Project Status Report.

354.11 RJ to provide a 2-page User Report, to include relevant figures, on behalf of ATLAS, for the OC Project Status Report.

354.12 DC to provide a 2-page User Report, to include relevant figures, on behalf of CMS for the OC Project Status Report.

354.13 GP (& RN) to provide a 2-page User Report, to include relevant figures, on behalf of LHCb for the OC Project Status Report.

354.14 GP to provide a 1-page User Board Report, for the OC Project Status Report.
354.15 SP to provide a 1-2 page summary report on EI/KT & Dissemination, for the OC Project Status Report.

354.16 SP to provide the ProjectMap Report, to include the Risk Register, for the OC Project Status Report.

354.17 SP to provide the Resource Report, for the OC Project Status Report.

354.18 The Case for OPN back-up link, already completed by PC, to be included in the OC Project Status Report - DB to do. This item to be removed, and replaced by a new action:

356.4 A new individual document on the case for the OPN back-up link to be prepared for the OC by DB and PC, addressing all issues required.

354.20 JG to provide a report on EGI/NGI/NGS and future scenarios (point 35 and Action from last OC meeting), for the OC Project Status Report. Ongoing.

355.1 JG to speak to CERN and find out what they intended to bid for; JG would speak to David Fergusson and check the situation re bidding for training etc. Done, item closed.

355.2 ALL - to provide inputs or metrics to JG & AR (looking at pp9-10 specifically, and sending comments to JG & AR) - p4 onwards provides the relevant info. Done, item closed.

355.3 SP to re-organise the e-Science review document into headings based on the Terms of Reference and forward to DB for further inputs, by Tuesday. Done, item closed.

355.4 JG to do a draft Agenda for the e-science review visit. Ongoing.

355.5 DB to send email response re OPN backup link questions, to Trish Mullins. Done, item closed.

ACTIONS AS AT 24.08.09
======================

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly setup environments. This was still in progress - DB noted that the next Quarterly Reports will help and possibly render the action redundant. SP asked that this remain open until the next Quarterly Reports.

350.5 JC to check and verify that the contact list on the GOCDB is up-to-date - to be done by September.
354.1 JC to get more info on e-NMR status and report-back; JC to also raise this issue of GridPP support for them at dTeam.

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring.

354.4 DB to co-ordinate 16-20 page Project Status report for the OC and ensure it is submitted on time.

354.5 DB to write a 1-page Introduction for the OC Project Status Report.

354.6 TC to write a 1-page report on LCG Status for OC Project Status Report.

354.7 RM, with input from NG, to write a 1-2 page report on EGEE/EGI Status for the OC Project Status Report.

354.9 JC, with input from SL, to write a 2-4 page report on UK Deployment, for the OC Project Status Report.

354.10 TD to write a 1-page Technical Director's report, for the OC Project Status Report.

354.11 RJ to provide a 2-page User Report, to include relevant figures, on behalf of ATLAS, for the OC Project Status Report.

354.12 DC to provide a 2-page User Report, to include relevant figures, on behalf of CMS for the OC Project Status Report.

354.13 GP (& RN) to provide a 2-page User Report, to include relevant figures, on behalf of LHCb for the OC Project Status Report.

354.14 GP to provide a 1-page User Board Report, for the OC Project Status Report.

354.15 SP to provide a 1-2 page summary report on EI/KT & Dissemination, for the OC Project Status Report.

354.16 SP to provide the ProjectMap Report, to include the Risk Register, for the OC Project Status Report.

354.17 SP to provide the Resource Report, for the OC Project Status Report.

354.20 JG to provide a report on EGI/NGI/NGS and future scenarios (point 35 and Action from last OC meeting), for the OC Project Status Report.

355.4 JG to do a draft Agenda for the e-science review visit.

356.1 JG to deal with EGI issues for EGI section of the OC document.
356.2 RJ to provide DB with targets/rates context for STEP'09 and draft distribution rates; RJ to provide text on figures meeting the requirements for Tier-1 running; RJ to provide DB with info on Tier-2 numbers.

356.3 DB to discuss the issue of HEPSPEC06 benchmarking with SL and JC offline, and raise an appropriate action following discussion.

356.4 A new individual document on the case for the OPN back-up link to be prepared for the OC by DB and PC, addressing all issues required.

TIMELINE FOR OC DOCUMENTS:

Sep 1st: Final version of papers available
Sep 7th: F2F at Cambridge and OC papers submitted.
Sep 15th: OC at MRC in London

FURTHER PMB MEETING DATES:

Aug 31st - cancelled (UK Bank Holiday)
Sep 2nd - special PMB to address issues re OC meeting
Sep 7th - F2F at Cambridge

The NEXT PMB will be a special PMB to address issues re OC meeting, and the OC documents; also discussing HEPSPEC accounting information - Wednesday 2nd September @ 12.55 pm.