GridPP PMB Minutes 361 (05.10.09)
=================================

Present: David Britton (Chair), Sarah Pearce, John Gordon, Andrew Sansum, Dave Colling, Tony Doyle, Jeremy Coles, Robin Middleton, David Kelsey, Pete Clarke (Suzanne Scott, Minutes)

Apologies: Roger Jones, Glenn Patrick, Steve Lloyd, Tony Cass, Neil Geddes

1. Security Patching
=====================

DB reported that Minchao Ma had sent round an update to the PMB re the patching status. JC advised that only one site was precarious: Bristol. The Worker Nodes were on an HPC cluster and it was the vendor who had to change the kernel for them. The schedule was to perform the update on 20th October, which was past the deadline. It was suggested that Bristol take themselves out of service. TD noted this as a good example of how to remove sites centrally. JG noted that they were uncertified via the GOCDB. JC said no - this was not sufficient - he was discussing the issue with MM. DB suggested that the GridPP deadline be set for next Thursday for Winnie Lacesso to remove Bristol from use. The deadline was agreed as Wednesday 14th. It was agreed that JC would speak to WL in the first instance; DB could write to her explicitly if that would help her. It was agreed that after Wed 14th the site would be disabled.

ACTION 361.1 JC to speak to Winnie Lacesso (regarding the kernel updates) about removing Bristol's CE and disabling the site by Wednesday 14th October. DB to write formally if she felt this was required.

DB asked about the other two sites. JC noted only Bristol as a problem; the status of the other two sites was fine and JC advised that communication issues had been improved. Regarding the 'false positives', JC reported that this was due to the primary key in the database not being unique - the cluster name was being used but many clusters had the same names, so for example the problem attributed to Glasgow had actually been Bristol.

2. EGI Global Tasks
====================

DB advised that NG had circulated information. Was there anything we needed to respond to? TD thought that it would be difficult to provide input at this stage. JG noted that these posts had not been bid for yet, but we didn't do them anyway, and matched funding would be needed. DB would respond to NG noting that there would be no response re additional tasks.

ACTION 361.2 DB to contact NG re EGI Global Tasks and inform him that there were no additional tasks being bid for by GridPP.

3. Tier-2 Hardware Allocations
===============================

It was reported that SL had met with 5 others re the Tier-2 Hardware accounting numbers. UCL and Cambridge remained outstanding but this wouldn't affect everyone. SL had asked about the JET share. Would this be included in the calculations? Should a share of the funding go to SouthGrid to distribute? This was similar to Durham - money went to ScotGrid to disseminate to Durham. JC noted it would be a small effect only - it was more a political issue. DB suggested including JET in the SouthGrid allocation - this would mean a small increase in funds, assuming that JET would continue to provide a service. DB noted it related to a small amount of money, which should go to a HEP service rather than an outside organisation, which would set a precedent. DC advised that CMS had no immediate plans to use Durham. DB confirmed that CMS did use ScotGrid though. JC advised that LHCb were the biggest users of JET but CMS and ATLAS had also used JET over the past year. DB suggested that we use the usage figure and come up with a provisional allocation to Institutes - a few thousand might be attributed to JET, but this would be at SouthGrid's discretion. The figures would then be provided to the experiments for comment. The only decision here was whether usage at JET should be included in the calculation - it should be. The final distribution would be run by the experiments as previously agreed.
DB noted he would pass the figures back to SL to include JET, so that he could come up with the distribution which the raw numbers suggested.

4. Week's Notes
================

DB reported that Boston wished to sponsor both RHUL and Ambleside. TD noted they had sponsored us for a long time and that this was good news.

5. AOB
=======

SP reminded everyone that she needed the Quarterly Reports by the 16th October.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS provided the following report:

Fabric:

1) Cooling. We are still tracking the cooling issues experienced in August in the Disaster Management system, but have had no further problems and are waiting until investigations have reached a conclusion.

2) Water leak. We are still tracking this problem in the Disaster Management system, but a recurrence is unlikely owing to temporary measures in place. Meetings have recently taken place to assess what corrective work needs to be carried out.

3) Lot 2 of the disk servers has failed acceptance. We are working with the supplier to identify the cause. A possible solution is under test at the moment. A meeting with the suppliers is planned for Tuesday.

4) New procurements have started.
   - Disk ITT has closed and evaluation will commence this week. Delivery target, December and April. The evaluation has run into some technical problems. We are considering our options but expect a 3 week slippage over the original schedule.
   - CPU PQQ has been evaluated. The invitation to tender has been issued on schedule. Delivery target February.

5) We have ordered 9 T10KB drives. We aim to move CMS to T10KB by December.

6) Procurement is underway for an additional 4*1Gb/s second OPN link to CERN as resilient backup.

Staffing:

1) Alastair Dewhurst started today in the ATLAS support post. The Tier-1 team are at full complement.

Service:

1) SAM availability for the OPS VO was 86%.
Weekly production report is at:
https://www.gridpp.ac.uk/wiki/RAL_Tier1_Experiments_Liaison_Meeting_Operations_Reports
RAL has been in unscheduled downtime since 16:00 on Sunday. This has now been extended to 2pm Tuesday. See CASTOR note below.

2) CASTOR
   a) We continue to have problems with the CASTOR RAID arrays. One array of the pair of resilient RAID arrays was taken out of production last week while we worked with engineers to diagnose the cause of recent instability. On Sunday, the second array failed in a similar manner, taking the whole of CASTOR down. An engineer is on site and is investigating the cause. An initial SIR is planned to be drafted for the 2pm operations meeting.
   b) The CASTOR information provider (CIP) was upgraded on Tuesday 29th Sept. This unexpectedly caused problems with the OPS SAM test on the CEs, leading to unscheduled downtime. As far as we are aware there was no impact on the experiments.
   c) An upgrade to SRM 2.8.1 is scheduled for today. This upgrade fixes a problem identified by ATLAS in the SRM 2.8.0 release, correcting the format of the checksum returned.

3) LHCB 3D service was migrated to new hardware today.

4) We handled an alarm test successfully on Friday.

5) The SL5 service now hosts 90% of farm capacity.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ was absent.

SI-3 CMS weekly review & plans
-------------------------------

DC reported that the October exercise started today, focussed on skimming, producing etc. This was seen as an episode in teamwork to ensure analysis groups had procedures in place. The Tier-2s would be rated usable or not usable - all Tier-2s had been usable except one, which should be up soon. DC advised that there had been Tier-1 problems last week. AS asked about Tier-1 involvement in the October exercise. DC noted no mention of the Tier-1 - it seemed to be a Tier-2 exercise. He would check at the ops meeting today and email AS.
SI-4 LHCb weekly review & plans
--------------------------------

GP was absent.

SI-5 Production Manager's Report
---------------------------------

JC provided the following report:

1) Only one GridPP/UKI site remains a high risk with regard to the much discussed kernel vulnerability. That site has a plan to update but this does not come under the direct control of our HEP partners at the site. We have also followed up with sites where there was an issue with communication and have some assurance that the situation is now improved. Also, the matter of test false positives now seems to be (better) understood and results from the use of a non-unique primary key in the (OSCT) database.

2) As of last week (1st October) GridPP sites were asked to switch to publishing their KSI2K values as derived from their HEPSPEC06 measurements. This is confirmed for ScotGrid sites. All but Sheffield in NorthGrid. All but JET in SouthGrid. Two sites are known to have done this in LondonGrid (RHUL and Imperial). A query this morning indicates several instances where the GlueClusterUniqueID fields are incorrectly or incompletely filled.

3) In the accounting discussions on Friday afternoon agreement was reached about the relative impacts of scaling the APEL data for sites given the measured HEPSPEC06 values. The biggest change affects Glasgow where the site was found to be under-reporting its "true" KSI2K by a factor of about 1.34 (due to an incorrect KSI2K value given by their supplier).

4) John circulated a request for feedback on whether sites would benefit from funding additional work on developing HEP requirements in the area of AFS. Only a few Tier-2s replied but in those cases AFS was seen as critical for grid and non-grid work at the site.

5) There is a proposal being discussed/assessed (briefly mentioned at the last PMB) about changing the nature and format of deployment team meetings.
Some sysadmins would like to join a more regular meeting to discuss and learn about ongoing issues. There is also a desire from the experiment side to find ways to more quickly follow up on issues as we enter a period of increased (new) user activity on the grid. One disadvantage for the core team is that strategy and action review will become less focussed. We need to reach a compromise (new meetings or changed meetings…) in the coming weeks and PMB thoughts on this are quite welcome.

DB was not sure of the synergy in relation to the weekly Tier-1? Was this a move from CASTOR to Tier-1? Did they need a weekly ops meeting? DB suggested that a weekly ops meeting from a site view seemed to be required. JC advised that they were trying not to introduce any new meetings, but rather combine with what they already do. DC advised avoiding going round every site, at every meeting, in great detail - if the dTeam were to remain effective, this needs to be focussed on specific things in not too much detail. DB agreed, advising that the Chair needed to have a strong role. He advised JC to look at the people who wanted to join and work out how best to get them in - the meeting needed to be appropriate for the recipients, a daily ops meeting was also available. DB advised trying something and reviewing how it went. JC noted that more regular ops meetings would be required in the near future anyway.

6) From the WLCG and EGEE availability/reliability reports further explanation was sought for RHUL and UCL-CENTRAL. The August/September results for the former appear to have been the result of the site bdii being marked as in downtime even though the site was running as normal. UCL-CENTRAL has been affected by scheduled updates of their cluster filesystem.
SI-6 LCG Management Board Report
---------------------------------

DB reported that they had received feedback from the LHCC meeting: there had been criticism of the CRSG process, and non-agreement between them and the experiments regarding hardware numbers. CERN would devise a better way forward. Security had also been discussed - the MB supported EGEE management in this regard. Steve Traylen had spoken about monitoring installed capacity. DB had forwarded the talk to AS and JC. The GDB had given an update on SL5 - action on all to check the gstat2 results.

ACTION 361.3 JC and AS to check Tier-1 and Tier-2 gstat2 results (in relation to SL5 having been discussed at the GDB). JC to devolve any action to the dTeam.

SI-7 Dissemination Report
--------------------------

SP reported that Neasan O'Neill had put up a news item about EGEE and gqsub at Glasgow; he was managing the EGEE presence at SuperComputing in Portland. SP asked whether GridPP wished to send anyone to Portland. It was due to take place from November 16th. If anyone wished to go, they should let SP or Neasan know. DB noted that we could afford the travel if a couple of people wished to go - if it was reasonably priced. DC advised that he would speak to RJ about this. SP reported that there would be an NGS/GridPP booth at the All Hands meeting. DB asked about the status of papers for All Hands. DC noted that his had been accepted. JG would check. The point was raised that Neasan going to Computing Conferences was perhaps not the best use of resources - it would be better to target a physics-related audience.

REVIEW OF ACTIONS
=================

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly setup environments. This was still in progress - DB noted that the next Quarterly Reports will help and possibly render the action redundant.
SP asked that this remain open until the next Quarterly Reports. ONGOING.

350.5 JC to check and verify that the contact list on the GOCDB is up-to-date - to be done by September. Done, item closed.

354.1 JC to get more info on e-NMR status and report back; JC to also raise this issue of GridPP support for them at dTeam. Done, item closed.

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring. ONGOING.

355.4 JG to do a draft Agenda for the e-science review visit. Done, item closed.

356.2 In the context of the e-Science Review document, re the STEP'09 note and draft distribution rates - was it possible to put these numbers into perspective? RJ to provide DB with targets/rates context for STEP'09 and draft distribution rates; RJ to provide appropriate wording on figures meeting the requirements for Tier-1 running, eg: 'these figures exceed the requirements of the Tier-1 for initial running' - or something similar; RJ to provide DB with info on Tier-2 numbers, ie: how many Tier-2s were there. Done, item closed.

356.3 In the context of a discussion on HEPSPEC06 benchmarking, there were issues of having enough data, the different ways used to calculate hours, and also the comparison between HEPSPEC numbers and prior SPECINT values. DB to discuss the issue of HEPSPEC06 benchmarking with SL and JC offline, and raise an appropriate action following discussion. Done, item closed.

358.1 SP to work with the working group on the following issues in relation to GridPP/NGS convergence: 1. identify Institutes 2. identify manpower 3. decide who is bidding for what - a draft transition plan would be made available by the end of the year; GridPP4 requirements would also be considered. ONGOING.

358.2 GP will talk to LHCb and see if they can progress the issue of CASTOR 2.1.8, and come back to us. We would require a strong plea from LHCb that they want this by December. ONGOING.
359.1 In the context of GridPP sustainability as highlighted by the OC, JG to circulate the EGEE document on cloud computing for further discussion. TC to provide the relevant urls from the CHEP talks. Done, item closed.

359.2 In the context of hardware pledges and figures, JG to email Tony Medland and give him a heads-up that figures were coming. Done, item closed.

359.3 SL to convene an Accounting & Benchmarking sub-group, comprising SL, JC, and the four x Tier-2 Co-ordinators, to meet on 2nd October to discuss the figures and follow the action plan as outlined below: Done, action closed.

---------------------------
From the Deployment Board:

05.03 The Accounting & Benchmarking sub-group, comprising SL, JC, and the four x Tier-2 Co-ordinators, would meet on 2nd October to discuss the figures and follow the action plan as outlined above (see points 1-7 reproduced below, with the note following).
---------------------------

DC suggested that setting up a sub-group to handle this specific issue would be useful. DB agreed, proposing the four tier-2 co-ordinators plus a couple of senior people (including SL) to moderate - there should be a proposal to go to the Deployment Board. DC proposed the following course of action:

1. specint them all
2. calculate what we can
3. adjust the ones we can't
4. compare the adjustment with those who haven't done it properly
5. if within 10% then ok
6. set-up a sub-group comprising JC, SL and the four Tier-2 Co-ordinators
7. agree timescale

Figure was £400k this financial year, from STFC. One month only could be allowed for convergence, as time was short - proposed date was Friday 16 October. SL advised that it should not advantage sites who can't do it. Decisions should be referred to the PMB. This was all agreed.
---------------------------

359.4 JC to follow up dTeam actions from the DB, as follows:

---------------------------
05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda). ONGOING.

05.04 dTeam to publicise the 1st October as the changeover date to HEPSPEC06. Done, item closed.
---------------------------

359.5 GS, RN, DC, LB (experiment reps) each to contact Neasan O'Neill, advising where, specifically, the GridPP website should point to for each of their experiments, in terms of user support information. ONGOING.

359.6 SP to ensure that Neasan O'Neill updates the GridPP website accordingly (once experiment reps have provided info as to where the GridPP website should point to for each of their experiments, in terms of user support information). ONGOING.

ACTIONS AS AT 05.10.09
======================

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly setup environments. This was still in progress - DB noted that the next Quarterly Reports will help and possibly render the action redundant. SP asked that this remain open until the next Quarterly Reports.

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring.

358.1 SP to work with the working group on the following issues in relation to GridPP/NGS convergence: 1. identify Institutes 2. identify manpower 3. decide who is bidding for what - a draft transition plan would be made available by the end of the year; GridPP4 requirements would also be considered.

358.2 GP will talk to LHCb and see if they can progress the issue of CASTOR 2.1.8, and come back to us. We would require a strong plea from LHCb that they want this by December. DB would contact Raja Nandakumar.
359.4 JC to follow up dTeam actions from the DB, as follows:

---------------------------
05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda).
---------------------------

359.5 GS, RN, DC, LB (experiment reps) each to contact Neasan O'Neill, advising where, specifically, the GridPP website should point to for each of their experiments, in terms of user support information. SP to follow-up.

359.6 SP to ensure that Neasan O'Neill updates the GridPP website accordingly (once experiment reps have provided info as to where the GridPP website should point to for each of their experiments, in terms of user support information).

361.1 JC to speak to Winnie Lacesso (regarding the kernel updates) about removing Bristol's CE and disabling the site by Wednesday 14th October. DB to write formally if she felt this was required.

361.2 DB to contact NG re EGI Global Tasks and inform him that there were no additional tasks being bid for by GridPP.

361.3 JC and AS to check Tier-1 and Tier-2 gstat2 results (in relation to SL5 having been discussed at the GDB). JC to devolve any action to the dTeam.

DB advised that he would be absent next week (Monday 12th October) - JG would chair in his stead. DB advised that he would be away, and probably out of email contact, from Sunday 11th October to Sunday 18th. JG would formally cover for his absence.

The next PMB would take place on Monday 12th October at 12:55 pm.