GridPP PMB Minutes 337 - 9th February 2009
==========================================

Present: David Britton (Chair), Tony Doyle, Sarah Pearce, Roger Jones, David Kelsey, Pete Clarke, Jeremy Coles, Steve Lloyd, Glenn Patrick, John Gordon, Robin Middleton, Tony Cass (Suzanne Scott - Minutes)

Apologies: Andrew Sansum, Neil Geddes, Dave Colling

1. Quarterly Reports
=====================

SP reported that most Quarterly Reports had now been received. However she was still awaiting:

- DC: CMS & Middleware
- Andrew McNab: Security

DB asked if SP wished to raise any issues re the Quarterly Report info next week? SP noted yes.

2. OC Minutes
==============

DB reported that Minutes had been received from the OC which had taken place on 15th December 2008. DB asked whether the actions & feedback were covered by ongoing actions at PMB level?

a) GridPP to present clear case for the requirement for the back-up link for the OPN to the next meeting

This was a current action on PC (332.3)

b) GridPP to present the revised MoU costs at the next meeting (this related to Slide 9)

It was noted that this slide showed Tier-1 and Tier-2 site resources. SL noted that this referred to a minor error in a table that had subsequently been corrected and was nothing to do with costs. DB would contact Malcolm.

ACTION 337.1 DB to contact Malcolm Booy in relation to revised MoU costs and inform him that the error in the figures did not relate to 'costs' at all, and also did not affect the conclusion reached.

c) GridPP to outline the prioritisation process and present usage data for other experiments at the next meeting

DB noted that he had sent GP an email on Jan 5th regarding the outline of policy - had he received this? DB advised that he had not noticed a reply? (GP noted he would re-send his response; TD noted that the GU email had been down around that time). It was reported that the action re policy was already done - DB would circulate to the PMB.
ACTION 337.2 DB to circulate the policy information re the prioritisation process to the PMB.

Regarding the other part of the item - usage data - this could be done next time for the OC.

d) GridPP to determine the ideal and fall back situations relating to NGS/EGI/NGI and report at the next meeting

DB noted that a Working Group had been set up - but no action from the recent F2F meeting had been recorded.

ACTION 337.3 JC & JG to form a Working Group with Andy Richards and Dave Wallom to define which Grid services could be run by a UK NGI post April 2011.

DB advised that this was on the Agenda for the UCL meeting via JG. JG to ensure this issue is pushed forward.

Re 'Feedback Items' in the OC Minutes, it was noted that:

e) Point 31 relating to storage work at RAL - DB asked how the extra staff were being implemented? JG confirmed one person employed so far, but JG would need to confirm overall status - a plan was currently in place and JG would circulate details to the PMB.

ACTION 337.4 JG to circulate details of current plan and progress in relation to FTE effort on storage at RAL.

f) GridPP to provide progress report on CASTOR to the next meeting

DB advised that a progress report on CASTOR was required at the next OC meeting. AS to note this.

g) Point 34 in relation to GridPP examining the capacity for satisfying the needs of other experiments, and having a process to identify priorities.

DB noted that sufficient capacity was problematic given that capital was shrinking, but it would be documented. It was noted that 'other' experiments' requirements were also to be analysed.

h) GridPP were reminded to promote all contributions from the project to KE and EI. A paragraph on these topics should be included within the Dissemination element of the project status report.

DB advised that we need to look at everything through KE and economic impact - we need to do this continuously. DB advised that a statement needed to be made that the KE post was, however, not funded.
PC noted that CERN generated economic impact even if the UK did not exploit this. There was a discussion on KE and other projects involved; use of Grid infrastructure; economic impact in the long term. TD advised that STFC were looking for joint projects - maintenance of a web page would be useful in listing KE projects and EI.

ACTION 337.5 SP to request that, via Neasan O'Neill, the web page on KE and EI be made more prominent, and maintained.

3. LHC Schedule
================

An email had been circulated re the Chamonix outcome. There was pressure for a long run from this Autumn. The experiments were expected to provide info on their requirements soon. It was noted that probably two 'standard' years of data would be taken.

4. This week's notes
=====================

DB asked JC about potential dates for GridPP23 at Cambridge - previously discussed? It was noted that EGEE was towards the end of September. DB would speak to JC offline and come back to the PMB with proposed dates for GridPP23.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

In absentia, AS provided the following report:

We have been running partially (fully this afternoon) unattended these last two days owing to the volume of snow on the roads around here.

I have not received an update on R89 since Wednesday, but the builders appear to be making progress. Main stumbling blocks were obtaining a fire certificate and producing sufficient airflow. Both seem to be moving forward but have not yet been completed. New alarms were being fitted in the offices that the builders believe will resolve the fire certificate. Experts were on site on Wednesday to identify where all the missing airflow is going (25% or more does not appear to come out of the vents). This may be a real discrepancy or some form of measurement/modelling problem. I doubt we will have the building by 9th February but it is conceivable that the building can be delivered fairly soon.
However with major stumbling blocks still outstanding, considerable uncertainty remains and we will not reschedule the migration until we have the building.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ provided the following report:

We upgraded the site services box at RAL last Monday, and it seems to be working much better. Aside from that, Glasgow's DPM seems to be struggling under analysis loads (which are high); they are upgrading the front ends and experimenting with xrootd to see if things can be improved.

It was noted that the issue was to do with X509 authentication. There was further discussion of user access at the Tier-1 - the response will be that we are 'extremely cautious' about it due to the differing Tier-1 capacities. The computing model was unlikely to be changed at present.

SI-3 CMS weekly review & plans
-------------------------------

In absentia, DC provided the following report:

All the CRAFT data has been transferred to RAL for reprocessing (this reprocessing should have started last week and may even be complete by now, I have not checked specifically but know that this is under some time pressure). We delayed an upgrade last week so as not to delay this. We are still preparing for the end to end tests. This morning there were some transfer errors out of RAL but I haven't seen any resolution on this. Other than this things are pretty quiet.

SI-4 LHCb weekly review & plans
--------------------------------

GP provided the following report:

Fairly quiet week.

1) Final FEST reconstruction activities at RAL quite successful. 4 stalled jobs (out of 51), which stalled at various different times - so not considered to be a RAL problem.

2) Low number of LHCb jobs on the grid - <1000 jobs running at end of last week. Identified as a problem with the agent director, which puts out pilot agents. Probably a bug in the latest version of DIRAC which was put in production on Tuesday/Wednesday.

Outlook: Production, user analysis.
SI-5 Production Manager's report
---------------------------------

JC provided the following report:

1) There is a proposal from EGEE SA3 that all gLite 3.0 clients and services become obsolete from April 2009. This has the potential to impact Imperial College HEP, Cambridge and RAL T1 because they still run a gLite 3.0 CE. These sites are trying to move to gLite 3.1 in the coming few weeks. Cambridge reported unresolved problems in the previous 3.1 integration with Condor which caused an upgrade attempt in Q408 to fail. RAL is ready to migrate its last gLite 3.0 CE which is used for "small VOs". SA3 are waiting on feedback from sites before taking the proposal to the EGEE Technical Management Board. We need to see how the three sites mentioned get on this week with the upgrades before stating a final UK position.

2) A review of site interventions to correct/remove inefficient jobs shows that very few GridPP sites are actively doing this at the present time. Since there is still overall under-utilisation of CPU this is not an immediate concern, but site feedback on these jobs can be valuable for users to correct bad jobs. Some sites have implemented a monitoring system using MonAmi which has shown potential to automate the process of spotting these problem jobs, but issues with the batch system reporting mean that we may gain little by recommending other sites to adopt this system at the current time. However, with ATLAS reconsidering if they actually need 72hr queues (note Martin will be following this up for the T1 after a brief discussion at the Tier-1 strategy meeting last week) this area is one that will be further investigated.

3) UKI overall EGEE availability in January was 95% and reliability 96%. Two sites were called to account for not meeting EGEE targets during January.
The accounts are:

UKI-LT2-UCL-CENTRAL (47% availability: 55% reliability):

- an operational fairshare system was not in place in January as the site had recently brought new clusters online. As a consequence grid jobs got stuck in long queues so the SAM tests timed out although work was being processed slowly.
- the site has now implemented multiple queues and recent results validate the improvement.

UKI-SOUTHGRID-BHAM-HEP (56% availability: 96% reliability):

- in December the site had a number of problems following GPFS errors caused by the loss of a disk MBR. There have been some residual problems with software areas but these should not have affected ops.
- The main problem in availability results from the loss of a hardware switch (December 26th) connecting the WNs of the site eScience cluster. A new switch was provided on January 13th - the delay being increased by the vacation period (i.e. people being away).
- The problems have all been resolved and it should be noted that throughout both problems mentioned the site continued to run work successfully on its second cluster.

We are not yet sure why the second Birmingham cluster did not allow the site to continue passing ops tests! There has also been concern this week that a new node marked "not in production" in the GOCDB still receives experiment analysis jobs. This is being investigated. JC to follow up these problems/issues at dTeam.

ACTION 337.6 JC to follow-up at dTeam the problems/issues of two sites (UCL & Birmingham) not meeting EGEE targets.

4) At the last deployment team meeting, staffing levels in LT2 were mentioned as a concern. QMUL and RHUL do not currently have full time site admins. The LeSC admin is leaving very soon. Against this and the LT2 availability during Q408, it still managed to contribute the most CPU processing of the Tier-2s. Recruitment is underway to fill posts so this is an area to watch rather than be alarmed about today.
SouthGrid is also seeing changes with John Wakelin leaving Bristol soon and Yves Coppens from Birmingham.

SI-6 LCG Management Board Report
---------------------------------

DB noted that there had been a discussion on reporting capacity. WLCG were trying to get automatic measurement of installed capacity - OSG cannot do this. There had been an SRM release problem.

SI-7 Dissemination Report
--------------------------

SP reported that Neasan O'Neill was preparing for the EGEE User Forum with a UKI stand. He was also working on a press release for EGEE. Gridguide.org was now available and online. If any UK sites want to be featured on this site they should contact SP. SP advised that Gridguide.org was a dissemination tool containing information about 20 sites at present. It can be highlighted at conferences.

REVIEW OF ACTIONS
=================

332.1 AS to provide a plan for the tape drives, given the new information from IB - a detailed plan was required immediately, showing minimum spend relating to no data etc.
O N G O I N G.

332.3 PC to pursue the issue of the network resilient link - providing installation costs and annual costs, and report-back to the PMB.
O N G O I N G.

333.3 JC and DC to write a paragraph which summarises the Tier-2 and Tier-3 positions. This need not come back to the PMB for discussion, but can be circulated once done. Response from DC awaited. A summary should be written down so that we have something formal to refer to. JC reported that he had iterated with DC and a paragraph was currently with DC for approval.
O N G O I N G.

334.1 ALL: to provide early drafts for the Quarterly Reports. Required immediately please. The PMB were warned that individuals would be 'named & shamed' next week.
D O N E, item closed.

ACTION 337.7 DC to provide the CMS report to SP for the Quarterly Reports; A McNab to provide the Security Report.

334.2 RJ to confirm by email when the proposed three days for GridPP25 at Ambleside had been booked.
D O N E, item closed.

336.1 JC to document procedure for ensuring black-listed sites are re-instated.
O N G O I N G.

ACTIONS AS AT 09.02.09
======================

332.1 AS to provide a plan for the tape drives, given the new information from IB - a detailed plan was required immediately, showing minimum spend relating to no data etc.

332.3 PC to pursue the issue of the network resilient link - providing installation costs and annual costs, and report-back to the PMB.

333.3 JC and DC to write a paragraph which summarises the Tier-2 and Tier-3 positions. This need not come back to the PMB for discussion, but can be circulated once done. Response from DC awaited. A summary should be written down so that we have something formal to refer to. JC reported that he had iterated with DC and a paragraph was currently with DC for approval.

336.1 JC to document procedure for ensuring black-listed sites are re-instated. JC noted that experiments have different ways of 'blacklisting' (re top level results & switches etc).

337.1 DB to contact Malcolm Booy in relation to 'revised MoU costs' item and inform him that the error in the figures did not relate to 'costs' at all, and also did not affect the conclusion reached - the item should be withdrawn from the OC actions list.

337.2 DB to circulate the policy information re the prioritisation process to the PMB.

337.3 JC & JG to form a Working Group with Andy Richards and Dave Wallom to define which Grid services could be run by a UK NGI post April 2011.

337.4 JG to circulate details of current plan and progress in relation to FTE effort on storage at RAL.

337.5 SP to request that, via Neasan O'Neill, the web page on KE and EI be made more prominent, and maintained.

337.6 JC to follow-up at dTeam the problems/issues of two sites (UCL & Birmingham) not meeting EGEE targets.

337.7 DC to provide the CMS report to SP for the Quarterly Reports; A McNab to provide the Security Report.
INACTIVE CATEGORY
=================

282.8 RM to monitor how R-GMA and networking issues impact on GridPP as matters progress. RM advised that this item should be moved to the 'inactive' category as it will develop over the coming months. RM discussed the issue with Steve Fisher and advised that support of R-GMA is required whilst APEL is dependent on it. RM reported that he has spoken to SF and there is currently no change to the R-GMA situation - process ongoing. RM advised that a small amount of effort was going into R-GMA on APEL but for the long term he wasn't sure. The item needed to be kept here for review from time to time, and required to be re-visited around Easter 2009.

The meeting closed at 2.10 pm. It was noted that next week TD would Chair. DB would be in contact by mobile.

The next PMB would take place on Monday 16th February at 12.55 pm.