GridPP PMB Minutes 342 - 23rd March 2009
========================================

Present: David Britton (Chair), Tony Doyle, Sarah Pearce, Andrew Sansum, Dave Colling, Roger Jones, Glenn Patrick (Suzanne Scott - Minutes)

Apologies: David Kelsey, Steve Lloyd, Tony Cass, Robin Middleton, John Gordon, Jeremy Coles, Neil Geddes, Pete Clarke

1. Tier-1 Posts
================
DB reported that there was a moratorium on RAL PPD posts which had been communicated to Group Leaders. This meant that there was a further threat of delay for the 2nd experiment support post at RAL. After discussion with Norman McCubbin, DB would contact Richard Wade direct. It was noted that the post had originally been planned to begin in September 2007. DB would circulate a draft letter.

ACTION: 342.1 DB to circulate a draft letter from GridPP to Richard Wade in relation to the pending 2nd experiment support post at RAL [done following the meeting].

2. F2F Agenda
==============
SP had circulated a draft Agenda for the PMB F2F meeting at GridPP22 @ UCL - any comments/changes to be advised to SP. DB advised that we needed to start thinking about the shape of GridPP4 - what would a proposal look like? It would be ideal to discuss this issue prior to lunch on the Tuesday. Furthermore, DB asked whether there were any issues arising from the Quarterly Reports? Any particular areas to target shown by the Project Map? SP agreed to look at this and report-back. It was noted that Dave Wallom would not be attending.

ACTION: 342.2 SP to add-in a morning discussion time about GridPP4 to the F2F meeting; also to check if there were any issues arising from the Quarterly Reports that needed to be discussed.

3. CERN Network Maintenance
============================
DB asked if there had been any issues arising from this? AS reported no real problems - they had lost the monitoring infrastructure but it had all worked fine on restart.

4. Week's Notes
================
DB advised that Cambridge had finally been confirmed for Clare College for GridPP23.

Boston Ltd, sponsors of GridPP22 at UCL, would be raffling a laptop at the Collaboration Dinner.

Re the NGS Board Meeting at Birmingham, it was noted that Tier-1 had now applied to become an affiliate member of NGS. DB asked what was the measure of success for sites becoming members? And the impact on users? AS advised that it wouldn't be a drain on resources as there were no users. Were the VOs to be approved only GridPP VOs? DB advised that it would be useful to discuss this issue with Andy Richards when at UCL.

DB advised that the NGS website might be revamped - there should be a link between GridPP & NGS in terms of user support and consistency of info - this had been suggested to Dave Ferguson and Neasan O'Neill.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------
AS reported as follows:

Fabric:
1) R89 Machine Room. The situation is evolving rapidly and we may have more news shortly.
2) Migration to R89. Planned for approximately 2 weeks commencing 22nd June (provided R89 is available). Possible alternative plan B date for 1 week migration of critical components only - commencing 6th July. The building must be accepted by 1st May in order for us to schedule the machine room migration for 22nd June (our planned - latest possible date).
3) Disk and robotics deliveries are pending on R89 availability. CPU deliveries will soon be in the same situation. We plan to guarantee delivery of disk, CPU and robotics into either R89 or ATLAS no later than 30 May 2009. We are commencing work to install power in the ATLAS centre in order to be able to do so.
4) Purchasing of remaining items on spend plan is progressing well.

Staff:
1) Recruitments outstanding:
   a) 3rd team member of production team has started today (23rd March).
   b) Experiment Support posts (50% funded by T1 and 50% by experiments).
      Here covering T1 effort only. Recently interviewed.
      i) 0.5 FTE. Just about to make one formal offer subject to work permit. Outstanding issues resolved.
      ii) Did not offer on second position, but are now in informal discussion with likely applicant. Further complications owing to PPD recruitment freeze - it's unclear how this recruitment can be progressed at present.
   c) Final drafts for the post to recruit additional CASTOR effort are underway.

Service:
1) SAM availability last week was 99%.
2) CASTOR
   a) We had a major CASTOR failure this morning (now restarted). Fault relates to a physical problem in the database hardware. Investigations underway.
   b) We are making little progress chasing the big ID problem, but are continuing actively to pursue this problem. It continues to impact the service.
   c) We have deployed a gridftp process killer to kill hanging gridftp transfers (particularly impacted LHCb).
   d) We have ceased to actively pursue the crosstalk problem (text from Matt Viljoen):
      In September 2008, 14,000 files were accidentally deleted due to a new ORACLE problem dubbed crosstalk. For details see http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20080917. This was triggered by the CASTOR catalogue synchronization process which makes file lists internally consistent within CASTOR and prevents a buildup of dark files - old junk files which should be removed. A service request was raised with ORACLE ("SQL executing in wrong schema", SR 19633370.6) but since the bug was not reproducible nothing was done about the service request. As a result of this incident, synchronization was turned off and work has been done to try to recreate this on the pre-production CASTOR system at RAL. This has so far been unsuccessful, and as we approach data taking it is becoming increasingly important for us to turn synchronization back on so we do not have storage issues in the future.
      CERN have proposed a simple fix in CASTOR that turns off synchronization between the stager and the diskserver (which resulted in the files being deleted) but keeps synchronization between the stager and the nameserver running (which removes dark files). This fix has been introduced in the latest 2.1.8 version. Should we upgrade to 2.1.8 before data taking, we will have this fix, and if we do not, CERN will backport the fix to 2.1.7. In any case, not spending more time trying to recreate the crosstalk problem will enable us to channel more effort into preparing for data taking and upgrading to 2.1.8.
   e) A major hardware upgrade will be needed on the CASTOR core database hardware. No date has been fixed (pending tests).
4) WMS Instability - A cron restarter now minimises the impact of WMS problems.
5) The top level BDII failed on Friday - it was down for about 1 hour.
6) We have applied for NGS associate member status.
7) A user satisfaction survey will be posted this week.

SI-2 ATLAS weekly review & plans
---------------------------------
RJ reported that there had been an interruption with the BDII; reprocessing hadn't started yet - there was a two-day delay; nothing further at present.

SI-3 CMS weekly review & plans
-------------------------------
DC reported that parallel meetings were happening; CMS would stick to the current reporting format and the Quarterly Report format submitted. There had been an issue with the Tier-2s in terms of dCache not being scalable and there had been issues running jobs. There had been difficulty in coping with a peak of 1200 jobs, which required upgrading of the headnode and versions - this might mean a possible move to Chimera - meantime they would be limiting the number of analysis jobs to 500ish. There was a parallel process with DPM and separation of services. DB advised that the report format was good and should be continued likewise.
SI-4 LHCb weekly review & plans
--------------------------------
GP reported as follows:

Not a lot to report because of CHEP, etc. Main points:
1) Job successfully ran LHCb application at NIKHEF using glexec.
2) Castor at RAL. Preference of LHCb is for RAL to wait until CERN deploys version 2.1.8 in full production and it is found to be stable there.

Outlook: User analysis/production.

SI-5 Production Manager's Report
---------------------------------
JC was absent, brief report:

At the WLCG workshop yesterday there was talk of a CCRC09 - though CMS request it to be called STEP09. The proposal is to try and run many parallel activities in June. The exact dates have yet to be agreed but the main work is likely in the 2nd week of June. Have a look at the slides from Kors and Matthias at 15:00 here: http://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=16861

SI-6 LCG Management Board Report
---------------------------------
DB had not attended.

SI-7 Dissemination Report
--------------------------
SP reported that Neasan O'Neill was at CHEP. The stand backdrop had not yet arrived but the screen with the RTM was there. SP advised that they had recruited a new employee to replace Ms Burne who would be going on maternity leave.

REVIEW OF ACTIONS
=================
332.1 AS to provide a plan for the tape drives: The experiments would be producing new numbers by ~end of March which would enable this action to progress. O N G O I N G.

332.3 PC to pursue the issue of the network resilient link - providing installation costs and annual costs, and report-back to the PMB. PC had sent info round, and Robin Tasker would provide further info at GridPP 22 UCL. The issue would need to be referred back to the OC. O N G O I N G.

336.1 JC to document procedure for ensuring black-listed sites are re-instated. JC noted that experiments have different ways of 'blacklisting' (re top level results & switches etc). O N G O I N G.

337.4 JG to circulate details of current plan and progress in relation to FTE effort on storage at RAL. O N G O I N G. [Done following the meeting].

339.8 JC to follow-up VO Registration cards. JC reported that some VOs need to be decommissioned; there were also VOs at institute-level but these do not appear on the CIC operations portal. O N G O I N G.

341.1 AS to review resilience of services that may have to remain in the ATLAS building. On hold at present until further info available.

341.2 RM to contact Ben Waugh about room and phone/video facilities for this meeting. Done, item closed.

341.3 NG to consult/inform GridPP PMB on responses to EU questionnaires. O N G O I N G.

341.4 JG to establish new date for the wLCG-ORACLE meeting. O N G O I N G.

341.5 JC to investigate how EGEE VOs request resources (relates to enabling more VOs in the UK). O N G O I N G.

341.6 SP to relate milestones/metrics to individuals where possible - this was explicitly for storage, and would be done once the new person was in post. DB noted it was a good idea generally. SP already has 'responsible' people noted on the Project Map. Done, item closed.

341.7 SP to chase non-functioning of GridMon with Robin Tasker and report back. O N G O I N G.

341.8 JC to suggest a more specific agenda item for the UCL DB meeting about sharing T3 resources with T2. Done, item closed.

ACTIONS AS AT 23.03.09
======================
332.1 AS to provide a plan for the tape drives: The experiments would be producing new numbers by ~end of March which would enable this action to progress.

332.3 PC to pursue the issue of the network resilient link - providing installation costs and annual costs, and report-back to the PMB. PC had sent info round, and Robin Tasker would provide further info at GridPP 22 UCL. The issue would need to be referred back to the OC.

336.1 JC to document procedure for ensuring black-listed sites are re-instated.
JC noted that experiments have different ways of 'blacklisting' (re top level results & switches etc).

339.8 JC to follow-up VO Registration cards. JC reported that some VOs need to be decommissioned; there were also VOs at institute-level but these do not appear on the CIC operations portal.

341.1 AS to review resilience of services that may have to remain in the ATLAS building.

341.3 NG to consult/inform GridPP PMB on responses to EU questionnaires.

341.4 JG to establish new date for the wLCG-ORACLE meeting.

341.5 JC to investigate how EGEE VOs request resources (relates to enabling more VOs in the UK).

341.7 SP to chase non-functioning of GridMon with Robin Tasker and report back.

342.1 DB to circulate a draft letter from GridPP to Richard Wade in relation to the pending 2nd experiment support post at RAL [done following the meeting].

342.2 SP to add-in a morning discussion time about GridPP4 to the F2F meeting; also to check if there were any issues arising from the Quarterly Reports that needed to be discussed.

INACTIVE CATEGORY
=================
282.8 RM to monitor how R-GMA and networking issues impact on GridPP as matters progress. RM advised that this item should be moved to the 'inactive' category as it will develop over the coming months. RM discussed the issue with Steve Fisher and advised that support of R-GMA is required whilst APEL is dependent on it. RM reported that he has spoken to SF and there is currently no change to the R-GMA situation - process ongoing. RM advised that a small amount of effort was going into R-GMA on APEL but for the long term he wasn't sure. The item needed to be kept here for review from time to time, and required to be re-visited around Easter 2009.

The meeting closed at 2:00 pm. There would be NO MEETING on April 6th (just following UCL). The next PMB would take place on Monday 20th April (following the Easter weekend break).