GridPP PMB Minutes 335 - 26th January 2009
===========================================

Present: David Britton (Chair), Tony Doyle, Sarah Pearce, Roger Jones, Jeremy Coles, Steve Lloyd, Robin Middleton, Glenn Patrick, John Gordon, Andrew Sansum, Tony Cass (Suzanne Scott - Minutes)

Apologies: David Kelsey, Pete Clarke, Dave Colling, Neil Geddes

1. International Review of e-Science
=====================================

There was a discussion regarding the nomination of candidates. It was noted that the PMB had approached and received agreement from two candidates for nomination. It was understood that anyone else who wished to could also nominate an individual.

2. Quarterly Reports
=====================

SP reported that the deadline was last Friday. She had received apologies from JC, and an initial draft from TC. SP asked everyone to send early versions immediately.

ACTION 334.1 ALL: to provide early drafts for the Quarterly Reports. Required immediately please.

3. GridPP22 & 25
=================

DB had circulated a draft Agenda programme for GridPP22. The programme was agreed. RJ would confirm by email once the three days for GridPP25 at Ambleside had been booked.

ACTION 334.2 RJ to confirm by email when the proposed three days for GridPP25 at Ambleside had been booked.

4. EGI
=======

NG had circulated an email with notes from the EGI meeting. It was noted that JG had been elected Chair of the Selection Committee, which would decide the location of EGI.org.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's report
-----------------------------

AS provided the following report:

Fabric:

1) Our assessment continues to be that the machine room will be available on February 9th. I circulated a status update to the PMB last Wednesday. There remain a few show stoppers, but progress is being made in other areas and our assessment remains the same (80% likely to be available by 9th February).

2) We plan to deliver the disk as early as possible into the new machine room.
However, acceptance will not be completed until after the end of the FY and thus VAT will be accrued into next FY. This is reflected in our revised Outturn Forecast to STFC.

3) Delivery of the new robot will now be scheduled to complete as soon as the building becomes available.

4) We plan to deliver the CPU into the new machine room in late February. However, acceptance will not be completed until after the end of the FY and thus VAT will be accrued into next FY. This is reflected in our revised Outturn Forecast to STFC.

5) I have been working on tape drive and capacity planning. New roadmaps for the robot will be available to us this week. My initial assessment is that tape drives are definitely not required this FY. Various options exist in future years and careful consideration of these will be needed before new plans are formed.

6) Purchasing of the remaining items on the spend plan is progressing.

Staff:

1) Ian Collier has started as senior systems admin in the Fabric team.

2) We failed to appoint the Production team member and recruitment will commence again this week. The internal candidate is not interested in this role.

3) We are working on the paperwork for the EGEE funded position (PPS) and are in discussion with an internal candidate.

4) The Experiment Support posts have been shortlisted. Interviews will be held on 2/3 February. DC, RJ, AS & NMcC will be interviewing.

Service:

1) SAM availability last week was 100%.

2) CASTOR

a) RAL performed well in the ATLAS 10M test. Recent changes to the SRM configuration (more capacity, timeout changes, bug fixes) made a huge difference, and most problems seen at the RAL end were now elsewhere than the SRMs. During the test (after a couple of days) we changed our FTS parameters, doubling the active files, and CASTOR proved able to accept data as fast as it could be delivered. Inbound went very well.
Outbound, our data quality was poorer, possibly because other sites' FTS timeouts were rather short to accommodate the substantial (40-60 second) latency coming from our CASTOR.

b) The big ID problem has been seen at CERN and ASGC (and again at RAL) and investigations are underway once more.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ provided the following report:

A new RAL-specific data mover (using the local rfcp rather than the SRM) has been written, and seems to work better. We could say this with more confidence, but the storage fell over mid test (see below). The ATLAS OSC in their final report noted with concern that we had been driven to a site-specific patch for the production system at RAL; they are concerned this is the first chink, and that we may end up with something impossible to maintain. I stress this came from their deliberations, not from any slant in my report. The patch currently under test could be used at ASGC and CERN.

Graeme thinks we need to consider pcache to cache the database release in reprocessing, but this will require the Tier 1 allowing us 5-10GB on each Worker Node. AS noted that this did not need to come to the PMB for approval; he would iterate with RJ and could organise this.

The back end for Castor fell over once again with the Oracle large integer error. This has now also been seen at CERN, so one might hope it gets more attention from Oracle and a fix that really fixes it this time. The RAL response to the outage was quite swift.

Three sites in the UK have not been in ATLAS production: LIV, BHAM, UCL. Liverpool was out immediately before Xmas with a hardware failure on the DPM; while it was fixed quite quickly, it is not clear that anyone was told. This is now fixed. BHAM were off with GPGS on the eScience cluster. These seem now to be resolved, and they have been put back in. UCL is off because there are no defined SRM space tokens to write into that we know of; dialogue continues.
Note, the absence of the three UK sites was noticed by ATLAS-UK, not raised by the sites.

There is still no torrent of production facing us. After a few ATLAS internal problems were addressed at the start of last week, the T2 started to fill up. We have 4.8k jobs running now, and 40k jobs still to be done, which will take a couple of days. The meetings last week really made it clear to the MC team that there is urgency to use resources while they are available.

RJ advised of possible modifications to the ATLAS computing model, which he would formalise to the PMB in due course.

SI-3 CMS weekly review & plans
-------------------------------

DC was absent.

SI-4 LHCb weekly review & plans
--------------------------------

GP provided the following report:

1. Failures running SAM jobs at Birmingham - identified to be a problem with a single worker node. That has been taken offline and jobs are running fine again.

2. Permissions problems at various sites - being followed up with GGUS tickets (as requested by the Tier-2 administrators).

3. Progress in understanding CNAF problems after escalation. But CNAF is not yet in the LHCb mask.

Outlook:

1. Mostly test/dummy productions. Some user jobs.

2. FEST09 starting w/b 26 January. Likely to mainly involve testing of the HLT injection mechanism and saving data to Tier-0. For now, do not expect any activity at the Tier-1s (though that may change if things go extremely well early in the week).

SI-5 Production Manager's report
---------------------------------

JC provided the following report:

1) An experience at BHAM recently highlights a problem currently faced when a site implements multiple CEs. When the experiment SAM tests detect a failure at the site on one CE, they blacklist the whole site even though the other CEs (and the queues) are operational. We also have a related monitoring issue at Brunel this week in that the site has nodes that are only ever used for test (e.g.
upgrades) but need to be in production to be part of the testing framework. Unfortunately these lead to "degraded" warnings in GridMap, though SAM is not affected. Is the "image" from GridMap worth the effort of seeking change in this area? Should the experiment blacklisting be finer grained?

2) There was a problem recently with a VOMS upgrade (a key was not replaced after a certificate change) which led to about 24 hours of disruption for GridPP VOMS hosted VOs. Several VOs were in contact, including phenogrid, supernemo, camont and mice. An incident report is currently being compiled by the team at Manchester and will be in the wiki shortly. One area of concern is that this would have impacted sites outside the UK, as some of the VOs are no longer just nationally supported. This means mails to TB-SUPPORT and GridPP-Users are not sufficient for communications. Also, some services were slow to adopt the new GridPP VOMS certificate, which means for example that jobs to some WMSes failed after the event was otherwise resolved.

DB noted that in order to reach non-UK users, the message should go through the EGEE broadcast to VOs (Managers). This was the best propagation outside of the UK, rather than using the worldwide list.

Does anyone on the PMB have additional comments they would like included in the upcoming review?

3) At January's GDB (http://indico.cern.ch/conferenceDisplay.py?confId=45461):

- There was a reminder that there will be a WLCG workshop the weekend before CHEP in March. Do we know who from GridPP is going to CHEP?

- The LHC delay (till physics) will be at least until September, but the actual dates will only be decided after the machine group meet in February. The delay means a slight relaxing on MoU requirements for hardware being in place. Has GridPP made clear its position?

It was noted that the LHC delay had been discussed at the MB. There had been a proposal of relaxation of requirements, which was in abeyance until after Chamonix.
The GridPP 2009 (April) hardware pledge was unaffected. The next procurement was still to be decided upon. DB noted that GridPP had made its position clear at the MB.

- The benchmarking working group have now agreed a conversion factor of 4 between SpecInt2K and Spec2006.

- Markus Schulz talked about middleware change management and a plan to have multiple software repositories to allow rollback. *The deployment team are very concerned* about this method of rollback, which would leave sites that upgrade to a bad release stranded. The matter will be discussed at Thursday's UKI meeting and a response sent to Markus.

- An SL5 WN release is expected in February. ALICE and CMS were keen to move. Would the experiment reps on the PMB comment on their expectations with regard to moving to SL5? What timeline should we be looking at in the UK?

SI-6 LCG Management Board Report
---------------------------------

DB noted the following issues which had been discussed:

- SCAS was still in progress; a patch was missing for configuration.

- The big ID CASTOR problem had now been seen at CERN and ASGC in addition to RAL. This means that it is more likely to be solved.

- dCache instability has affected some sites.

- Jamie proposed that there should be reviews of the Tier-1s before data-taking (JG, Jamie & Ian will discuss - a rolling 'peer review' of Tier-1s was proposed). JG noted this process related to readiness for data-taking.

- A proposal had been agreed at the MB requiring action at the Tier-1 if it was down for more than one week.

- The Quarterly Report from ALICE requested additional WMS throughout the world and support for ALICE in the UK re the Cream CE. It was noted that Derek had installed the Cream CE last week at RAL, but ALICE weren't using it yet.

- The LHCb Quarterly Report was given, but with no particular comments with respect to the UK.

SI-7 Dissemination Report
--------------------------

Re CHEP, SP noted we had had a stand in previous years - it was good for raising profile.
Neasan O'Neill had contacted the organisers and had obtained 50% off, relating to a 9m2 stand including 2 x Registrations for Euro 1000. DB asked about manpower. SP noted that NO could go. TD advised that a list of people was required in advance. This was agreed. NO needed to know who was attending in order to sort out a rota in advance.

SP asked about the new RAL machine room - NO had emailed JG and AS re an opening event. It was noted that one may happen but there was no info as yet. It would be a good idea for publicity to hold some kind of event.

REVIEW OF ACTIONS
=================

322.4 DB to follow up email with information on what Universities classify as consumables (in relation to GridPP using a figure of 5% as a guide to grantholders, regarding the small level of consumables allowed on the Tier-2 Grants). O N G O I N G

327.1 AS to ask RJ and other experiment reps to confirm bandwidth and storage requirement numbers, as experiment requirements were changing. AS still to report on this. AS had started work on this, emailed Kors and spoken to Raja. D O N E, item closed.

332.1 AS to provide a plan for the tape drives, given the new information from IB - a detailed plan was required immediately, showing minimum spend relating to no data etc. O N G O I N G.

332.3 PC to pursue the issue of the network resilient link - providing installation costs and annual costs, and report back to the PMB.

333.1 DB to contact Les Robertson and see if he is willing to be nominated for the EPSRC Review Panel, to represent GridPP. D O N E, item closed.

333.2 DK to contact Jens Jensen about the Technical Advisory Group (TAG) - the Group needed to be formalised and an intermittent email contact was not sufficient. Quarterly updates were required. The Group must check that the CA is dealing with issues relating to the Machine Room move. D O N E, item closed.

333.3 JC and DC to write a paragraph which summarises the Tier-2 and Tier-3 positions.
This need not come back to the PMB for discussion, but can be circulated once done. O N G O I N G.

ACTIONS AS AT 26.01.09
======================

322.4 DB to follow up email with information on what Universities classify as consumables (in relation to GridPP using a figure of 5% as a guide to grantholders, regarding the small level of consumables allowed on the Tier-2 Grants).

332.1 AS to provide a plan for the tape drives, given the new information from IB - a detailed plan was required immediately, showing minimum spend relating to no data etc.

332.3 PC to pursue the issue of the network resilient link - providing installation costs and annual costs, and report back to the PMB.

333.3 JC and DC to write a paragraph which summarises the Tier-2 and Tier-3 positions. This need not come back to the PMB for discussion, but can be circulated once done. Response from DC awaited.

334.1 ALL: to provide early drafts for the Quarterly Reports. Required immediately please.

334.2 RJ to confirm by email when the proposed three days for GridPP25 at Ambleside had been booked.

INACTIVE CATEGORY
================

282.8 RM to monitor how R-GMA and networking issues impact on GridPP as matters progress. RM advised that this item should be moved to the 'inactive' category as it will develop over the coming months. RM discussed the issue with Steve Fisher and advised that support of R-GMA is required whilst APEL is dependent on it. RM reported that he has spoken to SF and there is currently no change to the R-GMA situation - process ongoing. RM advised that a small amount of effort was going into R-GMA on APEL, but for the long term he wasn't sure. The item needed to be kept here for review from time to time, and required to be re-visited around Easter 2009.

The next PMB would be on Monday 2nd February at 12.55 pm.