GridPP PMB Minutes 347 - 14th May 2009
========================================

Present: David Britton (Chair; minutes), Sarah Pearce, Andrew Sansum, Glenn Patrick, Robin Middleton, Jeremy Coles, John Gordon, Steve Lloyd, Roger Jones, Dave Colling.

Apologies: David Kelsey, Tony Cass, Tony Doyle, Pete Clarke, Neil Geddes.

1. JRU and EGEE PMB Meetings
=============================

RM reported on EU meetings last week. At the JRU on Friday, NG gave an update on the status of NGIs in Europe. Many are in a similar situation to the UK and are using their JRUs as an interim body. In the UK, JISC is happy for the NGS to morph to become the NGI. STFC are happy for GridPP to use up to 10k euros from the Working Allowance towards the EGI start-up costs; the remaining contribution will come from JISC and buys the UK a voice.

There was a workshop on SSCs (Specialized Service Centers) in Athens. There will be an HEP SSC, possibly involving CERN, INFN and DESY. The question of which other SSCs the UK should be involved with was raised. JG remarked that more science-based people need to be involved with the SSCs.

There was no decision yet to go to a no-cost extension for EGEE but this is still under review.

2. GDB Issues
==============

JG reported on the GDB and Pre-GDB meetings. Most countries indicate that they can continue to provide the Grid Services required by the LHC in the absence of EGEE. However, it is clear that the loss of EGEE posts is likely to affect the level or quality of those services.

The transition to SL5 was discussed and is complicated. JG is actioned to write this up for the wLCG MB. The experiments have a variable degree of readiness for SL5, with ALICE being the most advanced and ATLAS perhaps the least. All experiments want to move but it will not be before STEP09. This suggests July/August, but the preference is to do it using a new CE, so a choice remains for the experiments. It is then a question of balancing the resources behind each of the CEs to meet load.
SL commented that the lack of SL5 build machines has compromised ATLAS progress.

Expected data-flows were presented at the GDB. Some of these numbers are still large and further refinement is expected, particularly in advance of the RRB in June (18th).

Sites were asked about SCAS/GLEXEC testing. Apparently the problem is now with GLEXEC rather than SCAS and will be raised at the MB.

Other issues touched upon were the reorganization of the ticket triage system and the metrics for STEP09.

3. Tier-1 Hardware Schedule
============================

STFC requested information on the Tier-1 HW spend schedule in light of the new LHC schedule and the delay in moving to the new machine room at RAL. DB had drafted a statement to STFC which was discussed during the PMB. In essence, GridPP makes the HW purchase towards the end of each Financial Year in order to meet the April pledge dates to the LHC. Currently, there is no agreement between the CERN Resource Review Board, the Resources Scrutiny Group, and the Experiments on the implications of the new schedule. This will be resolved by early summer, but GridPP does not currently have the information necessary to re-assess, and has thus started to initiate the next purchase as planned but with an option to delay at some point.

Given the request from STFC, and cognizant of funding issues in the UK, GridPP will now consider splitting the next Tier-1 Hardware purchase into two tranches to arrive by April and by September 2010. The first of these tranches would be paid for in FY09 as planned but the second could now be paid for in FY10.

ACTION: DB to respond to STFC w.r.t. the Tier-1 Hardware schedule.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS reported as follows:

Fabric
======

1) Migration to R89 is expected to commence on Monday 22nd June and take 2 weeks. A detailed shutdown plan has been circulated to the experiments.
I will circulate it to the PMB separately and a blog item is being produced.
2) Disk, CPU and robotics deliveries are being scheduled. We expect to install w/b 18th May.
3) The network change to the core network attempted on Tuesday 13th May failed after the routers responded unexpectedly. A further attempt will be made in w/b 25 May - downtime may be several hours to allow more time for testing/remedial action.
4) The site needs to carry out a major network upgrade (to remedy problems they have had for over a month). It will take over an hour (maybe more) and cannot wait for the Tier-1 migration to be in progress. We are trying to avoid STEP but this may not be possible.
5) CERN have prioritised T0-T1 traffic over T1-T1 traffic.
6) We expect to start bad block scrubbing on the disk arrays shortly. We expect that this will reduce the incidence of multi-drive failures but will have a long term impact on performance. In the short term there is likely to be a spate of drive failures as the scrubbing uncovers undetected problems.

Staffing
========

1) The first experiment support post has been accepted. The second post is advertised.
2) The EGEE PPS recruitment failed and we are seeking authorisation from STFC to re-advertise.
3) The YII student (funded by ESC) is expected to start in July.
4) The CASTOR d/b admin is approved!

Service
=======

1) SAM availability last week was 98%.
2) CASTOR
   a) The ORACLE database RAID array upgrade tests have now been carried out. The upgrade is expected to take 2 days; it is scheduled for 18-19th May and will affect (downtime) all CASTOR instances.
   b) Work continues locally on a BIGID workaround. Oracle have reproduced the problem.
3) The FTS and LFC upgrade went well, faster than advertised and with no known problems.
4) The reconfiguration of the CEs ran into problems when a change of gid for CMS users caused a permission problem on CASTOR (which does not support secondary groups). CMS work and availability was impacted.
A workaround has been put in place and the remaining upgrades rescheduled.
5) Work on an SL5 service is underway. A test service has been advertised to the experiments.
6) Production operations are gradually becoming more reliable. For the first time we exceeded 7 days with no high priority (pager) callouts day or night.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ's report is at https://www.gridpp.ac.uk/pmb/WeeklyReports/ATLASReport11May09.pdf

Table is taken from: http://dashb-atlas-prodsys.cern.ch/dashboard/request.py/summary?cloud=RAL&grouping=site&grouping=cloud

Glasgow and Edinburgh are being under-reported by 50% in this table. (This bug is being fixed.)

STEP09 preparations are progressing; the targets were due yesterday, but I do not have them yet.

There is a generic discussion about long weekends and the expert on call. There are problems contacting people at Tier-2s (worldwide) at the weekend, and some sites, such as those that process calibration data, can have problems. The expectations of the production team may be unrealistic (=beyond the MoU) concerning T2 response during data taking!

We are looking into raising the cap for AOD sizes to 8-10 GB to improve transfer times. We are checking with the FTS team whether a more flexible timeout mechanism can be implemented.

We are planning how to respond if a Tier 1 storage is down for 2 weeks+ and how data gets distributed to the Tier 2s in the Cloud. No conclusion reached yet!

On 8/5/09 the DQ2 server was down. It affected a lot of users and it was a useful learning experience, especially regarding the way things were communicated between developers and users.
SI-3 CMS weekly review & plans
-------------------------------

DC's report is at: https://www.gridpp.ac.uk/pmb/WeeklyReports/CMS%20Summary%2014%20may%2009.pdf

Summary of RAL activity given to the FacOps meeting on Monday:

Not a good week. The update of one of the CEs the week before last introduced pool accounts for production roles. Unfortunately we didn't realise that CASTOR does not make use of the accounts' secondary groups (dCache is the same), so the CMS Production accounts were not part of CMS as far as CASTOR was concerned and stageout started failing. It took us a while to actually work out what the problem was, and several attempts to get the mappings and namespace permissions correct. Are other Tier 1s using pool accounts for production? Have they seen similar issues?

SI-4 LHCb weekly review & plans
--------------------------------

GP reported as follows:

Last week 20M events were produced and merged into 5GB files. Merging was fine at RAL, but problems were seen at CNAF (GPFS/Storm) and GridKa (dCache + 4GB file-size limit on w.nodes). "Long" opening of files at RAL Castor is still under investigation.

Some problems this week with ~60% of LHCb jobs stalling - looks like it was an LSF problem on Castor causing problems on ~60% of LHCb servers (discovered Wed night).

The CE in front of the test SL5 service at RAL (lcgce07.gridpp.rl.ac.uk) was made available on Tuesday (12th) this week. LHCb found the same problem with missing library dependencies (/usr/lib64/libldap-2.2.so.7) as at CERN and GridKa.

DIRAC3 accounting (from 1 Jan) shows UK sites (42%) have been by far the largest CPU resource for LHCb. France was next (14%).

Outlook: Analysis and understanding of the 20M events produced last week. Once events are certified, MC09 production of 10**9 events can start up. New versions of LHCb application software to be released - will also be compiled on SL5 at CERN. Work on understanding compatibility libraries ongoing over next few weeks.
Work ongoing to test pilot jobs on SL5 worker nodes at GridKa (and maybe RAL). This week is also a Fest week and the first data was "injected" into the processing chain on Wednesday (13th).

SI-5 Production Manager's Report
---------------------------------

1) Several ATLAS Tier-2 sites are running out of mcdisk space. Current proposals are that disk at these sites is increased by 5-10TB, but the ATLAS UK team know that deletion of some data ahead of MC09 is going to be required. ATLAS will be deleting unmerged AOD now that merged datasets are available. I believe this is in hand (a review of resources on a site-by-site basis is about to start), but the PMB should be aware that certain resources are saturated.
2) For STEP09, discussion on Tuesday concluded that for meaningful results sites would be best advised to wipe their maui/scheduler stats. The deployment team did not see this as a problem given current usage. This would stop the historical usage impacting the experiments' desired shares for the exercise. On this topic it was also noted that we as yet have no way to distinguish between guaranteed and opportunistic shares for experiment work.
---> NOTE: this was discussed and the Tier-1 did not feel it was useful to wipe the stats because the scheduler might initially be unstable and the performance unrepresentative.
3) There needs to be clear communication to sites/users with regard to outages expected as the RAL WMS machines are moved. It is expected that all will be transitioned to R89 at the same time. In theory the WMSes have redundancy since we have instances at other sites, but ensuring that UIs point to them is another matter. This is an opportunity to test the failover options.
5) The EGEE April availability and reliability report was published last week (circulated to the PMB by John). The UKI figures were 94% availability and 96% reliability.
Both UCL sites showed a marked improvement in April, though UCL CENTRAL staff are still tackling frequent problems. We are working on getting correct logical/physical CPU figures published for every site.

For information:

A) There was a pre-GDB this week on NGI preparations in WLCG countries. Most countries seem to have something in place that will exist beyond EGEEIII, but few had a well developed NGI with a clear legal entity ready to sign the EGI MoU. The agenda is here: http://indico.cern.ch/conferenceDisplay.py?confId=56364
B) There was a GDB yesterday. The agenda is here: http://indico.cern.ch/conferenceDisplay.py?confId=45475. Core topics included an SL5 discussion and details of experiment data flows.
C) DPM 1.7 is now available - but only under 3.1 on SL4. Sites like Oxford wish to rearrange their storage, but currently if pools are drained, the pools lose all their space token information.
D) There was a user board meeting last week: http://www.gridpp.ac.uk/eb/060509/index.html.
E) EGEE09 registration is now open with "early bird registration" until June 30th: http://egee09.eu-egee.org/?id=630.
F) Next week is the last week that we contribute to a global COD effort. From mid-June a regional on-duty model will be in place.
G) There has been a first meeting of the UK CA TAG.

SI-6 LCG Management Board Report
---------------------------------

DB noted that he had explained the apparent downtime at RAL for ALICE prior to the meeting (the SAM test was not pointed at the CREAM CE that ALICE had moved to). He had also queried a comment about some sites not adequately reporting/explaining downtimes and had been assured that RAL was doing all that was required. There had been an update on EGI (largely covered by the first item in this meeting); further discussion about the mandate for the wLCG Technical Forum; and a proposal for an MSS metrics dashboard. The latter will be discussed further.
SI-7 Dissemination Report
--------------------------

SP noted that we should go to the All Hands Meeting in Oxford in December as an NGI but will be required to pay some of the fee for a stand as STFC now had no budget in this area. The call for AHM abstracts is imminent and we need to decide whether to submit a GridPP paper again.

REVIEW OF ACTIONS
=================

Due to the lateness of the hour, the actions were not reviewed.

ACTIONS AS OF 14.05.09
======================

332.1 AS to provide a plan for the tape drives: The experiments would be producing new numbers by ~end of March which would enable this action to progress. AS now has info provided by CMS and Graeme Stewart - he will begin work on this soon.
339.8 JC to follow up VO Registration cards. JC reported that some VOs need to be decommissioned; there were also VOs at institute level but these do not appear on the CIC operations portal.
341.3 NG to consult/inform GRIDPP PMB on responses to EU questionnaires.
341.5 JC to investigate how EGEE VOs request resources (relates to enabling more VOs in the UK).
343.7 AS to look into the CASTOR/ATLAS instance availability (in the light of the experiment red metrics). AS had spoken to GS & RJ and asked for a breakdown to achieve consistency - responses awaited.
345.1 JG to speak to RT regarding GridMon and GridPP funded network effort.
346.1 DB to check the current situation re website work and raise the whole NGI website issue with NGS.
346.2 DB to raise the issue of SL5 and 32/64-bit middleware at MB & GDB level, for policy decision/direction.
346.3 PC & AS to speak to Robin Tasker regarding receipt of high-level or ticket information from JANET on the service.
346.4 JC to summarise the responses from Camont for the PMB in relation to security, reliability, commercial use of networks, and suitability/acceptability of content.
347.1 DB to respond to STFC w.r.t. the Tier-1 Hardware schedule.

The meeting closed at 2.50 pm.
It was confirmed again that the next PMB meeting in May would take place on Thursday 21st. There would be no PMB in the week commencing 25th.