GridPP PMB Minutes 353 (13.07.09)
=================================

Present: David Britton (Chair), Tony Doyle, Andrew Sansum, Robin Middleton, Dave Colling, Pete Clarke, David Kelsey, Jeremy Coles, Roger Jones, Steve Lloyd, Tony Cass

Apologies: Sarah Pearce, John Gordon, Glenn Patrick, Neil Geddes

1. Programmatic Review
=======================

DB referred the meeting to the circulated email from John Womersley. DB reported that the Advisory Panel will contact us, but GridPP had been reviewed last time round. DB advised that he had contacted JW, who confirmed that we had already been reviewed, and that GridPP science will not be re-reviewed this summer; the previous ranking would stand. However, JW advised that by default GridPP4 would be considered as a new project and would go through the PPRP process, so we will need to provide information on GridPP4 in due course.

RJ noted that we do need to make a case to come out of PPRP and become part of the running costs of the experiment. DB confirmed that he had raised this with JW, and had raised it with Jordan Nash also, to move GridPP out of PPRP to be reviewed elsewhere. JW had responded that as far as the GridPP4 project was concerned, approval was not just in relation to exploitation; there was a significant hardware & service component that did need specialist review via the PPRP. DB noted that he would continue to try and discuss this issue with STFC, and others should take any opportunities presented.

SL advised that the next fight would be against the submission of individual JISC forms. TD noted that Tony Medland knew the background and finances of GridPP - the PPRP Chair was at least an Astronomer, which would be helpful.

2. STEP'09 Postmortem Workshop
===============================

DB asked if there were any high-level lessons that had been learnt?
RJ reported that he had been present for the first day - RAL had done well as a Tier-1; they only needed to worry about communication between the storage elements and the Worker Nodes in the Tier-2s - this multi-dimensional space was complex. DC confirmed that this was an important area and they needed to spend time looking at it.

AS advised that even if the Tier-1 analysis had gone wrong, they had met the requirement, and he asked if a 10Gb backbone was a problem for the Tier-2? Or was it an issue of either cash or bottlenecks? TD advised that the experiment issues related to the running mode for analysis. AS asked whether the analysis rate per node was higher on the Worker Nodes than at the Tier-1? He also asked if the Tier-2 had hardware reconfigure plans? DB noted that we could ask this latter question in relation to the next tranche of hardware funding. JC advised that the larger Tier-2 sites should look at this.

TD advised that it would take about £40k to upgrade the backbone, which was a 10% effect, and asked how many disks were behind each server? At Glasgow they had 20TB, and the backbone was in principle 2 x 1 gig, but in practice was less when used in dual channel mode. RJ noted that Graeme Stewart was planning a Post Mortem write-up. DB advised that we needed to look ahead at the cost for the Tier-2. TD agreed - within prevailing uncertainties - and it was natural to upgrade to 10 gig. DB noted that we needed to ensure pickup on this issue on the part of the experiments, who needed to define their conclusions (ATLAS & CMS).

JC noted that he had attended the PM Workshop remotely and had sent round a summary in his report. From the ATLAS point of view, the reprocessing had worked well, but there were backlog issues. RJ also noted data distribution issues. JC advised that the efficiency of sites didn't depend on either size or number of staff. RAL had worked well.
For LHCb, DIRAC had met experiment requirements ok, staging had gone well at all sites, but RAL had fairshare issues; the Tier-0 had worked well but there had not been enough overlap of activities. Reprocessing had been good at all sites, with good efficiencies. High-level issues were to be discussed at dTeam tomorrow. DB noted that there was no major response the PMB needed to make.

3. PP-only SSC - GridPP
========================

JG had asked DB to raise this issue. There were plans for a possible PP-only SSC - at least 3 partners were required, and CERN did not count as one of them. The question was: would GridPP be willing to participate? INFN were willing to get involved. RM & DC were asked what model this was? RM had not been present. DC reported that Jamie had outlined this SSC - it was not just LHC experiments, it was outside of LHC - it was a model of sending people to CERN to work, but overall seemed a reasonable approach.

DB advised that in terms of UK funding, we had no new money until April 2011. The timeframe for this was next summer, but it would involve 're-badging' current effort, although this might help get support in GridPP4 for certain posts. RM noted that it was closer to the experiment support posts - we could preserve them as 'matching effort'. DK asked whether personnel had to be co-located at CERN? DB wasn't sure - he would need to discuss this with Jamie. It was agreed that it would be better to have personnel in the UK and have a support centre here. Both DC and DK advised that this should be something we should be involved in; however, if it involved supporting other than LHC experiments, then going to CERN was not applicable.

ACTION 353.1 DB to contact Jamie & express UK interest in PP-only SSC - to re-badge effort until 2011, and not involving transfer to CERN - we would want personnel to be UK-based. This would help continue the experiment support posts in GridPP4.

4. Week's Notes
================

- AHM - DB asked what abstracts had been submitted? There had been three submitted from Glasgow (Andrew Pickford, Sam Skipsey & Stuart Purdie); there had been one from DC. DB asked that those involved should let SP know.

- Hardware - DB noted no info on the outcome of the LHCC review. There had been disagreement, and another meeting was taking place - did anyone have any further info? No-one had. It was noted that there would be no MB meeting this Tuesday.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS reported as follows:

Fabric:
1) Acceptance tests of the FY08 delivery are underway.
2) A site network upgrade was carried out on 7th July (also some internal Tier-1 firmware updates to switches). Following this we experienced considerable problems with CASTOR. Some of these were related to the network outage and some related to an unrelated database node crash some time earlier.
http://www.gridpp.rl.ac.uk/blog/2009/07/07/day-full-of-problems/
http://www.gridpp.rl.ac.uk/blog/2009/07/09/castor-problems-tuesday-wednesday-of-this-week/
The upgrade was not completely successful and a further intervention on the main site router may be required.
3) Interventions to the robotics are now complete. Non HEP media remains in the HEP robot but will gradually migrate across as it is used.

Staffing:
1) The first experiment support post has been accepted and is making progress - expected to start in early August. Interviews have been held for the second post and an offer is being prepared.
2) The EGEE PPS recruitment has been re-authorised and advertising is commencing.
3) The YII student (funded by ESC) will start this month.

Service:
1) SAM availability for the OPS VO was 92%
2) CASTOR
a) There is an upgrade scheduled today (Monday) to apply patches for the bigid problem; service is "at risk".
b) CASTOR will be upgraded tomorrow (Tuesday) to 2.1.7-27; a batch system drain has commenced.
c) Also scheduled on Tuesday is a change to the CASTOR configuration to remove the dependence on NFS for LSF. This may improve CASTOR stability in the face of network breaks.
3) The ATLAS LFC will be separated from the general LFC next Monday (20th July).

SI-2 ATLAS weekly review & plans
---------------------------------

RJ was now absent from the meeting.

SI-3 CMS weekly review & plans
-------------------------------

DC reported that the Tier-1 was looking good as per the dashboard. Imperial had problems last week due to power supply tests. Other than that, things were fairly quiet.

SI-4 LHCb weekly review & plans
--------------------------------

GP reported as follows:

1) Many Grid sites (Tier-2s so far) banned because they have been aborting pilots. Sites banned in the UK are Brunel and RAL-HEP. Possible bad publishing of information in the bdii, but investigations going on.

2) Other UK issues.
LCG.UCL.uk - Waiting for new CE to be brought online? GGUS ticket 48802.
LCG.Cambridge.uk - Problem with site server certificate.
LCG.LCG.UKI-SCOTGRID-ECDF.uk - Site still failing SAM jobs.

Outlook: 1 Billion event (minimum-bias) production proceeding - about 845 million events produced so far, at a rate of about 50 million events per day. User analysis as usual.

SI-5 Production Manager's Report
---------------------------------

JC reported as follows:

1) There was a GDB at CERN last week (http://indico.cern.ch/conferenceDisplay.py?confId=45477). John's introduction included a brief summary of a virtualisation and multicore workshop held at CERN recently - many open questions remain.

Dave Kelsey reviewed the status of various security policies: User-level accounting, VO portals and Security Incident response policies are now all under final call for comments. The VO registration and VO management policies have now been approved.

M.
Litmaath gave an update on glexec/SCAS and pilot jobs: glexec code still has bugs to be addressed and more testing is needed. Integration with ARGUS (the new authZ framework) is underway. For pilot jobs the main news was that the myProxy server has now been rebuilt with support for VOMS attributes.

Chimera migration was discussed again - led by Michel Jouvin. FZK, SARA and CCIN2P3 are all potentially affected by the known scaling problems with pnfs. There was disagreement about how sites arrived this close to data taking with a major update needing to be done.

Jamie Shiers looked at Tier-1 performance metrics with an attempt to list four criteria (such as meeting pledged resource levels) that a Tier-1 could use to say that it is "doing ok".

Andreas Unterkircher once again presented on the status of SL5. A meta-rpm that pulls in all required dependencies to run SL4 apps on SL5 64-bit has been produced and core components (e.g. POOL, COOL...) tested. Experiment testing less clear. A gLite 3.2 (SL5) UI and BDII has been released.

The afternoon concentrated on EGEE operational tools. For the overall architecture and components see https://twiki.cern.ch/twiki/bin/view/EGEE/MultiLevelMonitoringOverview. The main parts were:

1) Regionalised SAM monitoring: infrastructure will be NGI managed. New talk of ATP (Aggregated Topology Provider) which contains topology information of projects, grid-infrastructures, sites, services etc. Some already in place but no history. Also plans for an MRDB - Metric Results Database. SAM tests will migrate to Nagios and there will be a new SAM portal.

2) Distributed GOCDB: GOCDB4 = keep central service + build a sustainable regionalised architecture. Expect a change by November.

3) gstat2.0: tightly coupled with Nagios and uses Django for webpages. Runs from any BDII.

4) Accounting: Main new concept is that of an "Accounting Data Center". Future will see "regional accounting centers" and a move towards ActiveMQ in place of R-GMA.
Installed capacity update - fewer sites publishing "zeros" for their capacity figures. Sites should now be measuring HEPSPEC2006. Sites need to upgrade CE info providers and YAIM.

Finally there was a VOM(R)S working group wrap up talk: the group brought together developers of VOMS/VOMS-ADMIN and VOM(R)S. Concerns remain since many VOs use VOMS-ADMIN which is not compliant with JSPG policies and future interworking between products is dependent on individual tests.

There will probably not be an August GDB but it depends on issues arising at the STEP09 workshop.

2) There was a two-day workshop following up on STEP09: http://tinyurl.com/nfz2z5. There were talks from each of the experiments, overviews for the T1 and T2s and infrastructure wide services plus monitoring. We will discuss this at tomorrow's DTEAM meeting but are there any areas on which the PMB wish us to focus? Here is a quick summary of some of the main areas:

94 people registered. HEP related sessions scheduled for EGEE09.

ATLAS: STEP09 successful. Reprocessing from tape works well with CMS also active. RAL performance good. Data distribution generally good but backlogs can develop quickly. ATLAS central services worked well. Analysis rates at T2s encouraging but further tests needed. No strong correlation between T2 storage size and number of events analysed!

CMS: Generally positive outcome. Experience at large scale production gained and CRAB-server managed 130K jobs/day. No negative side-effects seen from bringing in a broader selection of T2 sites. The analysis pledge was easily met. Good tests of tape systems but further exercises needed to better tune infrastructure. Most T1 sites showed "an impressive operational maturity" and demonstrated good scalability and load tolerance. For some T1s, tape family setup is not optimal. For RAL "excellent in operations, quick and proactive to all issues", good logging, fair-shares worked and RAL tape could cope.
ALICE: STEP09 activities all successfully completed. Grid and interactive user analysis is now routine and gaining momentum. Re-pro from tape being tested in Aug/Sept. Analysis train is efficient.

LHCb: FTS data transfers successful. Data access remains a concern. Oracle access via CORAL imposes too high a load on the LFC so a workaround is in place (at job init stage). DIRAC was demonstrated to meet the experiment requirements. Generally data staging was smooth. T0-T1 transfers good.

Tier-1s: Many had only small problems. RAL had smooth operations and a low load on most services. RAL tape sustained 500 MB/s during peaks, but drive averages were less than expected. RAL batch SE->WN peaked at 3.7 GB/s but averaged 375 MB/s during reprocessing. Problem with fairshares addressed (related to 3GB vmem requirements). Overall STEP: inefficiency due to tape seek times non-negligible, CPU efficiency needs to be better understood. Some sites did not meet all the targets but will re-run some tests.

Tier-2s: Tier-2 contacts list did not work well for getting input for talk! STEP09 found to be useful for T2s due to analysis loads - competition with production interesting. Noted that for ATLAS 50% of analysis is done by 11 sites. There was a shortage of storage resource - some T2s had delayed procurements. Many sites affected by transfer instabilities and lcg-cp timeouts. Some data access and CPU efficiency problems noted - difficult to interpret some of the results as the analysis changed during STEP. Intra-VO fairshare seen to be a problem once there is competition for resource.

Tier-0: Noted that experiment ops run by a few experts. Question about what T0 should expect in terms of analysis activity. Ramp up generally smooth where areas had been previously tested but some problems with new activities like LHCb's LFC usage. More overlap of T0 activities would have been useful but overall T0 worked at required levels.

Reprocessing: ATLAS found PanDA and DDM workflow worked fine.
Running simulation and reprocessing together can cause problems with job slots being blocked. CMS found all but one site coped with data staging - but noted that competition with ATLAS was low due to them stressing the disk pools not MSS. Repro overall was smooth and sites achieved good efficiency when processing from disk - without pre-staging performance was much worse.

Middleware: Some problems seen, for example the globus-gma (RAL, DESY, CERN..) issue on LCG-CE. Too many concurrent jobs submitted to a single local account. Mapping problems seen at some sites. The talk mentioned upcoming changes in other areas such as glexec and a more robust BDII.

There were a few talks on monitoring; most indicated that the activity was going in the right direction.

3) The EGEE site availability and reliability report for June has been published (linked from here https://edms.cern.ch/document/963325/). UKI achieved 96% for both availability and reliability. The RAL Tier-1 availability was down to 81% due to the move, but the biggest factor for the figures was UCL-CENTRAL's 13% availability and 18% reliability as they continued to struggle with CE issues. Proxy timeouts due to their cluster being full also seemed to impact the results.

SI-6 LCG Management Board Report
---------------------------------

DB noted issues had been: SSC update; CMS Quarterly Report (including concerns about several Tier-1 performances (not RAL)); Update on gLexec / SCAS.

SI-7 Dissemination Report
--------------------------

DB noted that he had prompted Neasan O'Neill to put up a press release on the website last week.

REVIEW OF ACTIONS
=================

341.5 JC to investigate how EGEE VOs request resources (relates to enabling more VOs in the UK). JC reported that there had been a change of responsibility within EGEE - he was checking on the best person to ask.
JC reported that the process goes through the EGEE Broadcast system - it would be possible to monitor EGEE broadcast announcements to the list and allocate issues as appropriate to dTeam or to the PMB depending on whether they related to operational or policy areas. JC to document the process so that it is available through the GridPP help pages. DONE, item closed.

346.3 PC & AS to speak to Robin Tasker regarding receipt of high-level or ticket information from JANET on the service. AS had sent a reminder. ONGOING.

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly setup environments. This was still in progress - he needed to extract data but was busy with STEP at the moment. ONGOING.

350.5 JC to check and verify that the contact list on the GOCDB is up-to-date - to be done by September. In progress - ONGOING.

351.3 ALL: drafts of all reports for the OC to be sent to DB by 10th August (project status papers) and 17th August (all other papers).

352.1 PG (JC) was asked (in lieu of JC) to note to dTeam that the revised security policy documents were on their final call - deadline for major objections was 14th July. DONE, item closed.

352.2 PG (JC) to raise at dTeam the issue of HEPSPEC06 benchmarking and corrected accounting records. DONE, item closed.

ACTIONS AS AT 13.07.09
======================

346.3 PC & AS to speak to Robin Tasker regarding receipt of high-level or ticket information from JANET on the service. AS had sent a reminder - no reply as yet.

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly setup environments. This was still in progress - he needed to extract data but was busy with STEP at the moment.

350.5 JC to check and verify that the contact list on the GOCDB is up-to-date - to be done by September.

351.3 ALL: drafts of all reports for the OC to be sent to DB by 10th August (project status papers) and 17th August (all other papers). By next PMB this will be a pressing issue.

353.1 DB to contact Jamie & express UK interest in PP-only SSC - to re-badge effort until 2011, and not involving transfer to CERN - we would want personnel to be UK-based. This would help continue the experiment support posts in GridPP4.

The next PMB would take place on Monday, July 27th at 12:55 pm. Agreed further dates were as follows:

July 20th - canceled
July 27th - PMB
Aug 3rd - canceled
Aug 10th - PMB
Aug 17th - canceled
Aug 24th - PMB
Aug 31st - canceled (UK Bank Holiday)
Sep 2nd - special PMB to address issues re OC meeting
Sep 7th - F2F at Cambridge