GridPP PMB Minutes 346 - 5th May 2009
========================================

Present: David Britton (Chair), Sarah Pearce, Tony Doyle, Andrew Sansum, Glenn Patrick, Robin Middleton, Jeremy Coles, Pete Clarke (Suzanne Scott - Minutes)

Apologies: David Kelsey, Tony Cass, Neil Geddes, Roger Jones, Steve Lloyd, John Gordon, Dave Colling

1. Quarterly Reports
=====================

SP reported that she had received many of these now - a few were still to come, from: RJ, JC, DC - and reports from TC, JG and RM were also still awaited.

2. NGI NGS-GridPP
==================

JC advised that he had circulated a list of actions from the recent phone meeting. He noted that the meetings did tend to get sidetracked into other issues, for example last week they discussed deploying the CE; helpdesk issues; CIC portal and Regional COD; the dashboard; portal policy; and the website.

JC reported that for sites looking to join EGEE, the GridPP wiki had previously provided all info, but NGI needed an NGI-specific website, with pointers, for now, to the GridPP website. DB advised that the NGS website was being worked on by David Fergusson - DB had advised him to liaise with Neasan O'Neill. JC noted that Claire Devereux had also been working with a summer student on website issues. DB would check the current situation.

JC noted that there was the issue of NGI identity - which didn't currently exist. JC advised that Ian Bird was looking for a planning update (of around 10 minutes) from each country before the next GDB meeting. TD advised that the name NGI.ac.uk should be registered as soon as possible.

ACTION 346.1 DB to check the current situation re website work and raise the whole NGI website issue with NGS.

3. Supernemo VO issues
=======================

DB referenced the Supernemo email which had been circulated, informing of their experience of the Grid. GP noted that there wasn't much to add - no more info was available at present.
There had been the issue of the VOMS Service Cert renewal, and DB also noted communication issues. It was agreed to revisit this in due course in relation to possible actions.

4. Week's Notes
================

DB reported on a major cooling problem at IN2P3. AS had provided a response in relation to similar potential difficulties at the Tier-1. AS noted that at RAL he would expect that the automation would detect the temperature rise at the machine room sensors and pull the plug on the power. Unfortunately this was one of those things that was not easy to test, as accidental shutdown of the whole cluster was undesirable; however, the script components and dummy mode had been tested. DB asked whether they could test this when doing the machine room move. AS noted no, they were building a new monitoring infrastructure for the new room anyway.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS reported as follows:

Fabric:
1) Migration to R89 is expected to commence on Monday 22nd June and take 2 weeks. We are finalising details of the shutdown/startup and will make a detailed announcement next Tuesday.
2) Disk, CPU and robotics deliveries are being scheduled. We expect to install w/b 18th May.
3) We suffered a possible recurrence of the home filesystem deletion on Thursday 30th April; however, this involved just one system admin directory. We are not certain of the cause but have identified one further test installation script that was not modified when the main installation system was updated.

Staffing:
1) The first experiment support post has been accepted. I have been verbally notified that STFC has approved the second post for advertisement.
2) The EGEE PPS recruitment failed and we are seeking authorisation from STFC to re-advertise.
3) The YII student (funded by ESC) is expected to start in July.
4) The CASTOR d/b admin has not yet been approved for external recruitment.
Service:
1) SAM availability last week was 100%.
2) CASTOR
   b) The ORACLE database RAID array upgrade tests have now been carried out. We expect the upgrade to take 2 days; it is scheduled for 18-19th May and will cause downtime for all CASTOR instances.
3) The FTS and LFC are planned to be upgraded to new hardware on 6th May. This will lead to a 1 day downtime of these services.
4) A rolling re-configuration/upgrade of the CEs is underway:
   Lcgce04 - 05/05/09-08/05/09  CMS LHCb Alice
   Lcgce05 - 11/05/09-14/05/09  Atlas LHCb
5) Work on an SL5 service is underway. A test service is nearly ready for deployment.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ was absent this week.

SI-3 CMS weekly review & plans
-------------------------------

DC reported that it had been generally a quiet week, with the UK generally OK and Green. Because of the way that the UK use T3 sites that are otherwise T2 sites, they will have savannah squads for them (unique to UK). DC notified CMS management of the proposed move date and they were planning to have a discussion as to what this means for STEP09 sometime this week. CMS has just started thinking about SL5 ... no firm plans yet.

SI-4 LHCb weekly review & plans
--------------------------------

GP reported as follows:

The only new item to note this week is that LHCb has studied the results of performing a large analysis test at the 6 Tier 1 centres plus CERN. The procedure was to take working analysis code and run over large data samples several times each week. Each submission consisted of around 600 jobs, each reading 100 different files (500 events/file, av. file size=200MB). First results showed that dCache sites (GridKa, IN2P3, NIKHEF, PIC) had worse performance than Castor sites (CERN, RAL, CNAF). This was solved by tuning dCache parameters.
The RAL processing time was 0.10 s/event, which compared well with other sites, but there was also a "large" (relative to other sites) time to open files of 4.9s (typically nearer to 1s at other sites). The reason for this is still being investigated with the RAL Castor team.

SI-5 Production Manager's Report
---------------------------------

1) Several of our sites have now enabled queues to SL5 WNs. Generally these are set up for ops/dteam as there is no clear indication from the experiments that they are ready to start testing, or at least match against SL4/SL5. It has been noted that FZK were recently warned that if they did move to SL5 now then they would not be receiving as much work. Graeme mentioned in today's dteam that ATLAS have not yet started to work on SL5 testing - a concern mainly for the job wrappers more than the experiment code. The question that arises is what planning we should be doing for the switch to SL5: switching well before data taking commences would seem sensible, but the experiments are not ready. Only one site has indicated that they are unable to run 64-bit middleware (the only option from gLite 3.2).

DB advised that this issue needed to go through the MB & GDB as a policy decision - there was no clear direction on this as yet. There followed a discussion on SL5 and high-level planning - there was general agreement that clarity was needed.

ACTION 346.2 DB to raise the issue of SL5 and 32/64-bit middleware at MB & GDB level, for policy decision/direction.

2) There were problems within the core JANET network at the end of last week. Without GridMon we could not tell the impact on our site connections. The problem notification route is not clear to me - is there some active notification or do we need to keep an eye on JANET's website for information?

There was a discussion of information received on JANET in the past. PC noted that receipt of tickets would be useful. DB suggested consulting Robin Tasker.
PC noted he was at RAL on 21st and would speak to RT then.

ACTION 346.3 PC & AS to speak to Robin Tasker regarding receipt of high-level or ticket information from JANET on the service.

3) Manchester's MON box has now been fixed so the site is now publishing again. It looks likely that some accounting data for January/February has been lost.

SI-6 LCG Management Board Report
---------------------------------

DB noted that he hadn't been able to attend because of another meeting clash. Issues covered by the meeting included: the weekly ops report; a mass storage outage at IN2P3; FTS problems at CNAF; CERN/CASTOR issues; and the Oracle/BigID problem, which can now be generated in a test instance - work was ongoing. A summary from the Resource Review Board (RRB) and updated experiment requests had been presented. The Resource Scrutiny Group had reported that 08-09 pledges should be suitable for 09-10 - this was contested by the experiments. It had been proposed that the two groups work together and make recommendations by the summer relating to 09-10 resources. The meeting had noted that the EGI transition was a 'critical' issue, and should occur without disruption. There were discussions on storage.

DB advised that we should proceed with the next procurement process but bear in mind that there might effectively be a 6 months' delay to handle at some point. DB advised that we should not make the decision to delay the hardware at the moment, although there may be a delay later on. AS agreed, noting that we should write that into the pre-qualification stage of tenders.

SI-7 Dissemination Report
--------------------------

SP reported that Neasan O'Neill was currently in Athens with the SSC workshop for EGI. The old Googlemap on the front page of the website had been replaced by a Tier-1 CPU plot. Eventually it would be good to put a Tier-1 dashboard here - this was being worked on.
AS noted that the Tier-1 dashboard was in the 'late milestone' list - a plan did exist for attending to it - it will be September now before it is addressed, but the schedule of work will deliver it.

REVIEW OF ACTIONS
=================

332.1 AS to provide a plan for the tape drives: The experiments would be producing new numbers by ~end of March which would enable this action to progress. O N G O I N G

332.3 PC to pursue the issue of the network resilient link - providing installation costs and annual costs, and report-back to the PMB. PC had sent info round, and Robin Tasker would provide further info at GridPP 22 UCL. The issue would need to be referred back to the OC. Action DONE, but left open as placeholder. PC reported that a document now existed. DB would add the map to show the CERN/Tier-1 relationship, then the document should be complete for the OC. DONE, item closed.

336.1 JC to document procedure for ensuring black-listed sites are re-instated. JC noted that experiments have different ways of 'blacklisting' (re top level results & switches etc). This issue was delegated to experiment reps in dTeam; the issue is reviewed each week at dTeam to advise of blacklisted sites. TD noted that a weekly review was fine. DB agreed, noting that the 'procedure' was less relevant than the fact that it was actually happening regularly. It is now a dTeam standing item, and there are links for each experiment to see which sites are blacklisted. DONE, item closed.

339.8 JC to follow-up VO Registration cards. JC reported that some VOs need to be decommissioned; there were also VOs at institute-level but these do not appear on the CIC operations portal. Decommissioning was still to be done. O N G O I N G.

341.3 NG to consult/inform GRIDPP PMB on responses to EU questionnaires. O N G O I N G.

341.5 JC to investigate how EGEE VOs request resources (relates to enabling more VOs in the UK). JC would check the Minutes for clarity on this issue.
TD noted it related to the request route for VOs in the UK. O N G O I N G.

341.7 SP to chase non-functioning of GridMon with Robin Tasker and report back. DONE, but here as placeholder - Mark Leese still to respond. DONE, item closed.

343.4 JC to raise the N-gram issue at dTeam and update the VO ID card. JC had received info generally re acceptable use policy. Card was done. DONE, item closed.

343.6 JC/JG to ensure that the joint taskforce meet and consider issues: work was clearly required on the table presented by AR - more columns needed to be added: NGS approach, GridPP approach, an outline map of funding, service overlaps. DONE, item closed.

343.7 AS to look into the CASTOR/ATLAS instance availability (in the light of the experiment red metrics). O N G O I N G.

343.8 TD to follow-up overdue milestone in relation to storage usage per user in a VO. TD had followed-up with Jens, progress was being made. DONE, item closed.

344.1 DB to respond to Janet Seed's request, but highlight to her that procedure for grant release needs to be fairly speedy, and not take 9 months. DONE, item closed.

344.2 SL to raise the benchmarking issue at dTeam in relation to the DB Minutes. NorthGrid & London were aware of the need for benchmarking. DONE, item closed.

344.3 JC to follow-up concerns with dTeam, over security and bandwidth in relation to Camont needs at UKI sites. Karl Harrison & Mark Slater had been invited to the UKI meeting - issues had been discussed. DONE, item closed.

344.4 JC to contact Camont and seek assurance from them that they have thought-through issues of security/reliability/commercial use of networks, plus other concerns raised about content. 4 responses had been received, addressing each of the issues. DONE, item closed.

ACTION 346.4 JC to summarise the responses from Camont for the PMB in relation to security, reliability, commercial use of networks, and suitability/acceptability of content.
344.5 TD to raise the 'inactive category' action 282.8 with SF (cc RM) re adding to the quarterly report. This needs to be documented. This was now being reported as part of the Quarterly Reports. It was agreed to delete inactive item 282.8. DONE, item closed.

344.6 GP to contact the ILC community and give our support, highlighting that there may be possible contention issues re STEP09 and fairshares. DONE, item closed.

345.1 JG to speak to RT regarding GridMon and GridPP funded network effort. O N G O I N G.

345.2 JC to follow-up the issue of UK Tier-2 sites not implementing the pilot role as requested in the LHCb VO-card - and report-back to the PMB as to why this had not been generally actioned. JC reported that the initial request had been sent round by himself just before Christmas - it hadn't been fully picked-up and was also not seen as a high priority for experiments. The request would be re-issued by dTeam. DONE, action closed.

345.3 JC to survey the rest of the Tier-2 sites re 32/64bit and SL4/5 hardware capabilities to find out if this problem is widespread. JC reported that this was not a widespread issue - sites have old hardware to be decommissioned. It was a Liverpool-only issue at present. DONE, item closed.

345.4 JC to ascertain from the experiments an indication of timescale in relation to the move to 64-bit SL5 (although it was expected that this would not be an issue currently) - JC to alleviate Liverpool's concerns meantime. JC reported that 6-12 months was a possible timescale. DONE, item closed.

345.5 GP and RJ to check their experiments' timescales in relation to the move to 64-bit SL5. This was now being addressed at higher-level. DONE, item closed.

ACTIONS AS AT 05.05.09
======================

332.1 AS to provide a plan for the tape drives: The experiments would be producing new numbers by ~end of March which would enable this action to progress. AS now has info provided by CMS and Graeme Stewart - he will begin work on this soon.
339.8 JC to follow-up VO Registration cards. JC reported that some VOs need to be decommissioned; there were also VOs at institute-level but these do not appear on the CIC operations portal.

341.3 NG to consult/inform GRIDPP PMB on responses to EU questionnaires.

341.5 JC to investigate how EGEE VOs request resources (relates to enabling more VOs in the UK).

343.7 AS to look into the CASTOR/ATLAS instance availability (in the light of the experiment red metrics). AS had spoken to GS & RJ and asked for a breakdown to achieve consistency - responses awaited.

345.1 JG to speak to RT regarding GridMon and GridPP funded network effort.

346.1 DB to check the current situation re website work and raise the whole NGI website issue with NGS.

346.2 DB to raise the issue of SL5 and 32/64-bit middleware at MB & GDB level, for policy decision/direction.

346.3 PC & AS to speak to Robin Tasker regarding receipt of high-level or ticket information from JANET on the service.

346.4 JC to summarise the responses from Camont for the PMB in relation to security, reliability, commercial use of networks, and suitability/acceptability of content.

The meeting closed at 2.15 pm. It was confirmed again that the next PMB meetings in May would take place on Thursday 14th and Thursday 21st. There would be no PMB week commencing 25th.