GridPP PMB Minutes 365 (02.11.09)
=================================

Present: David Britton (Chair), Sarah Pearce, Andrew Sansum, Tony Doyle, Jeremy Coles, Pete Clarke, Steve Lloyd, Roger Jones, Robin Middleton, John Gordon, Suzanne Scott (Minutes)

Apologies: David Kelsey, Tony Cass, Glenn Patrick, Dave Colling, Neil Geddes

1. Status of Quarterly Reports
===============================

SP reported that she was still waiting for Quarterly Reports from RJ, DC, AS, JC (draft had been submitted), RM. SP had received around 50% of the reports, but the remainder were urgently required. DB advised that he needed to review the issues for the quarter, probably the week after next, preferably in November.

2. Tier-1 Review
=================

DB confirmed the Tier-1 Review for 14th December. Jamie Shiers, Maria Girone and TC would act as external reviewers. Michael Ernst could not make the meeting due to clashing priorities. Regarding timing, most people had indicated a preference for an earlier start and a 4.00 pm finish (re flights). DB noted he could arrange the schedule differently: there would be some scene-setting at the start, so it would be possible to change to a 10.00 am start and 4.30 pm finish. He would discuss this offline. DB would circulate timings, once finalised.

3. Status of Tier-1 Issues
===========================

a) EMC hardware
----------------

AS reported that they had been carrying out checks and investigations over the past 7 days - they had closed off an avenue - there was a definite problem with the electrical supply, although this was difficult to pin down. They were escalating with the hardware suppliers but there was no further info as yet. The conclusion from the recent meeting was that they needed to formulate a statement to suppliers, informing them that the supply is clean, and that it is now the supplier's issue to resolve - a clear statement was needed from the electrical team.
DB noted that they should state to the company that the power supply was within spec, but that the equipment didn't work. AS agreed, noting that the other item they had agreed at the DM meeting was that they would prepare to move 2 of the 4 arrays into the LPD room. They would find out if the kit could be run there. They would continue to de-bug two of the units, and run two in a stable environment. They could use the ATLAS centre but would prefer not to. They were checking the current hardware configuration in relation to spares and maintenance. They still had to decide what, if anything, was being moved onto the ATLAS environment, and a schedule was required.

b) CASTOR data loss
--------------------

AS reported there was an ongoing review within the Department re data retention, and issues generally for the short and long term. In the short term, discussions with ORACLE were close to a conclusion; they were confident that they could run with 2 raid arrays, and recommendations had been received in relation to changes in operational procedures. They were also reviewing back-up procedures. They needed to be confident, before going back into a paired array configuration, that it would operate as required.

c) Current Disk Procurement
----------------------------

AS advised that there was no final update available - hardware issues were under load test and three different problems had arisen: one was batch-related; one related to drive firmware; the third was the raid controller firmware. The outcome of testing had been mixed - there was no clear-cut result and a final analysis was awaited.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS reported that, in addition to the above, there were also tape robotic issues - the engineers were in today looking at the fibre-channel drives.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ had no items to report at present.
SI-3 CMS weekly review & plans
-------------------------------

DC reported that things were fairly quiet at present. Discussions were ongoing in relation to possible improvements to be made, following the October exercise - these related to management issues. DC noted that, judging from the blackboard, the UK sites seem to have done fairly well last week. Imperial had low availability because of the BDII problems, but it had 100% reliability for the time that it was back up. Brunel had problems associated with trying to move to their new machine room, resulting in poor reliability this week (they hope to have the current set of problems fixed by Wednesday).

SI-4 LHCb weekly review & plans
--------------------------------

GP was absent.

SI-5 Production Manager's Report
---------------------------------

JC reported as follows:

1) Following the ATLAS request for Frontier/SQUID in the UK, sites have made good progress. There is now a frontier server at RAL and this is currently being tested. The Tier-2 status (https://twiki.cern.ch/twiki/bin/view/Atlas/T2SquidDeployment) is such that 3 sites have yet to deploy Squid.

2) Several decisions were made by the WLCG MB last week that impact operations. Specifically:

"WLCG sites should now deploy a CREAM CE service in parallel with the production LCG-CEs in order to gain real production experience. Although the ability to submit from Condor clients is still missing, the site installation will not change when this is available. The CREAM CEs should also now be marked as “production” in the information system.

glexec and SCAS are now ready for deployment. Experiments need to run multi-user pilot jobs, particularly for analysis. It is important that sites are able to support this in accordance with the agreed policies through the deployment of glexec and SCAS. This deployment should happen rapidly now".
We currently see SCAS as the more urgent priority and plan on having one instance per Tier-2 in the coming few weeks. The Tier-1 is starting on SCAS deployment this week. Once clear about the stability and deployment process, further sites will be aided in the deployment. It would be useful to have an indicative date by when the experiments intend to switch on using multi-user pilots.

For glexec an issue for some sites is the need to have root access to the WNs (requires setuid root). It is not clear what sites, using the WN tarball, need to do to install glexec to the WNs.

For CREAM, the UK already have 3 sites that have working installations. The next step is to move these to the production state and then deploy further CEs so that we have one instance in each Tier-2 (to aid knowledge transfer). Wider deployment is likely to take place on the January/February timescale but any site is free to deploy earlier.

3) There is a WLCG Collaboration Board meeting at CERN on November 13th (PM) to address the preparedness of WLCG sites for the transition from EGEE to EGI. Each Tier-1 is asked to send a representative to give a short update on their readiness for and the impacts expected from moving to the NGI.

4) UK sites did participate in the ATLAS user analysis tests last week, although not all intended data sets were in place due to the inability to transfer the data while clean-up operations were underway following the Tier-1 disk server problems and data losses. Interestingly this week ATLAS has indicated its plans to reduce MC distribution in the UK to one copy due to space constraints. Previously (when disk was plentiful) ATLAS UK had decided to run with two copies of data and MC across the T2s. Now with MC disk usage well over expectations a cut is being made. The UK pledge to ATLAS is 297TB and they are currently using 333TB.

5) Brunel (storage) and Lancaster are both having teething problems with new hardware.
6) Decommissioning of an old SE is taking place at QMUL, while at Manchester its two DPMs are being merged.

SI-6 LCG Management Board Report
---------------------------------

JG gave an update of GDB issues in relation to SCAS and the CREAM CE. There had been a weekly ops report at which the issue of data loss at RAL had been discussed. Under AOB, various issues had arisen:

- DPM support: DB had been asked to bring this up. CERN noted that DB's facts were wrong and that no issue existed. This is clearly not the case. It was noted that something would no doubt happen in relation to this, and DB would discuss DPM further with Dirk.
- Staged deployment of resources: there is a wiki page.
- It was noted that the 2010 pledge installation was agreed as June, not April.
- issues re Christmas and what activity would be happening.
- there was a plea that, re the request for Tier-1 accounting comments, responses are sent in a timely fashion.
- encouragement of sites re database workshop on backup and integrity - RAL had been noted as not being involved, when in fact they had organised the event last time round.

SI-7 Dissemination Report
--------------------------

SP reported that Neasan O'Neill had been working on different versions of the CUE proposal. A new version of the RTM would be released shortly. He was preparing for SuperComputing - were RJ or DC going? RJ noted he would speak to DC and then would let SP know. SP had received a report from Karl re the Summer student at Birmingham, which she would circulate.

AOB
===

Tier-2 Hardware
---------------

SL reported that since the last discussion, he had updated the global requirements, the UK shares and the costings provided by DB. The changes meant that the Birmingham share had gone up, and Durham was re-balanced in relation to ALICE. SL had put the costs one year in advance with a 10% margin. DB asked SL to send the information to the experiment representatives and give them a deadline to respond.
After this, the information should go to the CB - SL should send them a high-level summary only, focussing on past performance at sites, noting that this has been done with experiment approval, and attach the final table. It was agreed that SL would check the pledge fraction and increase the total kit overall.

UKQCD
-----

JC asked about George Beckett and the funding request - following the email which DB had circulated. DB reported that this was a request from an 'other' experiment, UKQCD. We did not get support to fund them in GridPP3, therefore they had written a proposal to OMII UK and they received funding. Now they needed to use the Grid, and this would come under GP's remit at the User Board, also involving Janusz and Stephen Burke. The request needed to go to the UB in the first instance. The issue would need to be raised with SB in order to find out what help UKQCD require. DB had already cc'd SB; and he had suggested that George Beckett get in touch with GP direct.

REVIEW OF ACTIONS
=================

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly setup environments. This was still in progress - DB noted that the next Quarterly Reports will help and possibly render the action redundant. SP asked that this remain open until the next Quarterly Reports. ONGOING.

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring. ONGOING.

358.1 SP to work with the working group on the following issues in relation to GridPP/NGS convergence:
1. identify Institutes
2. identify manpower
3. decide who is bidding for what
- a draft transition plan would be made available by the end of the year; GridPP4 requirements would also be considered. SP was waiting on the Working Group to reply to her.
SP reported there had been an email exchange and she had sent suggestions on how to move forward. JC had met with Andy Richards at RAL. One of the issues was uncertainty in relation to funding - SP needed more detail re resources and options for the future, also for EGEE-funded manpower at present and what we have signed up to in NGI. She was awaiting a response. DB noted that NG was trying to understand the proposals and the funding fractions. In defining GridPP4 we needed to define these posts and responsibilities. JC noted that JG had suggested going through the EGI proposal document for info. ONGOING.

359.4 JC to follow up dTeam actions from the DB, as follows:
---------------------------
05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda).
---------------------------
JC would check the situation with Graeme Stewart (who was currently on annual leave). ONGOING.

359.5 Lee Barnby (experiment rep) still to contact Neasan O'Neill, advising where, specifically, the GridPP website should point to for his experiment, in terms of user support information. (Graeme Stewart, DC & RN had already done this). SP to follow-up. ONGOING.

359.6 SP to ensure that Neasan O'Neill updates the GridPP website accordingly (once experiment reps have provided info as to where the GridPP website should point to for each of their experiments, in terms of user support information). ONGOING.

361.3 JC and AS to check Tier-1 and Tier-2 gstat2 results (in relation to SL5 having been discussed at the GDB). JC reported that he had checked this, but information from AS was still awaited. JC noted that there were issues with the results. ONGOING.

ACTIONS AS AT 02.11.09
======================

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly setup environments.
This was still in progress - DB noted that the next Quarterly Reports will help and possibly render the action redundant. SP asked that this remain open until the next Quarterly Reports.

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring.

358.1 SP to work with the working group on the following issues in relation to GridPP/NGS convergence:
1. identify Institutes
2. identify manpower
3. decide who is bidding for what
- a draft transition plan would be made available by the end of the year; GridPP4 requirements would also be considered. SP was waiting on the Working Group to reply to her. SP reported there had been an email exchange and she had sent suggestions on how to move forward. JC had met with Andy Richards at RAL. One of the issues was uncertainty in relation to funding - SP needed more detail re resources and options for the future, also for EGEE-funded manpower at present and what we have signed up to in NGI. She was awaiting a response. DB noted that NG was trying to understand the proposals and the funding fractions. In defining GridPP4 we needed to define these posts and responsibilities. JC noted that JG had suggested going through the EGI proposal document for info.

359.4 JC to follow up dTeam actions from the DB, as follows:
---------------------------
05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda).
---------------------------
JC would check the situation with Graeme Stewart (who was currently on annual leave).

359.5 Lee Barnby (experiment rep) still to contact Neasan O'Neill, advising where, specifically, the GridPP website should point to for his experiment, in terms of user support information. (Graeme Stewart, DC & RN had already done this). SP to follow-up. ONGOING.
359.6 SP to ensure that Neasan O'Neill updates the GridPP website accordingly (once experiment reps have provided info as to where the GridPP website should point to for each of their experiments, in terms of user support information).

361.3 JC and AS to check Tier-1 and Tier-2 gstat2 results (in relation to SL5 having been discussed at the GDB). JC reported that he had checked this, but information from AS was still awaited. JC noted that there were issues with the results.

The next PMB would take place on Monday 9th November at 12:55 pm.