GridPP PMB Minutes 344 - 20th April 2009
========================================

Present: David Britton (Chair), Sarah Pearce, Tony Doyle, Andrew Sansum, Dave Colling, Roger Jones, Glenn Patrick, Steve Lloyd, Tony Cass, John Gordon (Suzanne Scott - Minutes)

Apologies: David Kelsey, Robin Middleton, Jeremy Coles, Pete Clarke, Neil Geddes

2. Quarterly Reports
=====================

SP reported that she had sent reminders in the last few days. She had created a new Quarterly Report template for the year, and this had been sent to everyone. SP asked the PMB to note that the Reports were due by month end.

1. Status/Timetable: R89?/CASTOR 2.1.8?/STEP09?
================================================

AS reported that he was planning on the assumption that R89 would be available, based on the consultants' opinions. He believed that the machine room would perform as required. DB asked whether the new machine room was spec'd without much redundancy? What would happen if it came in 'under spec'? AS replied that the cooling capacity was there but the airflow was not strong. TD commented that 8, 7, or 6 kW per rack was a very low figure, looking forward. DB asked what fraction of the room would be filled by current and 2009 purchases? AS advised that it would likely be 50% full. DB advised that we could not migrate to the room if we couldn't cool the kit properly. AS advised that we could see, as things progressed, how the room responds to the incoming kit - they won't move the Tier-1 in if there are likely to be problems. DB asked what was the fallback position re the next purchase of kit, in relation to cooling capacity? AS noted that he wasn't unduly concerned at the moment, but was taking things forward carefully.

DB asked about STEP09? AS reported that this would affect the first two weeks of June, probably to the 16th. JG advised that the overlap was the middle two weeks in June. There was a discussion on timescales.
DB noted that the experiments need to note the Tier-1 downtime for the move to R89, which is likely to happen soon after STEP09, and factor this into their schedules.

DB asked what the consensus was on CASTOR 2.1.8? AS responded, essentially, no. He would report in the Tier-1 report (below) that he had a clear and detailed response from CERN re support of 2.1.7, which looks positive. RJ noted that the upgrade might therefore happen during data taking. DC asked what the advantage of upgrading was? JG advised repacks and xrootd. AS observed that the experiments were not pushing this as urgent - the upgrade was not needed at present from their point of view. GP confirmed that he favoured reliability and stability over new features.

3. Tier-2 Hardware
===================

DB reported that there had been a request from Janet Seed in relation to the Tier-2 Hardware spend, either this or next year. DB advised that four institutes were keen to get funding this year (Edinburgh, Cambridge, Sheffield and Liverpool), however the majority were happy to wait until next year. Two institutes had not responded. DB proposed that he tell Janet that we want £400k this financial year, with a proportion deferred until next financial year. This was agreed.

ACTION 344.1 DB to respond to Janet Seed's request, but highlight to her that procedure for grant release needs to be fairly speedy, and not take 9 months.

4. Resource Review Board (RRB) @ CERN
======================================

DB had circulated an email relating to points for Janet Seed to address with respect to UK participation in the experiments. TD asked whether the experiment global numbers had been signed off? No. JG advised that Ian Bird was writing a combined paper at present. AS asked about internal planning for tape? DB advised that we should use the new capacity numbers, the current bandwidth numbers will not be the final ones, and we should keep the current bandwidth model intact.
We should also stick with current tape drives but reduce tape capacity in line with experiment requirements. DB asked if there were any other points to raise with Janet Seed for the RRB? None.

There followed a discussion about Benchmarking re the Tier-2 hardware. SL asked whether the logs had been kept long enough, and was it possible to do benchmarking? DB advised that, on examining the batch numbers, it appeared that there were numbers for CPU which were being both over- and under-reported - we needed to address gross errors. The Deployment Board had decided to use HEPSPEC06 numbers, however it was understood that this would not be accurate for all sites. DB noted that we need to understand what is possible - then decide how to do this. SL confirmed that the logs were therefore required. It was noted that there was an action regarding this issue from the Deployment Board meeting. DB advised that it works out ok unless a lot of kit is installed during the period - we need to establish a process quickly. JG cautioned about numbers of jobs being run.

ACTION 344.2 SL to raise the benchmarking issue at dTeam in relation to the DB Minutes.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS reported as follows:

Fabric:
1) The R89 machine room will probably be considered to be suitable to accept equipment. We expect to make this decision this week.
2) Migration to R89 is expected to commence (subject to the above) on Monday 22nd June and take 2 weeks. Once the above decision is made we will contact hardware suppliers who are expected to carry out the migration. Once we have confirmation that they can carry out the work we will publicise the dates. We expect this process to take a further 2 weeks.
3) Disk, CPU and robotics deliveries are provisionally scheduled to take place about 20th May (subject to R89 and supplier confirmation).

Staffing:
1) An offer remains outstanding on one experiment support position.
STFC has not (re-)approved the second post.
2) Shortlisting has been completed on the EGEE funded PPS post.
3) An offer has been made on a YII student (funded by ESC).
4) The CASTOR d/b admin has not yet been approved for external recruitment. An internal search for suitable staff is underway.

Service:
1) SAM availability last week was 100%.
2) CASTOR
a) We have now received input from the experiments wrt plans for a CASTOR 2.1.8 upgrade. See Matt Viljoen's note below:

CASTOR 2.1.8 Upgrade Strategy at RAL
------------------------------------
We have now been given feedback from all our major experiments, who support our Upgrade Strategy. Experiments have indicated that they are happy with our current service and stability and (apart from ALICE) do not require the new features in 2.1.8. ALICE have offered to be a test for 2.1.8. However, as outlined in our Upgrade Strategy, it is highly unlikely that we will have time to upgrade to 2.1.8 if we are moving to the new machine room. Regardless of when we upgrade to 2.1.8, we are working to improve stability and resilience before data taking starts. Apart from the database upgrade, we intend to roll out a number of discrete changes such as upgrading the tape mounting library (VDQM), reconfiguring LSF and performing a final 2.1.7 upgrade that will allow us to turn on synchronization without risking a repetition of the crosstalk bug.
------------------------------------

b) Progress has been made by CERN and Oracle wrt the BIGID problem. We are waiting to see CERN's written assessment of the situation.
c) The Oracle database RAID array upgrade tests have now been carried out. We expect the upgrade to take 2 days. It is expected to be scheduled in mid May.
3) The FTS and LFC are planned to be upgraded to new hardware on 6th May. This will lead to a 1-day downtime of these services.
4) We have recently been suffering stability problems with the National top level BDII.
We suspect that this is load (VO/site) related and are following this up with those concerned.
5) The user survey is nearly complete. We still need to receive questionnaires back from two LHC experiments, after which we will provide a writeup. Feedback so far has been both interesting and useful.
6) A number of services (ce01, ce03, wms03, fts) are scheduled for a short reboot tomorrow to resolve power distribution issues.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ reported that there had been a new production record last week, and only one site was down. RAL had CASTOR issues last week, which were fixed by Friday. ATLAS are getting ready for STEP09 and a lot of user analysis for the Tier-2s. RJ reported that Liverpool were not being given CMS jobs. DC advised that it does take work to install requirements - Stuart Wakefield or Chris Brew could help NorthGrid. DC advised that they were doing what they could with the personnel they had.

SI-3 CMS weekly review & plans
-------------------------------

DC reported that SAM tests had been carried out on the 17th; things were reasonably quiet at the moment - a lot of staff were away at conferences.

SI-4 LHCb weekly review & plans
--------------------------------

GP reported problems with LHCb jobs last week, due to LHCb software installation causing jobs to fail. They were undertaking more Monte Carlo production.

SI-5 Production Manager's Report
---------------------------------

In absentia JC submitted the following report:

1) We discussed the new Camont needs at the UKI sites meeting last Thursday. The main concern was about the bandwidth requirements to the sites (depends on number of simultaneous jobs running at a site). Mark Slater said that they would be running tests on some leading Camont sites before rolling out the jobs wider. Generally sites seemed supportive of the activity but whether this translates into additional resources available remains to be seen.
Site networking groups are to be notified of expected changes in network usage patterns just before processing starts in earnest.

It was noted that RAL had concerns over security and bandwidth. It was agreed that JC should follow this up with dTeam - comments should be submitted to JC.

ACTION 344.3 JC to follow-up concerns with dTeam, over security and bandwidth in relation to Camont needs at UKI sites.

AS thought that Camont need to be asked whether they have considered their responses to questions or approaches about the work that they do - this needs to be formalised.

ACTION 344.4 JC to contact Camont and seek assurance from them that they have thought-through issues of security/reliability/commercial use of networks, plus other concerns raised about content.

2) John circulated an MoU for the reservoir project on 17th April. There was no PMB response. Is this an area that we should be actively pursuing? "RESERVOIR is the flagship of cloud computing research in FP7. The project aims to provide Resources and Services Virtualization to enable massive scale deployment and management of complex IT services across different administrative domains, IT platforms and geographies".

Re cloud computing, it was noted that an MoU existed between EGEE and the Reservoir Project. DB noted that this needs to be done via the experiments - the project is looking for test sites to see if it works within the EGEE computing model. This related to an email from JG on Friday 17/4 & Steve Newhouse/NA4. Could the PMB look at the email - JG can send to all sites. The issue should be raised at the next NA4 meeting (SS to do). [Done following the meeting].

SI-6 LCG Management Board Report
---------------------------------

JG reported that he had circulated a note re the CMS quarterly report - a lot of sites were appearing to be more reliable now. JG gave GDB feedback.

SI-7 Dissemination Report
--------------------------

SP reported that Neasan O'Neil had put up a news item about CHEP, GridPP22, and the IoP meeting. He was carrying out a general review of the GridPP website and updating it, also working on the 'help' pages. He had also put up a KE/EI webpage (relating to an action from the OC) which appears in the 'About' section.

REVIEW OF ACTIONS
=================

332.1 AS to provide a plan for the tape drives: The experiments would be producing new numbers by ~end of March which would enable this action to progress. O N G O I N G.

332.3 PC to pursue the issue of the network resilient link - providing installation costs and annual costs, and report-back to the PMB. PC had sent info round, and Robin Tasker would provide further info at GridPP 22 UCL. The issue would need to be referred back to the OC. Action DONE, but left open as placeholder.

336.1 JC to document procedure for ensuring black-listed sites are re-instated. JC noted that experiments have different ways of 'blacklisting' (re top level results & switches etc). O N G O I N G.

339.8 JC to follow-up VO Registration cards. JC reported that some VOs need to be decommissioned, there were also VOs at institute-level but these do not appear on the CIC operations portal. O N G O I N G.

341.1 AS to review resilience of services that may have to remain in the ATLAS building. DONE, item closed.

341.3 NG to consult/inform GRIDPP PMB on responses to EU questionnaires. O N G O I N G.

341.4 JG to establish new date for the wLCG-ORACLE meeting. DONE - meeting unlikely, Oracle seminar might be more possible.

341.5 JC to investigate how EGEE VOs request resources (relates to enabling more VOs in the UK). O N G O I N G.

341.7 SP to chase non-functioning of GridMon with Robin Tasker and report back. DONE, but here as placeholder - Mark Leese still to respond.

342.1 DB to circulate a draft letter from GridPP to Richard Wade in relation to the pending 2nd experiment support post at RAL [done following the meeting]. DONE, item closed.

342.2 SP to add-in a morning discussion time about GridPP4 to the F2F meeting; also to check if there were any issues arising from the Quarterly Reports that needed to be discussed. DONE, item closed.

343.1 GP to follow-up on the ILC experience to ensure issues are addressed. DONE, item closed.

343.2 GP to discuss new users coming to GridPP with NO (possibly also SB) regarding the GridPP website and making relevant changes that would assist them. DONE, item closed.

343.3 Re IMENSE & the iLexIR N-gram enquiry, DB to draft a response to Andy Parker supporting the request to use the Tier-1 before the LHC turns on. DONE, item closed.

343.4 JC to raise the N-gram issue at dTeam and update the VO ID card.

343.5 SP to contact Tier-2 P.I.'s to discuss Tier-2 grant timing. DONE - DB to respond to Janet Seed.

343.6 JC/JG to ensure that the joint taskforce meet and consider issues: work was clearly required on the table presented by AR - more columns needed to be added: NGS approach, GridPP approach, an outline map of funding, service overlaps. O N G O I N G.

343.7 AS to look into the CASTOR/ATLAS instance availability (in the light of the experiment red metrics). O N G O I N G.

343.8 TD to follow-up overdue milestone in relation to storage usage per user in a VO. TD noted that a report was awaited from Jens Jensen. O N G O I N G.

343.9 DB to contact Akram re the summer student call. DONE - calls will be considered.

ACTIONS AS AT 20.04.09
======================

332.1 AS to provide a plan for the tape drives: The experiments would be producing new numbers by ~end of March which would enable this action to progress.

332.3 PC to pursue the issue of the network resilient link - providing installation costs and annual costs, and report-back to the PMB.
PC had sent info round, and Robin Tasker would provide further info at GridPP 22 UCL. The issue would need to be referred back to the OC. Action DONE, but left open as placeholder.

336.1 JC to document procedure for ensuring black-listed sites are re-instated. JC noted that experiments have different ways of 'blacklisting' (re top level results & switches etc).

339.8 JC to follow-up VO Registration cards. JC reported that some VOs need to be decommissioned, there were also VOs at institute-level but these do not appear on the CIC operations portal.

341.3 NG to consult/inform GRIDPP PMB on responses to EU questionnaires.

341.5 JC to investigate how EGEE VOs request resources (relates to enabling more VOs in the UK).

341.7 SP to chase non-functioning of GridMon with Robin Tasker and report back. DONE, but here as placeholder - Mark Leese still to respond.

343.4 JC to raise the N-gram issue at dTeam and update the VO ID card.

343.6 JC/JG to ensure that the joint taskforce meet and consider issues: work was clearly required on the table presented by AR - more columns needed to be added: NGS approach, GridPP approach, an outline map of funding, service overlaps.

343.7 AS to look into the CASTOR/ATLAS instance availability (in the light of the experiment red metrics).

343.8 TD to follow-up overdue milestone in relation to storage usage per user in a VO.

344.1 DB to respond to Janet Seed's request, but highlight to her that procedure for grant release needs to be fairly speedy, and not take 9 months.

344.2 SL to raise the benchmarking issue at dTeam in relation to the DB Minutes.

344.3 JC to follow-up concerns with dTeam, over security and bandwidth in relation to Camont needs at UKI sites.

344.4 JC to contact Camont and seek assurance from them that they have thought-through issues of security/reliability/commercial use of networks, plus other concerns raised about content.

344.5 TD to raise the 'inactive category' action 282.8 with SL (cc RM) re adding to the quarterly report. This needs to be documented.

344.6 GP to contact the ILC community and give our support, highlighting that there may be possible contention issues re STEP09 and fairshares.

INACTIVE CATEGORY
=================

282.8 RM to monitor how R-GMA and networking issues impact on GridPP as matters progress. RM advised that this item should be moved to the 'inactive' category as it will develop over the coming months. RM discussed the issue with Steve Fisher and advised that support of R-GMA is required whilst APEL is dependent on it. RM reported that he has spoken to SF and there is currently no change to the R-GMA situation - process ongoing. RM advised that a small amount of effort was going into R-GMA on APEL but for the long term he wasn't sure. The item needed to be kept here for review from time to time, and required to be re-visited around Easter 2009.

ACTION 344.5 TD to raise the 'inactive category' action 282.8 with SL (cc RM) re adding to the quarterly report. This needs to be documented.

AOB
===

DB noted that GP had circulated an email request from the ILC community, but there was a conflict with STEP09. GP advised that their request would not conflict with other experiments - he would take the space from BaBar and others - there was no issue there.

ACTION 344.6 GP to contact the ILC community and give our support, highlighting that there may be possible contention issues re STEP09 and fairshares.

The meeting closed at 2.20 pm. The next PMB would take place on Monday 27 April at 12.55 pm.