GridPP PMB Minutes 369 (30.11.09)
=================================

Present: David Britton (Chair and Minutes), Sarah Pearce, Andrew Sansum, Tony Doyle, Jeremy Coles, Pete Clarke, Roger Jones, Robin Middleton, John Gordon, David Kelsey, Tony Cass, Dave Colling.

Apologies: Sarah Pearce, Glenn Patrick, Steve Lloyd, Neil Geddes, Suzanne Scott

DB noted that the focus of this meeting should be to review the status of the long list of outstanding actions.

1. Tier-1 Intervention for UPS Test
=====================================

AS had circulated an email concerning the scheduling of a UPS bypass test, which is required in order to allow eventual resolution of the UPS noise problem. Originally to be done under an "at risk" declaration, the test had twice been scheduled and twice cancelled as the LHC made quicker than expected progress. At this point AS stated that, considering responsibilities to other user communities, the UPS test could not be scheduled and cancelled again. In addition, the inability to define the level of risk had led them to the conclusion that the test should be done under a "downtime" declaration, to allow the orderly shutdown of the CASTOR databases in advance and so guard against a prolonged recovery process should something go wrong.

AS suggested two possible dates, Mon 7th December or early January. RJ and GP (via email) expressed a strong preference by ATLAS and LHCb for the January date, anticipating that there was some possibility of high-energy collisions by the Dec 7th date. The PMB concurred. AS to try to schedule the test for as early as possible in the week of January 4th.

2. Week's Notes
================

Status of Tier-2 HW grants - DB would pick up by email as neither SL nor SP was present.

Multi-User-Pilot-Jobs - JG had requested a formal statement from Ian Bird. The statement had gone through several iterations with the wLCG MB and JG was waiting for a final version. He will circulate the statement, with a UK wrapper, to UKHEPGRID soon.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS reported as follows:

Fabric
======

1) Lot 2 of disk servers have failed acceptance. Supplier has two configurations that have been found to work under testing. Meeting today with supplier to agree how to proceed. If re-deployment is successful we would expect hardware to emerge from certification in around mid-February (possibly earlier if re-deployment is a low-impact intervention).

2) New procurements have started.
- Disk order has been placed.
- CPU tender is closed and evaluation is complete. Standstill ends this week.

3) Procurement is underway for an additional 4*1Gb/s second OPN link to CERN as a resilient backup. There are difficulties progressing this through procurement and on to JANET. Robin is investigating.

4) The UPS room supply causes instability on our EMC RAID arrays. We are working on schedules that lead to a short downtime on 5th January to move services back to the EMC units, either on clean LPD room power or on fixed UPS room power. Need to agree schedule of work.

Service
=======

1) SAM availability for the OPS VO was 100%. It has been a quiet week for operations. Weekly production report is at:
https://www.gridpp.ac.uk/wiki/RAL_Tier1_Experiments_Liaison_Meeting_Operations_Reports

2) The problem where batch jobs sometimes land on the wrong node appears to be resolved following a server restart.

3) Load related problems were experienced on the ATLAS CASTOR instance on the evening of the 23rd. FTS channel settings were reduced in order to manage the problem. The SRB database service was moved to a node with higher memory and this should have resolved the problem. We are now waiting for a period of heavy use again to assess the change and allow us to retune the channel settings to what is necessary to meet ATLAS requirements.

4) A problem has been identified with the CASTOR Information Provider (CIP) that severely impacts T2K.
Owing to current LHC operations, we are averse to carrying out an emergency change that might impact other VOs. We are investigating resolving the problem via lower risk means.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ reported on a frontier server problem: advice from BNL and others had led to a new configuration to be deployed.

SI-3 CMS weekly review & plans
-------------------------------

DC had nothing to report.

SI-4 LHCb weekly review & plans
--------------------------------

GP reported as follows: Very low level of running. So, no major problems seen from site point of view.

1. T1 diskserver failure earlier this morning in lhcbDst service class. Awaiting further information.

2. LHCb application problems still under investigation.

SI-5 Production Manager's Report
---------------------------------

Deployment wise the last week has been relatively quiet.

1) Following a brief period of data taking last week there were no major problems reported at any GridPP sites. We are asking sites to share specific local issues that they see in order to better prepare other sites (e.g. additional installation requests). It was noted that on Monday queues at popular sites (e.g. BNL) were over-subscribed for ATLAS. The deployment team will follow up on other observations at the dteam/sites meeting tomorrow. Let me know if there are any questions you want asked!

2) The situation with kernel updates across GridPP sites is good. (One concern to be noted in the meeting). EGEE is moving to suspending unpatched sites via OSCT/ROC agreement.

3) There have been a few site queries (from those who need the money this financial year) about the timeline for the second tranche of GridPP3 hardware money. Can we now lay out the timeline for all sites to see?

4) The storage group are proposing to hold a workshop in Edinburgh to review the status of SRMs & SEs (current problems, solutions and future directions).
Will funding support be available from GridPP if this workshop goes ahead?

For information:

A) There is a GDB this Wednesday: http://indico.cern.ch/conferenceDisplay.py?confId=64669. Topics include LHC Status, Pilot jobs, Security, Middleware update, EGI update and Batch Systems.

B) There was a CA TAG meeting last week.

C) LHCb have updated their VO card to request 150GB of software space at Tier-2s.

SI-6 LCG Management Board Report
---------------------------------

The MB were asked by the OPN community to consider three areas that broadened their mandate slightly:

1) to consider all T1-T1 traffic as an OPN issue;
2) to suggest that the OPN community look at T1-T2 traffic use-cases;
3) that the OPN should investigate SLAs that define the bandwidth rather than a simple up/down metric for the links.

The PMB noted that it was unclear between whom this SLA would exist since each link had two ends. AS would talk with Robin Tasker. DB would send AS the slide from the MB.

The other main issue at the MB was the Multi-User-Pilot-Jobs (discussed above).

REVIEW OF ACTIONS
=================

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly setup environments. This was still in progress - DB noted that the next Quarterly Reports will help and possibly render the action redundant. SP asked that this remain open until the next Quarterly Reports. DB moved this to the INACTIVE CATEGORY.

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring. This had been done. Action CLOSED.

358.1 SP to work with the working group on the following issues in relation to GridPP/NGS convergence: 1. identify Institutes 2. identify manpower 3. decide who is bidding for what - a draft transition plan would be made available by the end of the year; GridPP4 requirements would also be considered.
SP was waiting on the Working Group to reply to her. SP reported there had been an email exchange and she had sent suggestions on how to move forward. JC had met with Andy Richards at RAL. One of the issues was uncertainty in relation to funding - SP needed more detail re resources and options for the future, also for EGEE-funded manpower at present and what we have signed up to in NGI. She was awaiting a response. DB noted that NG was trying to understand the proposals and the funding fractions. In defining GridPP4 we needed to define these posts and responsibilities. JC noted that JG had suggested going through the EGI proposal document for info. This was ONGOING

359.4 JC to follow up dTeam actions from the DB, as follows:
---------------------------
05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda).
---------------------------
JC would check the situation with Graeme Stewart (who was currently on annual leave). Not a priority; this was moved to the INACTIVE CATEGORY

361.3 JC and AS to check Tier-1 and Tier-2 gstat2 results (in relation to SL5 having been discussed at the GDB). JC reported that he had checked this, but information from AS was still awaited. JC noted that there were issues with the results. JC noted that the action as worded was done weeks ago. It is an ongoing deployment team action on the Tier-2 coordinators to get the information corrected. JC action done. AS still to respond. This was DONE

366.2 RJ to provide ATLAS HW requirements for 2011-15. RJ & DC had a preliminary discussion - they need to agree a common profile, even if it is flat cash. This was ONGOING

366.3 DC to provide CMS HW requirements for 2011-15. In progress. This was ONGOING

366.4 GP to provide LHCb HW requirements for 2011-15. GP had started this. DB noted he needed the numbers for hardware costings and needed something soon to begin work. The deadline was 2 weeks for preliminary numbers.
AS would look at them as well. GP had provided some preliminary numbers. ACTION DONE.

366.5 SL/DB to estimate what fraction of STFC funding goes to non-LHC groups. What about the theory side? ONGOING

366.6 GP to invite input from Other Experiments. GP had done this and was compiling input. ACTION DONE

366.7 DB (in consultation with AS) to provide HW-cost estimates for 2011 - 2015. DB was awaiting inputs. ONGOING

366.8 AS to confirm that the Tier-1 proposes to use Tape-based storage in the period 2011 - 2015. AS noted this depended on money costs. DB advised this related to long-term plans and power capacity. Physical footprint space? Alternatives? Early action on AS required. AS had sent tech questions round the team and would forward inputs when available. This is ONGOING - will be closed when capacity requirements known.

366.9 RJ to confirm that ATLAS supports the use of Tape storage in the period 2011-2015. RJ noted they had a belief in the archival work but the cost was to be provided by the provider. Tape would have a front-end staging system. DB asked whether they might want to move to another model. We should not assume that tape will do. A statement was required. RJ reiterated that ATLAS made no requirements or assumptions about the technology used. ACTION CLOSED.

366.10 DC to confirm that CMS supports the use of Tape storage in the period 2011-2015. DC noted that CMS agreed with ATLAS. ACTION CLOSED

366.11 GP to confirm that LHCb supports the use of Tape storage in the period 2011-2015. ONGOING

366.12 SP to liaise with AS to establish non-capacity costs. SP advised that discussions had started. DB noted a long-term question about the model. ONGOING

366.13: SP to request and collect first cost estimates of posts for GridPP4. FEC and non-FEC posts need to be costed. The Tier-1 posts should be costed as accurately as possible as soon as possible since there is a large lever arm here. DB was iterating with SP using GU as an example.
ACTION CLOSED

366.14 DK to provide first estimate of average RAL post cost on the basis of the current distribution of posts/grades. Clearly this will need refinement once we understand the final mix better. DK had already started this and would have estimates later this week. ONGOING

367.1 ALL: to send email responses/thoughts to DB, or to the list, on NGI issues discussed. ACTION CLOSED.

367.2 RM to fill-in the grey boxes on DB's UK NGI diagram of a minimal NGI, as to what NGS would be doing in the areas listed. RM noted that he was awaiting return of Andy Richards. ONGOING

367.3 JG to contact Ian Bird directly, immediately, and ask for a clear formal statement in relation to multi-user pilot-jobs by the experiments. This formal statement was required immediately - we could not wait for this issue to be brought up at the next MB. ACTION DONE.

367.5 JG to send formal information round the community re multi-user pilot-jobs, once clear statements had been received from Ian Bird (via JG) and the experiments (via JC). ONGOING

367.6 RJ to submit a proposal to the PMB for funding assistance for the next ATLAS tutorial. ACTION DONE

368.1 DB to circulate an initial informal paper on NGI Interface in advance of the upcoming F2F in order to form a basis for further discussion. ACTION DONE

368.2 DB to circulate an initial informal paper on Tier-2 Structure in advance of the upcoming F2F in order to form a basis for further discussion. DB noted that this had been done but there was now a new action:

369.1: DB to draft a paper on Tier-2 structure for CB.

368.3 SP to circulate an initial informal paper on Project Management in GridPP4 in advance of the upcoming F2F in order to form a basis for further discussion. DONE

368.4 SP to circulate an initial informal paper on Economic Impact, Knowledge Exchange and Dissemination in advance of the upcoming F2F in order to form a basis for further discussion.
ONGOING

368.5 DB/AS to circulate an initial informal paper on Hardware Requirements in advance of the upcoming F2F in order to form a basis for further discussion. ONGOING

368.6 AS/DB to circulate an initial informal paper on Tier-1 Role and Requirements in advance of the upcoming F2F in order to form a basis for further discussion. ONGOING

368.7 TD to circulate an initial informal paper on Technical (Middleware) Support in advance of the upcoming F2F in order to form a basis for further discussion. ONGOING

The PMB discussed the various areas in this space:

Manchester had declined to include GridSite in EMI so it was clearly not a priority and there was no pressure for matching funding. The rest of the security work in GridPP3 would fall under the NGI (operational security and security policy) and, as previously discussed, GridPP4 would buy into this at the level of funding about 1 FTE in the NGI security team. There was a discussion about the security vulnerabilities work with different opinions as to whether this should lie more in the GridPP or NGI space.

On Networks, the PMB accepted that the GridMon type work was finished. There would be a need for a Tier-1 network (0.5 FTE) post that covered the OPN link. AS would discuss with Robin Tasker.

On RGMA - this work was being phased out in year-2 of GridPP3 and would not naturally restart in GridPP4.

On WMS - this work had only been funded in the first year of GridPP3 and would not re-start in GridPP4.

TD raised the possibility of future technologies - extending from the use of multi-core processors to many-core processors and then possibly to GPUs. After some discussion, it was felt that the many-core issues would need to be tackled as an operational problem in GridPP4 but that GPU exploitation probably sat outside the GridPP4 mandate.

TD noted that data-management and data-storage were the key areas of support that GridPP4 would need to continue to provide. DC asked what the difference was.
TD noted that there was a large overlap but that data-management was typically higher-level, relating to the transfer of data over the network (e.g. by FTS), and reaching down to the actual storage layers. Data-storage was the lower level stuff that started at the data storage level and reached upwards into the data management space. Something like 3 FTE would be needed to create a critical mass in this area.

368.8 JC to circulate an initial informal paper on Deployment Support in advance of the upcoming F2F in order to form a basis for further discussion. ONGOING

368.9 GP to circulate an initial informal paper on Experiment Support in advance of the upcoming F2F in order to form a basis for further discussion. ONGOING

368.10 TD to circulate an initial informal paper on Cloud Computing in advance of the upcoming F2F in order to form a basis for further discussion.

It was decided that the GridPP4 proposal should include a request for some seed money to investigate commercial cloud computing offerings from a UK/LHC perspective. There were quite a number of such studies in existence, which tended to show that the cost of getting data in/out of the cloud was prohibitively expensive. Nevertheless, circumstances will have changed in 2 years' time and the UK-specific issues had not been addressed. It was envisaged that something of the order of £100k should be top-sliced from the Tier-2 hardware budget; about £10k should be used in a trial and £90k reserved for use later in the project in order to smooth out peaks in demand. It was anticipated that, once the real cost/convenience of commercial offerings had been understood, then Tier-2s could compete for the balance of the funding on a commercial basis.

368.11 SP to circulate an initial informal paper on Financial Planning in advance of the upcoming F2F in order to form a basis for further discussion. ONGOING

368.12 ALL: comments on Tier-2 structure to be sent to DB.
DONE

368.13 ALL: comments on Project Management to be sent to SP. DONE

368.14 AS to iterate with Gareth in relation to actions required for downtime communications. DONE

368.15 GP, DC & RJ to provide experiment input to DB/AS for 'Hardware Requirements' initial document for discussion at Imperial, which DB/AS would prepare. REPEATS earlier actions so DELETED

ACTION AS OF 30.11.2009
-------------------------

358.1 SP to work with the working group on the following issues in relation to GridPP/NGS convergence: 1. identify Institutes 2. identify manpower 3. decide who is bidding for what - a draft transition plan would be made available by the end of the year; GridPP4 requirements would also be considered. SP was waiting on the Working Group to reply to her.

366.2 RJ to provide ATLAS HW requirements for 2011-15. RJ & DC had a preliminary discussion - they need to agree common profile, even if it is flat cash.

366.3 DC to provide CMS HW requirements for 2011-15. In progress.

366.5 SL/DB to estimate what fraction of STFC funding goes to non-LHC groups. What about the theory side?

366.7 DB (in consultation with AS) to provide HW-cost estimates for 2011 - 2015. DB was awaiting inputs.

366.8 AS to confirm that the Tier-1 proposes to use Tape-based storage in the period 2011 - 2015. AS noted this depended on money costs. DB advised this related to long-term plans and power capacity. Physical footprint space? Alternatives? Early action on AS required. AS had sent tech questions round the team and would forward inputs when available. This is ONGOING - will be closed when capacity requirements known.

366.11 GP to confirm that LHCb supports the use of Tape storage in the period 2011-2015.

366.12 SP to liaise with AS to establish non-capacity costs. SP advised that discussions had started. DB noted a long-term question about the model.

366.14 DK to provide first estimate of average RAL post cost on the basis of the current distribution of posts/grades.
Clearly this will need refinement once we understand the final mix better. DK had already started this and would have estimates later this week.

367.2 RM to fill-in the grey boxes on DB's UK NGI diagram of a minimal NGI, as to what NGS would be doing in the areas listed. RM noted that he was awaiting return of Andy Richards.

367.5 JG to send formal information round the community re multi-user pilot-jobs, once clear statements had been received from Ian Bird (via JG) and the experiments (via JC).

368.4 SP to circulate an initial informal paper on Economic Impact, Knowledge Exchange and Dissemination in advance of the upcoming F2F in order to form a basis for further discussion.

368.5 DB/AS to circulate an initial informal paper on Hardware Requirements in advance of the upcoming F2F in order to form a basis for further discussion.

368.6 AS/DB to circulate an initial informal paper on Tier-1 Role and Requirements in advance of the upcoming F2F in order to form a basis for further discussion.

368.7 TD to circulate an initial informal paper on Technical (Middleware) Support in advance of the upcoming F2F in order to form a basis for further discussion.

368.8 JC to circulate an initial informal paper on Deployment Support in advance of the upcoming F2F in order to form a basis for further discussion.

368.9 GP to circulate an initial informal paper on Experiment Support in advance of the upcoming F2F in order to form a basis for further discussion.

368.10 TD to circulate an initial informal paper on Cloud Computing in advance of the upcoming F2F in order to form a basis for further discussion.

368.11 SP to circulate an initial informal paper on Financial Planning in advance of the upcoming F2F in order to form a basis for further discussion.

369.1: DB to draft a paper on Tier-2 structure for CB.

---------------------------------------------------------------------
INACTIVE ACTIONS: Temporarily suspended because nothing can be done.
---------------------------------------------------------------------

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly setup environments. This was still in progress - DB noted that the next Quarterly Reports will help and possibly render the action redundant. SP asked that this remain open until the next Quarterly Reports.

359.4 JC to follow up dTeam actions from the DB, as follows:
---------------------------
05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda).
---------------------------
JC would check the situation with Graeme Stewart (who was currently on annual leave).

There would be no PMB next Monday (7th Dec) as it clashed for many people with AHM and other things. The next PMB would be the F2F taking place on Thursday 10th Dec at 9:30 am at Imperial.