GridPP PMB Minutes 371 (4.01.10)
=================================

Present: David Britton (Chair), Sarah Pearce, Andrew Sansum, Tony Doyle, Jeremy Coles, Pete Clarke, Roger Jones, Robin Middleton, John Gordon, David Kelsey, Steve Lloyd, Glenn Patrick.

Apologies: Dave Colling, Neil Geddes, Suzanne Scott

1. GridPP4 Proposal Outline (DB)
================================

DB had circulated an outline version of the GridPP4 proposal. DB noted that previous sub-titles (GridPP3: From production to exploitation) made it difficult to think of a new one for GridPP4, but one that fits is "towards a sustainable infrastructure". Pete wondered if we should be more direct and put something less ephemeral that does not indicate a beginning, middle and end, such as "Computing in the LHC era". DB and others liked this.

It is likely that whatever funding request is put forward will be cut. There will also be questions about what was cut already to get to this proposal (i.e. the original 20% cut). A draft of the financial model is being worked on and will be circulated ahead of the Glasgow PMB.

Need to understand the schedule for when we would pull out of ALICE (assumed no new ALICE hardware for GridPP4?).

Section 3 p2 (experiment motivation): Non-LHC experiments. Looking at 2009 accounting, about 14% was used by non-LHC VOs. Explicitly mentioned in Tony Medland's invitation to bid. Glenn's feedback from experiments generally fits within a proposed 10% allocation except that UKQCD have requested a lot of tape in the later years.

Section 4 (International Context): Neil asked about submission dates. DB: around 4th March. NG: We'll find out about who gets beyond first round for EU funding by end January but EU will not announce funding until about the end of March. DB thought risk of no EU funding was mitigated to some extent by inclusion of 2 FTEs to cover required work.

Section 5 (GridPP4 Structure): 5.1: Main thing to discuss is the work packages table.
Question about equal footing of Tier-1 (WP-A) and Tier-2s (WP-B). RJ thought the international obligations were the same for both Tiers, though RRB so far has focussed on T1s. (WP-B is 21-8 = 13 FTE). DB agreed to adjust wording.

WP-C: Ops team plus NGI team (security - policy and operations and GOCDB plus accounting). Propose T2 effort into ~8 site leaders (moving away from management terminology). FEC features for WP-B whereas C are more skilled. Could rearrange. NG: Words themselves are emotive - could do with a crib sheet to understand the details.

WP-D: Experiment support: Effectively 1.5 FTE per experiment plus an FTE for documentation and support. Security is in WP-C (operations); Networking is in WP-A (at Tier-1) and a tiny bit in WP-E (Grid Support). Security Policy work in EGI is not in Ops but it was in EGEE - OK to leave it all under operations in GridPP4.

Question about whether WP-D and WP-E should be merged into a single Support workpackage. Merging would help balance the work across packages. TM request was for the experiment support areas to be explicit. Need better terms than "support" and "operations". There are arguments for leaving it as it is and for merging! Consensus was towards leaving separate - revisit on 15th.

Page 6: Metrics are implied to become more important than milestones. Is this an appropriate statement? Something for Sarah to look at.

Section 6 (Experiment Requirements): Need something on service levels for non-LHC experiments (T1 and T2?). Why would we request resources at T1 and/or T2s for these experiments? Something for Glenn to consider.

6.3 Sensible LHCb numbers had still not been received. RJ recently sent new numbers and explanations for ATLAS. RJ's assumptions are for a 10% T1 and a 13.5% T2 in the UK. DB prefers the simpler argument that the UK is 12.5% of the Tier-1 country authors so needs to provide a 12.5% T1 and this then requires a 12.5% T2 to be consistent with the ATLAS computing model. Uncertainty on numbers is large anyway!
RJ: Experiments should all make the same assumptions concerning 2011 shutdown period etc.

p9: Are the UKQCD numbers too large to accommodate? Need to work within 10%. Can request their resources as a separate funding item. Decision can then be made by the review. This all needs to be explicit.

6.4 Cross-campus network links mean within a campus. Typically at the moment this is 1Gb. This is one of the reasons we need multiple Tier-2 sites (i.e. could not run with 3 T2s due to lack of bandwidth).

6.5 Perhaps remove Appendix B and insert directly.

Section 7 (meeting the expt requirements):

7.1 UK Tier-1: Perhaps 3-4 pages for the Tier-1. Definition of teams and where the manpower goes. Important that this is justified. Appendix A will be a description of each role. DB will provide rationale for the 24.6 FTE.

7.2 UK Tier-2s. Can say what we did in the past but then we need experience with analysis to decide on the hardware allocations. Was there any feedback from the CB after the last PMB F2F? The CB wanted to delay the assignment of manpower for as long as possible (agree in principle), but the case will be stronger if we can justify what each post is doing at each institute. Of course there needs to be flexibility in the system in case circumstances change. This all needs to be signed off by the CB. Perhaps one will take place after the Glasgow meeting. [Dave to send spreadsheet to Steve - Roger to confirm figures! Private until it goes to the CB (say 20th Jan)]. We propose something now and the CB needs to agree or not. ATLAS decides the sites and this dictates allocation of x manpower. Has to go to the ATLAS CB on Friday. [This CB is similar to GridPP CB but not identical]. What we said previously is that we can see the value of some small fractions at some sites. Guiding principles: want a number of well-founded sites; put small fractions where they have a big benefit and distribute the rest as most effective.
PMB F2F: Offline, people to email Dave whether they will be in Glasgow by Thursday evening. Dave also needs to circulate a mail about accommodation.

Section 7.3 will probably not be needed.

Section 8 (Grid Deployment and Operations): JC to develop.

Section 9 (Experiment Specific User Support): GP to develop.

Section 10 (Grid Support): Title may evolve. TD to develop. Note that the 4.2 is a reduction.

Section 11 (Management): DB had re-written - removed Deployment Board and replaced with Deployment Team. [JC to read]

Section 12 (E.I; K.E. etc.): SP to develop.

Section 13: This is a sort of catch-all section to address things that cut across the workpackages and things that were raised in the invitation to bid. TD requested to review (ok but query about EGI, ROSCOE etc.). Interaction between T1 and T2 activities. It is a check against the call, to make sure we have explicitly covered them. HPC and Cloud Computing for example may need to go here. Looking through the greyed-out comments, there are plenty of things to touch upon. Perhaps title should change.

Section 14: Resource request - SP and DB to iterate on costs. Generally simplify. Also need to look at strategy and contingency but also risks (the latter for GridPP4 and the upcoming OC). Drafts need to be circulated by early next week. Will have a PMB on Monday 11th to review progress.

Appendix A: post descriptions. To be available. Perhaps create a table and have a one line description. Number the posts (SP's spreadsheet). Develop some exemplars.

2. Weekly Notes
===============

LHC restart around 16th February. What are the consequences? CASTOR upgrade to 2.1.8? AS did not think that they would want to push for this upgrade. T1 team building a list of interventions (small). Gen instance CASTOR upgrade plan was for the end of March.

Storage workshop request. Plan now to hold with GridPP24. Develop with the NGS? Agreed to fund at the £2K level.

OC meeting we are working towards.
There is a need for some documentation - to be discussed further at the Glasgow meeting.

STANDING ITEMS
===============

SI-1 Tier-1 Manager's weekly report [AS]

Tier-1: Service ran well over Christmas.

Fabric
======

1) Lot 2 disk servers failed acceptance. Testing was encouraging before Christmas and further statistics have been accumulated over the holiday period; however, we have not yet received the updated test results from the supplier this morning.

2) New procurements have started.
- Disk order has been placed. Second tranche (April deliveries) now brought forward to February for one supplier and March for the second.
- CPU order placed.
- 2010 tape media purchase now placed for delivery in this FY.

3) Procurement is underway for an additional 4*1Gb/s second OPN link to CERN as resilient backup. Order rests with JANET - no ETA received yet.

4) The UPS room supply causes instability on our EMC RAID arrays. There will be a short "at risk" on 5th January to test switch over to the UPS bypass. This will allow us to also confirm that the noise on the current supply is caused by the interaction between the UPS and our capacitive load.
-> Restart will be smoother since taking services down rather than running all services at risk.

5) Low humidity problems are forcing us to run humidifiers in the machine room to maintain humidity at acceptable levels.

Service
=======

1) We ran successfully over the Christmas period with very few operational problems. Normal on-call team operated with reduced expectations on response time over "peak" holiday days. Additional routine inspections were carried out; one site visit had to be made to address a disk server fault. We had load-related problems with the WMSs - this appeared to trigger a bug and we are waiting for a fix.

2) There has been an occurrence of BIGIDs on the ATLAS instance (caught by a monitoring trigger we run). Team investigated over the holiday period and concluded we should continue to run. Investigations are underway.
This was the first occurrence since the original fault was patched.

- DB raised question about procurement levels. % hold back etc. AS to get deployable resource figures.
- New instance of the big-ID problem seen over Christmas! Perhaps some patching in Oracle broke the fix. To be followed up.

SI-2 ATLAS Weekly Review and Plans [RJ]

Some user access problems at QMUL. Squid config issues. Other than versioning issues, easy to deploy.

SI-3 CMS Weekly Review and Plans [DC]

DC was absent.

SI-4 LHCb Weekly Review and Plans [GP]

Disk server problem at T1 is being investigated.

SI-5 Production Manager's weekly report [JC]

Production report:

1) It was agreed that sites where there was not likely to be any intervention over the holiday should declare their site to be at risk. About half the sites did this (university buildings were locked over the period) and the rest ran on a best efforts basis.

2) No significant user or infrastructure problems were observed (there were a few outages such as a power failure for Bristol HPC and a serious air-con problem for Oxford). ECDF benefitted from the quietness of the period and a new fairshare agreement that means no charge for using an empty cluster. ... today however Lancaster has reported a water cooling problem that took out a disk server and approximately 1TB data may have been lost.

3) Initial feedback from ATLAS (Graeme) indicates that the T2s worked well (only Oxford is currently down). T1 reprocessing had a small problem that was fixed by adjusting the 3000M queue walltime limit. (Note: Before Christmas ATLAS decided not to subscribe data to sites that cannot provide 25TB in Data, MC and group tokens).

4) There has been a revised submission from the storage group for a workshop early this year (aims: review issues from first LHC data taking; training and discussion; develop consensus on emerging technologies). Costs have been brought down by co-locating with GridPP24 at RHUL.
The cost estimate for 20 participants is £2000 (£40 accommodation & £30 catering and facilities per person).

5) WLCG T2 Availability:reliability figures saw improvement in November over previous months. London (93%:92%); NorthGrid (95%:95%); ScotGrid (93%:93%) and SouthGrid (90%:81%). SouthGrid figures resulted from Birmingham (SE) and Oxford (various - inc. faulty network switch) problems.

To note:

A) Daily WLCG operations meetings resume today (https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsTemplate)

B) The Tier-1 UPS bypass test will take place on Tuesday 5th January. This will impact services like the LFC from 07:45-10:30.

C) Just before Christmas we began a security review of GridPP sites. It is intended that sites complete the first stage this month.

SI-6 LCG Management Board Report of Issues [JG/DB]

No meeting.

SI-7 Dissemination Report [SP]

Deferred to next meeting.

Review of actions was deferred to next meeting.

ACTIONS AS OF 23.11.09
======================

348.2 JC to investigate whether the decrease in job success rate metric in the last quarter is due to time-outs at busy sites or due to job-aborts due to incorrectly set-up environments. This was still in progress - DB noted that the next Quarterly Reports will help and possibly render the action redundant. SP asked that this remain open until the next Quarterly Reports. JC noted this is not possible to complete because the tests themselves were not valid for a period during Q3 and also the start of Q4.

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring. JC reported that the consultation happened. There were a few suggestions in the deployment team about how to progress in this area. It needs writing up and an implementation plan.

358.1 SP to work with the working group on the following issues in relation to GridPP/NGS convergence: 1. identify Institutes 2. identify manpower 3.
decide who is bidding for what - a draft transition plan would be made available by the end of the year; GridPP4 requirements would also be considered. SP was waiting on the Working Group to reply to her. SP reported there had been an email exchange and she had sent suggestions on how to move forward. JC had met with Andy Richards at RAL. One of the issues was uncertainty in relation to funding - SP needed more detail re resources and options for the future, also for EGEE-funded manpower at present and what we have signed up to in NGI. She was awaiting a response. DB noted that NG was trying to understand the proposals and the funding fractions. In defining GridPP4 we needed to define these posts and responsibilities. JC noted that JG had suggested going through the EGI proposal document for info.

359.4 JC to follow up dTeam actions from the DB, as follows:
---------------------------
05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda).
---------------------------
JC would check the situation with Graeme Stewart (who was currently on annual leave). JC followed up with Graeme and the other experiments. A test was started but this area has been deemed low priority and further progress is not expected for some time. ATLAS see no issues with contention. LHCb are not intending to pursue anything in this area. A CMS discussion has started but again it does not appear to be urgent. If the experiments are not pushing this internally then there is nothing for the deployment team to follow up!

361.3 JC and AS to check Tier-1 and Tier-2 gstat2 results (in relation to SL5 having been discussed at the GDB). JC reported that he had checked this, but information from AS was still awaited. JC noted that there were issues with the results. JC noted that the action as worded was done weeks ago. It is an ongoing deployment team action on the Tier-2 coordinators to get the information corrected.
JC action done. AS still to respond.

366.2 RJ to provide ATLAS HW requirements for 2011-15. RJ & DC had a preliminary discussion - they need to agree a common profile, even if it is flat cash.

366.3 DC to provide CMS HW requirements for 2011-15. In progress.

366.4 GP to provide LHCb HW requirements for 2011-15. GP had started this. DB noted he needed the numbers for hardware costings and needed something soon to begin work. The deadline was 2 weeks for prelim. numbers. AS would look at them as well.

366.5 SL/DB to estimate what fraction of STFC funding goes to non-LHC groups. What about the theory side?

366.6 GP to invite input from Other Experiments.

366.7 DB (in consultation with AS) to provide HW-cost estimates for 2011-2015. DB was awaiting inputs.

366.8 AS to confirm that the Tier-1 proposes to use Tape-based storage in the period 2011-2015. AS noted this depended on money costs. DB advised this related to long-term plans and power capacity. Physical footprint space? Alternatives? Early action on AS required. AS had sent tech questions round the team and would forward inputs when available. DC noted to the meeting that today was the 16th Nov - only 4 weeks remained until Imperial, by which time we needed to have made extensive progress.

366.9 RJ to confirm that ATLAS supports the use of Tape storage in the period 2011-2015. RJ noted they had a belief in the archival work but the cost was to be provided by the provider. Tape would have a front-end staging system. DB asked whether they might want to move to another model? We should not assume that tape will do. A statement was required.

366.10 DC to confirm that CMS supports the use of Tape storage in the period 2011-2015.

366.11 GP to confirm that LHCb supports the use of Tape storage in the period 2011-2015.

366.12 SP to liaise with AS to establish non-capacity costs. SP advised that discussions had started. DB noted a long-term question about the model.
366.13 SP to request and collect first cost estimates of posts for GridPP4. FEC and non-FEC posts need to be costed. The Tier-1 posts should be costed as accurately as possible as soon as possible since there is a large lever arm here.

366.14 DK to provide first estimate of average RAL post cost on the basis of the current distribution of posts/grades. Clearly this will need refinement once we understand the final mix better. DK had already started this and would have estimates later this week.

367.1 ALL: to send email responses/thoughts to DB, or to the list, on NGI issues discussed.

367.2 RM to fill in the grey boxes on DB's UK NGI diagram of a minimal NGI, as to what NGS would be doing in the areas listed.

367.3 JG to contact Ian Bird directly, immediately, and ask for a clear formal statement in relation to multi-user pilot-jobs by the experiments. This formal statement was required immediately - we could not wait for this issue to be brought up at the next MB.

367.5 DB to send formal information round the community re multi-user pilot-jobs, once clear statements had been received from Ian Bird (via JG) and the experiments (via JC).

367.6 RJ to submit a proposal to the PMB for funding assistance for the next ATLAS tutorial.

368.1 DB to circulate an initial informal paper on NGI Interface in advance of the upcoming F2F in order to form a basis for further discussion.

368.2 DB to circulate an initial informal paper on Tier-2 Structure in advance of the upcoming F2F in order to form a basis for further discussion.

368.3 SP to circulate an initial informal paper on Project Management in GridPP4 in advance of the upcoming F2F in order to form a basis for further discussion.

368.4 SP to circulate an initial informal paper on Economic Impact, Knowledge Exchange and Dissemination in advance of the upcoming F2F in order to form a basis for further discussion.
368.5 DB/AS to circulate an initial informal paper on Hardware Requirements in advance of the upcoming F2F in order to form a basis for further discussion.

368.6 AS/DB to circulate an initial informal paper on Tier-1 Role and Requirements in advance of the upcoming F2F in order to form a basis for further discussion.

368.7 TD to circulate an initial informal paper on Technical (Middleware) Support in advance of the upcoming F2F in order to form a basis for further discussion.

368.8 JC to circulate an initial informal paper on Deployment Support in advance of the upcoming F2F in order to form a basis for further discussion.

368.9 GP to circulate an initial informal paper on Experiment Support in advance of the upcoming F2F in order to form a basis for further discussion.

368.10 TD to circulate an initial informal paper on Cloud Computing in advance of the upcoming F2F in order to form a basis for further discussion.

368.11 SP to circulate an initial informal paper on Financial Planning in advance of the upcoming F2F in order to form a basis for further discussion.

368.12 ALL: comments on Tier-2 structure to be sent to DB.

368.13 ALL: comments on Project Management to be sent to SP.

368.14 AS to iterate with Gareth in relation to actions required for downtime communications.

368.15 GP, DC & RJ to provide experiment input to DB/AS for 'Hardware Requirements' initial document for discussion at Imperial, which DB/AS would prepare.