CB Meeting 13 - 11/12/09
========================

Present: Peter Watkins (Birmingham), Dave Newbold (Bristol), Peter Hobson (Brunel), Ian Bird (CERN), Andy Parker (Cambridge), Nigel Glover (Durham), Phil Clark (Edinburgh), Tony Doyle (Glasgow), Jordan Nash (Imperial), Roger Jones (Lancaster & ATLAS), David Hutchcroft (Liverpool), Roger Barlow (Manchester), Ian McArther (Oxford), Steve Lloyd (QMUL & Chairman), Simon George (Royal Holloway), Norman McCubbin (RAL), Dan Tovey (Sheffield), Antonella De Santo (Sussex), Jon Butterworth (UCL), Dave Colling (CMS), Glenn Patrick (LHCb), Dave Britton (Project Leader), Sarah Pearce (Project Manager).

Apologies: Chris Allton (Swansea)

1. Approval of minutes from meeting 12
======================================

The minutes were approved.

2. Introduction - SL
====================

SL noted that the main business is GridPP4 preparation. We do, however, need to sign off on the GridPP3 hardware allocations. The algorithm for the first tranche was based on CPU measured and disk installed, effectively measuring Monte Carlo performance. Sites with multiple experiments gained an inappropriate share. A renormalisation corrected this and was generally agreed. In addition, ATLAS asked to redistribute CPU and disk, with disk to be placed at well-performing analysis sites. This was not quite cost neutral, since the ratio of ATLAS disk:CPU is not the same as that globally (ATLAS, CMS and LHCb are different). This was somewhat contentious, but RJ talked directly to most ATLAS Group Leaders. Since the proposed outcome was circulated to the CB (10 Nov), ATLAS made a minor change affecting one site. This needs to be signed off, since unallocated funding in the present climate is a problem.
RJ adds that conversations within ATLAS were more about the general process of GridPP metrics, i.e. negative feedback for small sites, not about the introduction of analysis sites.
IM (PP Brian Foster) notes that account should be taken of value for money for GridPP4, and that this should also be revisited for the GridPP3 allocations. This should incorporate manpower effectiveness.
PH notes that manpower effort is recorded in the quarterly reporting.
AP notes that GridPP3 allocations should not be changed, but the metrics should be revisited for GridPP4.
DC notes that many people were involved in establishing the GridPP3 allocations and these shouldn't now be revisited.
RB notes that Manchester accepts the unfavourable outcome under protest, so as to move the process forward. DT agrees.
SL concludes that the allocations for GridPP3 are approved.

3. GridPP4 Planning - DB
========================

DB prepared a report relating to the response to the GridPP4 call. The invitation to submit was consistent with the current mandate, over a 4-year period from April 2011 - March 2015, subject to overall STFC funding constraints. The longer project period was considered helpful. GridPP3 ends in March 2011, so the timetable required an outcome to be known by October 2010, which set the timescale for the response. An internal draft is needed by 11th January 2010, prior to the PMB F2F meeting on 15th January. The first 4 weeks were used to prepare discussion documents - some are on the CB web page.

1. NGI interface - a body that needs to be formed in the context of EGI, built upon the existing NGS and GridPP.
2. Tier-2
3. Project management
4. KE and EI
5. Hardware requirements up to 2015 - difficult to do, but now being approached
6. Role of the Tier-1 and its delivery to the experiments
7. Technical support
8. Deployment support
9. Cloud computing
10. Financial planning

The Tier-2 structure was noted as of primary importance to the CB and hence this was being raised.

4. Tier-2 Planning - DB
=======================

DB noted that arguments for the widest possible engagement were strong at conception. This is more difficult going forward, especially in the current funding environment.
ATLAS and CMS input provided a strong steer that the most important requirement is a smaller number of well-run sites. This drives the GridPP approach. GridPP is, however, not proposing radical overall change, but it is proposing to provide a minimum of two people, primarily to enable a reasonable response in cases of absence. The proposal is to dedicate 15 FTEs to core sites and 6 FTEs to engage wider support, to balance the two effects. The request needs to be set in the next few weeks. Value for money should be a key indicator. Analysis is really the focus, and a sensible set of metrics is needed to enable a sensible hardware distribution. DB requests approval for the "core site" concept. There will, however, be hardware funding for other sites.
DN comments that the requirement of the experiments is to engage the maximum number of people. GridPP has to do its utmost to engage people at all points. This needs to be seen in the context of system management losses in the rolling grant.
DC responds that the CMS view is that a minimum of two people are needed to run an analysis site. Within that there is a plan, e.g. Bristol will access RAL-PPD based resources.
DN notes that GridPP has been historically weak in providing cross-site support. DN and DC agree that cross-site support can and should be improved.
AP notes the best impact is driven by the experiments' request to drive a small number of sites. However, AP also adds that it would be sensible to plan for strength in depth. Placing manpower into experiment-based user support is the best way to address support issues.
SL notes that this could have happened already.
AP notes that "core sites" should fully understand the role that is taken on.
DB notes that the "all eggs in one basket" concern is being addressed and will be used in arguments to PPRP. It is up to experiments to define the appropriate minimum size for manpower support. There is still flexibility on the exact numbers.
The basic argument is to "up the game" and the need to respond on this basis.
RJ notes that within ATLAS the algorithmic answer for the UK is 10.9 FTE, arrived at independently (as of today) in the international context. (This may be recognised as an M&O contribution.)
AP notes there will be grant cuts and reductions in GridPP4. This is survival-mode planning. Can we bring small sites up to speed?
RJ responds that this is possible, but more robust arguments are needed for the bid. The current ATLAS production team is overloaded and too small to manage a large number of sites. The majority of effort goes into the small, generally low-performing sites. It would be a strategic mistake to allow this to continue.
NM notes that nothing in GridPP4 should preclude the AP proposal to enable small sites to grow. A relatively small amount of effort will leverage these additional resources. This would, however, need to be quantified in the proposal, based on individual site arguments.
JN notes that physics RAs should work on physics and not be re-directed towards maintaining small sites.
AP notes, however, that on a well set-up site problems are infrequent; the problems are not just system management, though, as user intervention is also required.
DC concurs that we will fail if effort is not combined.
SL agrees, but this is additional and beyond the scope of the GridPP4 Tier-2 sysman proposal. SL adds that middleware will be reduced further in GridPP4 and the Tier-2 request should be seen in this context.
JB notes that blurring between rolling grant support and GridPP is the wrong approach in the proposal.
DN, however, notes that with no sysman support GridPP needs to provide active support to non-core sites.
SL responds that the problems are usually in the experiment stack.
There is widespread agreement that STFC are undermining overall delivery by requiring cuts to be implemented in parallel to the GridPP4 submission, which makes sensible overall planning impossible.
RJ notes, however, that many of the issues are associated with Tier-3 - not part of a GridPP proposal.
AP cites the example of Cambridge where, having previously invested, the outcome through GridPP4 will need to be explained higher up within the University.
DT notes that, from the Sheffield perspective, a small, well-performing site is not a lot of effort to maintain.
SL and RJ respond that an attempt is being made in the GridPP hardware allocations to keep such sites running.
SG asks when the manpower ranking will appear.
DB notes that this is planned for January, using the experiments' inputs.
AP asserts that a dynamic site ranking is needed. If later performance proves small sites can be productive then future manpower can be allocated there.
RJ agrees and notes that this is the strategy of the amber site.
DB notes that "core site" will not be the public label used in the proposal but the concept should be clear. The aim is to add strength to the basic arguments discussed here.
AP agrees that "core site" is not the correct terminology. Going forward, AP asks what the resource optimisation will be in the second phase - what is the right algorithm?
DB responds that a more responsive algorithm is fine for the hardware allocations.
AP asks if it is really hard to add/define "Value For Money".
SL responds that the current algorithm is largely MC-based. This needs to become analysis-based.
DH observes that the discussion thus far has been mainly ATLAS, a little CMS, and no LHCb input.
DB responds that the hardware weights are 60% ATLAS, 30% CMS and 10% for everyone else (inc. LHCb) at the Tier-2s.
GP notes that LHCb are happy with multiple small sites for MC production, provided this can be handled efficiently.
PW asks if the JES forms really need to be completed in January.
DC notes that ATLAS Hammercloud tests already provide analysis-based evidence for the manpower allocations.
DB understands the issue, but there is a phase change in GridPP4, driven by data from LHC and the funding crisis, which forces decisions to be made earlier than desirable.
JN notes that a stronger case is needed if a manpower threshold is established.
JB notes that GridPP needs to stand as a separate project distinct from RG.
PW restates that it is a challenging job to complete this by January.
JB notes that one JES form can be submitted on the advice of STFC. AP cites the example of the LHCb upgrade proposal.
DB notes that he is aware, but the allocations would still need to be made even if the individual forms per se are not required. DB will, however, re-discuss with Tony Medland, exploring the possibility of enabling the actual site allocations to be made prior to the PPRP presentations in April. DB noted, however, that the strongest possible arguments will need to be presented at that stage.
AP notes the need for consensus in the CB.
NM states that he has heard no good argument for simple division - a safe, credible case needs to be put forward and the approach is clear. We must try to avoid the AP problem of undue haste leading to resources lost forever.
RJ notes that for ATLAS the Hammercloud tests will form the basis of the metrics, i.e. the ADC tests and their adaptations.
SG refers to the current 3 specific tests for ATLAS.
RJ responds that the current evaluation method uses more than the information from the 3 tests. In the longer term RJ proposes to track ADC (ATLAS-wide) methods.
SL notes that the tests have evolved over 7 years, leading to a reasonably robust hardware algorithm.
DB recognises that the hardware is constrained by the analysis or MC role of the sites. There is more time available to discuss and agree how this is done, but it is under control.
SG wants flexibility to use funds for manpower rather than hardware.
DB notes that previous issues reflect bureaucracy - GridPP has no problem with this approach.
JB adds that the service aspect needs to be explored with universities.
SL cites the cases of UCL, Edinburgh and Bristol.
SL concludes that the manpower model is OK provided it is left as long as possible (not constrained by JES). An experiment-led manpower algorithm will be adopted.
DB requests that this principle be agreed. There was general consensus on this approach.

Next CB date: after PMB on 15th January, but in advance of the 28th January submission deadline.

Action: SL to set up a Jan 18-22 doodle poll. Done.