GridPP PMB Meeting 630 (F2F)

GridPP PMB Meeting 630 (05.04.17) F2F
Present: Dave Britton(Chair), Jeremy Coles, David Colling, Pete Gronbech, Roger Jones, Dave Kelsey, Steve Lloyd, Andrew McNab, Andrew Sansum, Gareth Smith, Louisa Campbell (Minutes).

Apologies: Tony Cass, Pete Clarke, Tony Doyle.

1. Introduction (DB)
DB summarised that the meeting was in 2 parts around the theme of evolution, to determine how we steer the project: Tier-2 (DB circulated a document for comment) and Tier-1. Only 2 responses to the document were received. One was from Simon George, who adapted his presentation in light of the content; this has been rescheduled into an earlier slot on Day 1. The second, from Paul Newman in Birmingham, suggested changes to numbers/kit and noted the comment on supporting ALICE at Tier-1, an issue that requires resolution in a wider context; he also asked how ALICE requirements will be factored into the document. PG noted a recent invitation from ALICE to their European workshop. Additional changes have been made to this living document, which has been evolving since December 2016; much of the discussion surrounding ATLAS has been ongoing since then.

2. Oversight Committee Meeting Documents
PG noted Tier-2 evolution dominates the meeting, and some Tier-1 topics need to be covered. The OSC documents were discussed (the OSC has been pushed back to June and the following meeting scheduled for November). They have requested a Progress Report and that any actions from the previous meeting be addressed. The customary documents will be sent as these cover all requirements. The Project Management Plan is only produced once, at the commencement of the project, and does not need to be reproduced unless significant changes have taken place. A detailed written report must be submitted 2 weeks prior to the meeting, along with slides for the presentation that will be given. PG went through all the other elements of the report and confirmed the relevant PMB member responsibilities. The timetable has now shifted to 16 June, and 31 May is the most appropriate submission date. Preparations therefore need to begin in early May.

DB confirmed a list of strategic items that need to be dealt with and should be in place, e.g. interim statements and quarterly reports (now due). Special items to be addressed, which may require separate discussion, include:

a) Work to determine pledges in September for 2018 – this requires some modelling, including input (costs from this procurement), range of possible costs for the next procurement taking account of exchange rates which are currently fairly consistent, estimates of requirements and affordability. If issues arise they should be raised to the OSC;
b) Oracle, CTA and tape planning. There is little cost within GridPP5 but this should be flagged as a potential issue for the end of GridPP5. It was noted that a fixed profile of spending in previous projects affects the tape modelling; such changes have been appropriately responded to in the past;
c) Echo should be reported on;
d) Tier-2 Evolution is also extraordinary, e.g. Tom Whyntie’s post reformatting, but not all aspects will be known by the OSC meeting;
e) Tier-1 – this should be explained, i.e. how it moved from 17.5 to 14.5; some decisions will need to be made;
f) UKT0 – no capital project may be unstable; GridPP may not be able to continue to meet specific demands, and questions should be asked on how that is managed, e.g. extra manpower/resources.

Timescales – h/w planning and pledges should start within the next 2 weeks to begin the process of preparing the documents from 1st May. DB may not undertake the planning as his models are very high level and more detail is required for 2018; AS and PG will do this after Easter and couple it to the plans and decisions on funding for Tier-2s (2019-20).

ACTION 630.1: AS and PG will commence planning and modelling for OSC documents and couple to plans and decisions on Tier-2 funding (2019-20).

3. Tier-2 Evolution Document/plan
A fundamental issue is that sites have developed in a linear way, preserving the ratio of disk:CPU. However, what happens beyond GridPP5 is unclear and it may be challenging to provide manpower at all the currently supported sites. This may mean we do better to focus some sites on just CPU, with a few larger sites which additionally support disk. The PMB discussed the Tier-2 Evolution document that had been circulated to the CB.

4. Metrics and Funding Strategy
Evolving Tier-2 sites for the end of GridPP5 means that it makes no sense to simply allocate Tier-2 funds according to performance metrics; the allocation needs to help the evolution. DB summarised models/statistics on spends, pledges, procurement (CPU and disk) and projections based on identifying sites that could be regarded as primary, secondary or not associated with specific experiments. DB compared this with previous metrics used. Ultimately, experiments should define their disk requirement and ways to incentivise. For CPU, funds can reflect work per site but are required to balance disk. Incentives and consequences were discussed, but this is a complex process depending on workflow, CPU/disk and pledges. DB will tweak the model by reversing the algorithm to determine consequences and modification for different CPU requirements. There was some discussion on amounts of supply, pledges and other incentives to support a high number of VOs. It may be possible to highlight staff costs per core hour delivered, and there is a Project Directors’ initiative that is currently looking at the full cost of providing compute. It is challenging to factor in some aspects, e.g. the costs of building a datacentre some years ago, or electricity costs. CMS are unconcerned where the disk is run so long as it is effective and efficient, but it was recognised that if some larger sites have no disk and rely on disk elsewhere, that may create too much pressure.

DB received input on what should be included in the report and on his proposed talk tomorrow, which will be amended. Many associated aspects will be raised in presentations during the collaboration meeting.

It was discussed whether the metrics cover requirements: they currently cover disk, quantity, time and CPU time. The metrics may change in the future, but will continue to separate disk from CPU and distribute funding according to what the experiments require (disk) and according to metrics as far as is possible (constrained by providing CPU to support disk).

ACTION 630.2: DB and PG will continue to work on metrics and funding strategies at the macro level.

ACTION 630.3: DB will tweak his metrics and funding model based on CPU.

5. Tier-1 Issues
AS summarised a working document on tape cost for Tier-1 in the light of recent Oracle announcements, much of which has previously been well discussed.

CERN is planning to end support for Castor and is moving to CTA (CERN Tape Archive). A meeting is scheduled in May to discuss how we may be able to get involved in CTA. This would be a large job and more challenging with less manpower in GridPP5. It is more likely that other solutions will be investigated.

The meeting with CERN will give clarity, and a discussion with a commercial provider would provide an alternative perspective.

There was some discussion on the 60% increased pledge for 2018. CMS globally are 24% below what was required, which enabled pledges to be met, though this was a challenging time.

Echo – Alistair Dewhurst will present tomorrow, but the project is on track and now delivering significantly to ATLAS (about 50% of the capacity still in Castor). Alison is keen to process a change request to make Echo a production service and this looks acceptable, but awaits approval. The review meeting for this led to a requirement for a full risk analysis to ensure input and facilitate signing off by ensuring the PMB are involved and satisfied. AS and Alison undertook risk analyses of all the implications, which was useful, particularly in helping to ensure connected issues are considered more widely and mitigated where appropriate. Critical issues to be considered around key themes were highlighted, e.g. functionality and data retention, cluster maps, availability, performance. Key performance risks include hot spotting, cluster mapping, file size and file open transaction rate limiters. The PMB were reassured that the team are looking for potential risks as opposed to managing issues as they arise.

Possible services to retire to save manpower at Tier-1: WS, LFC and Frontier – 3 FTE was discussed at the review and each saved some effort. There is no support for WS. T2K use LFC and can be moved over to the DIRAC catalogue. Investment has to be made in manpower now to save manpower later; discussions should be taking place with Jarek Novak. One WS should be run as a backstop, working towards ultimately shutting it down. There may be issues elsewhere and it is helpful to have WS available to use as a stopgap. The Frontier service is used by ATLAS – 3 or 4 sites run the service, and a statement from ATLAS may be useful to determine the importance of Frontier. LFC and Frontier share the same infrastructure and there are benefits to cutting both.

Overbooking of manpower to ensure meeting the target FTE – there is a slow recruitment process at STFC, but it comes down to who takes on the risk if we go over; extra costs coming out of other projects have wider implications. We generally undershoot the targets. The Tier-1 manager post has not yet progressed and AS is working up the job spec. Interviews for the Production Manager post have been held and a preferred candidate selected. The production team member post is being advertised and the Database team is at staffing level. The issue of overbooking stems from being unable to move money forward at the end of projects; if we are stepping down effort in Tier-1 in GridPP5 we cannot afford to provide less effort. We should take the risk of overbooking by having an employee in post for the moment; this is a transition issue that needs to be sensibly managed. The OC should be advised the transition will be addressed appropriately over the period where we can staff against, i.e. not recruit against. AS will consider how best to progress this and whether a formal statement should be worked up. The role of the production team was discussed as there is an opportunity here to redefine their remit. As the infrastructure becomes easier to run, the role of the production team has changed and consideration needs to be given to where changes could/should be made, e.g. the admin-on-duty rota each day ties people up significantly and poses risks. Establishing working patterns for Ceph will assist in this regard, as will some progressive rebalancing of the boundaries between the production team and others. There is alternative funding working in the same areas, but there is not much overlap so this does not reduce costs; this requires further consideration. The next recruitment needs to consider a sys admin.

Network metrics – this has not been bottomed out; at the review it showed high availability of core networks. This is to offset network issues being raised, e.g. previous Castor concerns. It is therefore crucial to keep on top of issues such as low-level network losses that continue to need investigation, and to have the information available to take to the network group and ensure they remain responsive. Metrics were gathered and demonstrated no significant effect on operations. Packet loss information could be produced, but this does not supply the required information. Perhaps a series of milestones would be helpful, e.g. IPv6. Working out the attribution of our downtime to network issues is very helpful information to have, since team members identify issues and manually track them to find the root cause. It may be helpful to examine high level milestones required for network developments, e.g. IPv6 development, and these need to reflect WLCG requirements – PG will consider this further. The matrix on the measurement between RAL made different from other experiments can be shown in different colours.

ACTION 630.4: RJ to provide a statement from ATLAS on the importance of the Frontier Service.

ACTION 630.5: PG will consider high level qualitative milestones required for network developments.

a) RJ noted that in ATLAS computing and software, the new person wants institutional commitments to computing tasks (classed as Class 4 service activities). These are suggested as GridPP commitments, providing us with flexibility to do tasks (e.g. running UK cloud for ATLAS). This was raised previously, but should be carefully worded. DB noted that if he compiles a table with all the institutions, he has separate entries for GridPP. The responsibility to continue to deliver this rests with GridPP – this could be advantageous as after GridPP there is something to deliver or evaluate effort.

b) AS spoke to Dave Salmond who would like to have attended and would like to sign up to UKHEPGRID. The PMB have no objection to this.

c) At the cloud working group it was suggested that a throughput activity be set up with DC and Dave Salmond as contacts; it would be helpful if a working group was established before the next meeting. The PMB did not object to this suggestion.

620.1: DB to contact DK re the procedure to deal with a security incident and the media. DK will send an interim statement to PMB in case required in future – spokesman SL as head of board or DB as project leader – in most instances it would move to STFC and involve the Press offices of the relevant institution. DB – Done. DK – Done.
628.4: LC will check availability w/c 25th September at Durham (27-29th) for GridPP39. (Update – RJ has contacted the Lancaster conference team). Ongoing.
629.1: PG, AM, JC and DC will discuss and finalise the GridPP38 session on site surveys. (Update – session 3 on Thursday, AS will swap his Friday slot with Alistair Dewhurst since Alistair cannot attend on Friday. In summary, we won’t mandate approaches, we want to smooth boundaries between Tier-2 and Tier-3 and to smooth transitions). Done.

NEW ACTION: LC will check 16-18th April for GridPP40 at Durham (beginning of the week – CHEP is later in the week and wait to see when IOP).

ACTIONS AS OF 05.04.17
628.4: LC will check availability w/c 25th September at Durham (27-29th) for GridPP39. (Update – RJ has contacted the Lancaster conference team). Ongoing.
NEW ACTION: LC will check 16-18th April for GridPP40 at Durham (beginning of the week – CHEP is later in the week and wait to see when IOP).

630.1: AS and PG will commence planning and modelling for OSC documents and couple to plans and decisions on Tier-2 funding (2019-20).

630.2: DB and PG will continue to work on metrics and funding strategies at the macro level.

630.3: DB will tweak his metrics and funding model based on CPU.

630.4: RJ to provide a statement from ATLAS on the importance of the Frontier Service.

630.5: PG will consider high level qualitative milestones required for network developments.