GridPP PMB Meeting 644 (F2F), 13.09.17
Present: Dave Britton(Chair), Tony Cass, Pete Clarke, David Colling, Pete Gronbech, Roger Jones, Steve Lloyd, Andrew McNab, Andrew Sansum, Gareth Smith, Louisa Campbell (Minutes).

Apologies: Jeremy Coles, Tony Doyle, Dave Kelsey.

1. Intro and Tier-2 Evolution Document
DB thanked RJ for hosting the GridPP meeting in Lancaster. He noted Anthony Davenport will join the PMB at 4pm. The Tier-2 Evolution document was circulated yesterday – the discussion should continue to ensure people are aware of the trajectory, as tangible actions need to be taken at a later stage of GridPP5 and must be prepared for. This will be covered in DB’s talk tomorrow – v7 of the Tier-2 Evolution document contains a wording change relating to the model for allocation of CPU, following AM’s comments at the last F2F. However, there is a built-in caveat that experiments must sanity-check the distribution for potential unforeseen consequences. The PMB are satisfied with the wording and agree. ALICE details could be added to the document after consultation. Similar-sized experiments in the UK, e.g. NA62, need to be supported, and services should be expected to be run for them. Tier-2 HW money will be discussed soon and the document sets the scene for what should be included in the GridPP6 proposal. DB noted that Tony Medland has experience here and hopes to discuss the whole issue of GridPP6 with him at some point.
The second tranche of Tier-2 HW money for GridPP5 also needs to be planned but should be consistent with the longer term evolution plans.
PG noted some accounting anomalies (e.g. Glasgow) which need to be corrected.
ACTION 644.1: DB will discuss funding and resources with Tony Medland.

2. Tier 1 Procurement Status
PG summarised procurement for the Tier-1. Disk will be procured in two lots (to mitigate the risk of a >£1M purchase). We have recently done well on disk but there were issues with deploying CPU. The team are aware that costs need to be reduced and decisions need to be made. DB suggested that changes in procurement policy which alter the risk-mitigation strategy should be raised at the OSC for clear guidance, so this needs to be built into any decisions. AS is also looking at how effort can be cut at the Tier-1 in order to meet the falling effort profile in GridPP5. DB noted professionalism is paramount and we cannot compromise on quality – there are many factors to consider. This could be a Tier-1 review theme, and then discussed with the OSC. GridPP is known in STFC and WLCG as being very well organised.

3. Tape Usage v Allocation
As previously alluded to, requirements and actual usage by the LHC VOs are diverging. Plots on the tape-usage graph have been taken regularly over the last few years – there is divergence between allocation and actual use. The final step up shown in the latest plots does not exist in practice, as tape is “provisioned as required”. The four LHC VOs were discussed – CMS are close to target, and DC noted CMS will be delighted to know there is extra space here to be used. LHCb is probably unlikely to reach the allocated level. ATLAS advised they will not reach the allocation due to a change of emphasis following the change in computing management. AS and PG noted the £354K put aside for media this year will be put onto disk and CPU as it must be spent. RJ and DB discussed this with Simone (ATLAS Computing) and a formal decision needs to be taken on how to progress, i.e. when pledges are submitted we should not put in an increased pledge for ATLAS, but decide whether to reduce the request. RJ is discussing this with Simone tomorrow.
The current difference between used and allocated tape equates to £500K of tape media. The total tape HW plan has varied over the last 5 years; we are c. £400K overspent on media – money that could be extremely useful in other areas. On the tape-drive front we budgeted c. £600K and moved the T10KE money to disk, but we cannot spend this and could move it into this year’s disk and CPU lines, though this would have a later impact.

4. Tape Plans
AS summarised: the news that Oracle was dropping the T10KE was concerning, but this can be mitigated. We can reach the end of GridPP5 with the existing T10KD drives, and existing media is sufficient for FY17. The current robots are cheap to operate and already paid for – we may require a budget for new HW, and deployment effort will be needed for a second issue (Castor replacement). The situation is not static and has changed this week. There was some discussion of the challenges this poses for planning and finances.
Castor Replacement – CERN’s plan to end-of-life Castor is a major challenge: the project to deploy a new tape management system needs staff and funds. Viable alternatives are being considered – IBM provided some idea of costs for HPSS ($140K to buy, then a similar amount to run per annum, which traditionally remains relatively static; they come in and manage the project). A full survey has not yet been undertaken, but there are very few candidates and no volumetric charging. HPSS differs from TSM storage, but some big sites have used it for years and it has an object storage facility. RAL (not GridPP) managed to acquire £1M capital for a tape system to replace Castor, requested for FY18. Janet has enquired whether we can purchase in FY17, but this is highly unlikely. This could be used to buy the front-end cache storage system (possibly a new front end, but not clear), some tape servers (using existing robots and drives), 3 years’ maintenance and some disk. We are currently attempting to re-profile and bring forward capital spend from next year into this year (c. £310K), and attempting to ascertain other ways to spend the other £700K – tendering and procurement is very tight in such a short timescale. We need a method to capitalise projects – e.g. paying a consultant to undertake the project, but that creates other issues regarding paying for goods not yet received.
There have been discussions with CERN around a potential end date for support of Castor. We do not want to carry the cost of running EOS – could we support Castor for a while?
Echo service – this has been challenging but the project is going well and risks are falling: all issues arising had been on the risk register, and retention risks are decreasing since data has been successfully retained. CMS has not come on yet. On a positive note, the team are identifying issues as they arise and finding fixes, e.g. for data loss. Some new disk drives threw errors as data was being moved onto them, which has an impact on storage groups – the team have found fixes and these will come out in the next release. It is a mixed story on VO porting by the end of FY17: ATLAS is nearly complete, CMS is beginning to progress, and LHCb does not seem to engage pro-actively, with momentum often appearing to lie with us, though there are reasons for this because they have systems that work at the moment. ALICE is a smaller problem but will be discussed this week. Running Castor disk beyond March is a drain on the storage team, though it does deliver the MoU commitment on old, non-CEPH-capable HW.

5. Capacities and Pledges
PG noted that for the Tier-1 there is additional money from tape and capital redeployment agreed with Tony; he and AS have processed these costs and pricing estimates – looking at £492K on CPU and £1.4M on disk. This delivers 108% of the original CPU pledge and 100% of disk. By not spending on tape media and contributing instead to disk and CPU we will meet requirements based on pricing estimates – perhaps 90% would be delivered. DB, AS and PG will go through the figures in detail. Phase-out and other issues also have to be taken into account.

Tier2 pledges have not yet been worked out, but this will be discussed very soon and estimates need to be updated.

We need to be prepared for future FYs – 2019 cost models need to be updated for the next OSC based on new realities and put into context for the remainder of GridPP5.

6. RAL Posts (Tier-1 Manager, Security officer, others?)
Production manager – offered to a very good candidate and pending contract; a successful outcome is likely.
Production team – interviews held and an offer will be made; we may try to fit in a cancelled interview for a very good candidate (in the last round, 2 candidates rejected our offer).
Tier-1 Manager – paperwork finalised; likely to advertise internally in October.
Echo team system admin (Bruno) has taken a year’s sabbatical (MSc); paperwork for a replacement team member has been submitted.
CMS post – shortlisting complete and should be recruited soon.

There are issues here, particularly on the technical side, with recruitment and retention – salary is the primary issue. A case has been made for a retention allowance which is spread around evenly – currently a 5% uplift across almost all staff. Andrew Taylor has agreed that SCD can recruit above the funded complement. The apprentice scheme is beginning to deliver junior computing staff into the department. Agreement has been reached to make more active use of the relocation allowance in future advertisements. Work is under way with senior technical staff (Band E) to identify routes for promotion for staff traditionally blocked from promotion. Consideration is being given to how to exploit positive action legislation to tap into the female talent pool.

It is challenging to compete with industry where salaries are far higher.

This is a bid in process. Anthony has been invited as he will be in charge of computing for the LHC, and it is useful for him to see the scale and scope of GridPP.

8. OSC Feedback?
DB updated and summarised the feedback document he previously circulated.

3) The collaboration had met its pledges – the OSC acknowledged external factors – but for the first time in 15 years we may be unable to meet any additional uplift in pledge. Dialogue will be maintained – the case was made by DB on this basis and Tony has been extremely helpful.

4) Feedback from the latest CERN RRB was that other agencies face similar constraints. The OSC emphasised the importance of international efforts to cope with high LHC performance in Run 3. We are taking data at normal luminosity in the LHC. The OSC requested an update on the formal LHC strategy for Run 3 – there is a future strategy forum meeting on 27 October and Eckhart has asked us to provide the UK view, which pre-empts the OSC; the OSC meeting will probably be scheduled for after that meeting.

6) Oracle tape issue – a formal action stating that the collaboration is to be involved and provide a capital plan for the remainder of GridPP5. AS confirmed there is a plan in place and costings can be run, so we need to consider how to model predictions while addressing the GridPP5 proposal and the actuality – we need to somehow re-profile (disk and CPU). AS and PG will work on this.

8) The Office was to confirm the capital/resource breakdown of funding for GridPP5; this was done.

9) Revise the plan for GridPP5 ramp-down of staff at the Tier-1. AS is putting together slides on this regarding current effort and plans/recruitment, as discussed earlier.

10) CDT benefits for GridPP. AS confirmed there are discussions with Pete Oliver and others – they are not expecting anything to emerge until at least 2019, but we will need to cover this at the next meeting. Anna is a CDT lead and received a letter of support, which we could possibly use in some way, i.e. if the bid succeeds there will be links with the CDT and support. DB and AS will produce a statement on our interaction with the CDT.

11) Dirac needs will be met up to the summer, but then new media will be needed to accommodate them, at £75K – the collaboration will work with Dirac to resolve this. In principle the money was approved and now needs to be captured – AS will progress this with Mark Wilkinson.

12) A review of the GridPP OSC will take place before the next meeting.

ACTION 644.2: PG and AS will document plans to present to the OSC on costings.

ACTION 644.3: AS will put together a starting plan for staff ramp-down.

ACTION 644.4: AS will progress capture of funds for Dirac with Mark Wilkinson.

9. LSE Project – DB
Will Venters from the London School of Economics has been in touch with DB. He summarised that the Pegasus project studied GridPP in 2006-10 and published several social science papers treating GridPP as a tribe, i.e. a distributed project of this size with no direct line managers, as opposed to an industrial model. Will has kept in touch and in February asked if the group would like to participate in a similar, wider-reach project on digital interfaces between large distributed computer systems. A postdoc will do more Pegasus-like research with team members – DB has responded positively and noted that the scale of our project has to be increased and expanded into other user groups who do not have our level of expertise. Funding of £7.5M has been secured for the project from EPSRC, with in-kind funding required from GridPP of c. £10,000 to cover 6 days of expert time over 6 years for the researchers to engage in discussions/interviews. The PMB agreed to proceed and noted the previous project was a valuable exercise. It would be valuable for them to attend some GridPP meetings – one significant benefit of the previous project was building communities, and this has wider implications in our field, e.g. SKA, LSST, etc. One question is the difference between industry and scientific projects: in scientific projects many participants have a shared scientific goal, which forges cohesion and reinforces trust between participants for making different elements work. Co-operation was also a significant element.

ACTION 644.5: DB will respond to Will Venters confirming the PMB’s agreement to participate in the LSE project.

10. Anthony Davenport from STFC
Anthony talked about e-infrastructure and STFC. STFC are increasingly aware of how important computing is. The STFC e-infrastructure strategy is in draft, but the strategic review has been published. This acknowledges a potential tenfold increase in compute requirements over 5 years, with no real understanding of how that would be implemented or funded. A working group put together a 20-page strategy with c. 10 recommendations on closer collaboration, coordination between research areas and sharing of knowledge, and better understanding of how funding for compute is issued throughout STFC. This points strongly to collaboration with other research councils; it is going through the executive board and is likely to be published imminently.

Unfortunately, funding is not increasing, but areas of synergy will be opened up – flat cash is a c. 20% decrease across the board, and there are many large projects opening up across the sciences. There are very heavily used resources (40,000 computers in the UK, and Dirac with possibly 2% headroom) – joining up resources is helpful, but if they are not bespoke for an activity then they can be inefficient. The strategy has to be supported with more money for HW – AD suggested the burden will be eased over time through collaboration, i.e. bespoke equipment for different activities but sharing of knowledge where it is ubiquitous. Research software engineering can be used across the board.

LHC – computing was never properly costed, and this has continued through STFC and other research councils, i.e. projects are approved on their scientific merits without properly considering computing requirements. This is a long-standing issue that was not anticipated – but computing must be built into project design.

AD confirmed there will be a long process before a ‘national infrastructure’ or monitoring mechanism is approved or established. Ideally, projects that are peer reviewed and approved will have free access in the longer term. STFC want to make computing a fundamental part of the research grant application – much like the Dirac system.

AS noted that, in the effort to produce a coherent picture, the word ‘infrastructure’ is used a lot and HW is discussed, but HW alone does not build infrastructure. It is challenging to link up these types of disjointed resource, and if a coherent infrastructure is to be constructed it will not work with current staffing levels and simply placing compute at sites. AD noted the initial bid had equivalent capital and resource, but that was too aspirational – this document works on the next 3 years and requirements moving forward, and must be done from the ground up to identify common ground/requirements. PC discussed the 3 layers of staffing: operations staff (bottom), experiment staff (top) and software infrastructure designers (centre). The HW could be outsourced, but software staff are essential – grant panels are scientists and do not always grasp requirements. The infrastructure needs to be understood. AD confirmed this is something STFC are aware of, and ideas are being considered on how to mitigate it in the future once we know how computing is going to be dealt with.

DB commented that we were very pleased Anthony agreed to give a talk at tomorrow’s Collaboration Meeting so that delegates see that this joining up of infrastructure is essential to the wider context of their work. The UKRI discussion is very useful in helping people appreciate this. This phase of GridPP ends in March 2020; next year we will need to begin preparations for the next phase of work. In each phase we have had to review staff, and it is important to show staff they are part of a larger picture despite GridPP funding not increasing in future. DB summarised how challenging it will be to establish the NEI aspiration. STFC is the lead in this project and has a much firmer handle on this than other research councils.

RJ enquired how STFC will link within the UK and transnationally, as there is no one solution to all these aspects for different experiments. This is acknowledged in the document, but how these integrate internationally is not yet fully understood. DB noted that in WLCG leadership can be established and influence exerted, but not imposed on others. AD confirmed that having a common grouping/contact would make it easier to influence and would aid cohesion.

PC noted concern that grant panels would consider it inappropriate for GridPP staff to work for non-particle-physics projects, as this is perceived as negative and unnecessary; this diverges from STFC, who see these additional projects as very positive. Currently staff may do this work and help projects, but STFC could assist in highlighting the benefits of it. There was discussion on who could progress this – grants panels are independent, but they do have guidance to work from. One way may be to go via different funding streams to pick up the costs of contributing to other projects, though this may take some convincing of the Executive Committee. GridPP has a challenge because we are a project committed to PP, whereas Dirac are more flexible, though both have challenges in identifying what projects they will be involved in over the next 4 years.

638.2: AS will check when equipment is due to become obsolete and investigate the legal and manpower aspects of donation to the African Data Centre for Bioinformatics and Medical Research. (Update: AS is looking into how this may impact the Global Challenge Research Fund – GCRF – which would involve a cross-Council bid.) Ongoing.
641.1: AS will email Tony Medland asking whether a decision can be advised before September on the availability of STFC funding before the procurement submission deadline in August. Ongoing.
642.4: RJ will consider who to give a talk on LHC and CERN current status. Ongoing.
643.1: DB will go over the Network Forward Look document to make corrections and clarify where necessary.
643.2: DB will update the agenda with suggestions for extra talks at GridPP39.

ACTIONS AS OF 13.09.17
638.2: AS will check when equipment is due to become obsolete and investigate the legal and manpower aspects of donation to the African Data Centre for Bioinformatics and Medical Research. (Update: AS is looking into how this may impact the Global Challenge Research Fund – GCRF – which would involve a cross-Council bid.)
641.1: AS will email Tony Medland asking whether a decision can be advised before September on the availability of STFC funding before the procurement submission deadline in August.
642.4: RJ will consider who to give a talk on LHC and CERN current status.
643.1: DB will go over the Network Forward Look document to make corrections and clarify where necessary.
644.1: DB will discuss Grant funding and resources with Tony Medland.
644.2: PG and AS will document plans to present to the OSC on costings.
644.3: AS will put together a starting plan for staff ramp-down.

644.4: AS will progress capture of funds for Dirac with Mark Wilkinson.

644.5: DB will respond to Will Venters confirming the PMB’s agreement to participate in the LSE project.

644.6 – (during GridPP39 meeting) DB will ascertain what Biomed do and how they use the Grid.