GridPP PMB Meeting 573 (F2F)

GridPP PMB Meeting 573 (09.09.15) (Face to Face – Liverpool)
=================================
Present: Andrew McNab (minutes), Peter Clark, Tony Cass, Claire Devereux, David Colling, Pete Gronbech (chairing), David Britton, Steve Lloyd, Roger Jones, Gareth Smith, David Kelsey, Andrew Sansum

1. Introduction + GRIDPP5 status
===================================
DB and DK still sorting travel for GridPP5.

FEC cases ok for most, but not all, posts.

Uncertainty about when grants will be issued: it seems unlikely that redundancy notices can be avoided. Assume things are going ahead, though nothing is certain until the papers are signed. STFC Council meets on 29th September; the CERN rebate issue will become clear before that, and hopefully things will be formally accepted on the 29th. The Spending Review is going to be very late, so we cannot wait for that.

CG experiment computing post cuts (impact on LHCb certainly: ~30%). Where does that leave GridPP's intention to suggest Ganga as a solution for smaller experiments? What if the LHC experiments ask for computing effort from GridPP? Risk of GridPP posts being cut as experiment posts again.

PPRP bid for tools as a project (or projects)? Could make an EU-T0 bid for some of the "missing" tools support (eg Ganga). Matching funding for a Horizon 2020 project? What about UK-T0? There is also the cut from the 100% -> 90% funding move. Need to tell projects that they need to find funding themselves (CSR?) for the work within this that they need. GridPP needs to engage successfully with the new projects to survive post-GridPP5 (in whatever form). Helping one of these other communities is something institutes can help with, more so than more blue-sky work.

2. GRIDPP Metrics
=================
ATLAS metrics use REBUS and analysis/production jobs from the ATLAS database; APEL is not used.

What about sites (eg UCL) that use remote storage (ie at QMUL)? Analysis jobs are meant to credit a site for its performant storage, but if the storage is at another site, how is that fair? What about measuring usage of storage (data flows) rather than just capacity? But a site isn't in control of whether it gets hot data or not. Just count storage and jobs run? What about using weighted wallclock instead of used CPU then? Could we include a measure of storage performance in the credit for the disk, ie some kind of storage throughput benchmark?

We used to rely on CPU time used in case a site's storage was inefficient, but that is now unfair to sites whose jobs are using remote storage at another site inefficiently. How much should things be metric-driven in the new GridPP5 world?
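As a purely illustrative sketch of the kind of combined credit discussed above (weighted wallclock plus a storage term, rather than CPU time used), the following Python fragment uses assumed weights and made-up inputs; it is not an agreed GridPP metric.

```python
# Illustrative only: a hypothetical site-credit calculation crediting
# weighted wallclock plus deployed storage. Weights and inputs are
# assumptions for this sketch, not an agreed GridPP formula.

def site_credits(sites, cpu_weight=0.7, storage_weight=0.3):
    """sites: dict of name -> (weighted wallclock in HS06-hours, deployed TB).
    Each term is normalised to the total across all sites before weighting,
    so the credits sum to 1.0."""
    total_cpu = sum(cpu for cpu, _ in sites.values())
    total_disk = sum(disk for _, disk in sites.values())
    return {
        name: cpu_weight * (cpu / total_cpu) + storage_weight * (disk / total_disk)
        for name, (cpu, disk) in sites.items()
    }

# Toy example with made-up numbers for two hypothetical sites.
print(site_credits({"SiteA": (10_000_000, 2000), "SiteB": (4_000_000, 500)}))
```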

ACTION 573.1:
Steve Lloyd to chair a new metrics group to look at this. Possible members: Andrew McNab, Pete Gronbech (to be confirmed).

3. GridPP4+ hardware allocation
===========================
Expect £3.2 million for Tier-1 + Tier-2. This covers the MoU and our required transition/development plan involving Ceph and Cloud. Notionally £2.2 million for the Tier-1 and £1.246 million at the Tier-2s (including PPD). 6 PB of Ceph at the Tier-1 allows parallel running during the year; cost is around £70/TB for the 6 PB. Extra CPU at the Tier-1 will allow cloud development, giving space for T0-supported projects.
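For scale (assuming the £70/TB figure and decimal terabytes): 6 PB is 6,000 TB, so 6,000 TB x £70/TB is roughly £420k for the Ceph capacity.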

What to do about LHCb T2-D? Last time Manchester and PPD were funded to provide LHCb disk; we now also have Imperial, and possibly Glasgow and Liverpool, as T2-D sites. Consensus to fold T2-D funding into the main LHCb allocation as with the other experiments. Use Steve's metrics for the GridPP4+ funding, and do something new next time. So the LHCb allocation will be based purely on CPU this time, and pledges don't require more at this stage. The snapshot will be done on 1st October, two years from the last one.

As before, there is the question of rounding for grants of less than £10k, which can't be issued as capital. New hardware at smaller sites? Cut off at £10k? Round up to £10k? Encourage sites that are not continuing with GridPP staff/hardware funding to use the money to consolidate with other local resources.

4. LHCb Tier-1 disk allocation
=========================
LHCb has a downwardly revised disk profile for future years, in light of less LHC running than expected in 2015 and changes to archive policy.

ACTION 573.2
Andrew McNab to ask LHCb data management for a projected profile of requirements during 2015 and 2016 for RAL Tier-1 planning.

Can we compare the LHCb data usage profile and machine-delivered luminosity to try to estimate things ourselves within the Tier-1? Probably. The problem largely goes away with the revised figures once they go into REBUS (~8 PB rather than ~12 PB). Don't want to rush into changing GridPP tape figures at this stage; we can change allocations without changing the pledges themselves. Can figures be changed retrospectively in REBUS? A document is being sent. We should follow the numbers in REBUS. 2015 is effectively a warm-up for the accelerator, but there is an extra year's running in 2018.
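A minimal sketch of the kind of in-house estimate mentioned above, assuming (purely hypothetically) that disk need grows linearly with delivered integrated luminosity on top of a fixed base; every number and name below is illustrative, not an LHCb figure.

```python
# A minimal sketch of the luminosity-scaling estimate discussed above.
# Assumes disk need = fixed base + (TB per fb^-1) * delivered integrated
# luminosity; all numbers here are illustrative, not LHCb figures.

def projected_disk_tb(base_tb, tb_per_invfb, delivered_invfb):
    """Very rough projection of Tier-1 disk need from delivered luminosity."""
    return base_tb + tb_per_invfb * delivered_invfb

# Hypothetical delivered-luminosity scenarios (fb^-1) for 2015 and 2016.
for year, lumi in [("2015", 0.3), ("2016", 1.5)]:
    estimate = projected_disk_tb(base_tb=3000, tb_per_invfb=1500, delivered_invfb=lumi)
    print(year, round(estimate), "TB")
```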

5. Ceph
==========
Talk by Alastair Dewhurst on Ceph status (see agenda page).

Summary slide: the Grid Cluster is up on new hardware and being tested; a clear plan is in place to ensure the MoU is hit in April 2016 regardless of progress with Ceph; a plan is in place to migrate ATLAS and CMS once Ceph is working.

Milestones on slide 12: not certain of hitting the 1st October 2015 milestone of everything being verified as working; confident about ATLAS, but other hidden experiment requirements may emerge. Why not use dCache? We don't want to run a HEP-specific solution; dCache is moving to dCache-on-Ceph anyway, and Ceph already has the functionality. Would it be better to delay the January 2016 decision point? The procurement decision will need to be before that, but is the storage hardware itself agnostic between Ceph and Castor? This year's kit has RAID cards; next year's won't, so it will not be suitable for Castor. We want the ability to deliver the pledge via Castor, which gives us a year to make sure the Ceph service can be used, and to start migrating to Ceph to verify it can be used at production scale. Can we already meet the April 2016 MoU with Castor? Perhaps we need to buy about 20% of the procurement with RAID cards, or buy RAID cards separately if needed. We want milestones/metrics for switching to relying on Ceph. Taking intense CMS and ATLAS activity off Castor makes it easier to keep Castor running.
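For illustration, a smoke test of the sort that could sit behind a "verified as working" milestone might write and read back an object through a Ceph RADOS Gateway S3 endpoint. The endpoint URL, bucket name and credentials below are placeholders, not real RAL values, and S3 is only one of the interfaces Ceph offers.

```python
# Hypothetical smoke test against a Ceph RADOS Gateway S3 endpoint using
# boto3. Endpoint, bucket and credentials are placeholders for this sketch.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://ceph-gw.example.ac.uk",   # placeholder RGW endpoint
    aws_access_key_id="ACCESS_KEY_PLACEHOLDER",
    aws_secret_access_key="SECRET_KEY_PLACEHOLDER",
)

bucket, key, payload = "gridpp-smoketest", "hello.txt", b"hello ceph"
s3.create_bucket(Bucket=bucket)                     # assumes bucket does not already exist
s3.put_object(Bucket=bucket, Key=key, Body=payload)
assert s3.get_object(Bucket=bucket, Key=key)["Body"].read() == payload
print("object round-trip OK")
```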

6. Open Compute
================
The Open Compute platform allows you to design motherboard hardware yourself and get it fabricated. Some large US departments have done this, and we have experience (eg at IC) from building trigger systems. It would cost about £20k to build some prototypes, and take longer if changes were required. You need ~50k units to be cost-effective. We would probably only need 3 or 4 designs for CPU boards, plus a couple of storage machine designs. However, this is probably not helpful for GridPP due to the scale required. What about WLCG and trying to do it at that level? CERN has looked at this but we are unsure of the outcome: "not wonderfully cost effective". It doesn't seem to be viable for us.

7. Smaller Tier-2 Sites
========================
We need a process for building a plan for transitioning the smaller Tier-2 sites. The Vac solution exists, and sites may be able to use Cloud solutions if they have some additional funding. We also need a lightweight way of doing storage. We want a recommended way of doing things and a plan for getting there. We need a site running on Vac for longer and at larger scale to find all the hidden operational problems, and a storage solution too.

UCL. Sussex next? Sheffield too? But also larger sites (Oxford? PPD?).

Storage? Do we want it at small sites, eg QMUL storage used from UCL? Implications for the network? Have a working group look at strategy, including some relevant sites and experiment input, and including Sam Skipsey's ideas from GridPP34.

ACTION 573.3:
Andrew McNab and David Colling to form a Tier-2 evolution working group: members in 2 weeks, terms of reference in 4 weeks, outline plan in 8 weeks.

Do we need to ask sites, via the CB, if they wish to continue at 0.5 FTE? They have already agreed via the Je-S forms. At the point we start issuing GridPP5 hardware money, the commitment being made extends beyond GridPP5 (because of the lifetime of the equipment); that is the point at which we need to get a new commitment from them.

8. AOCB
==================
MoUs for other VOs: for the LHC experiments we have the WLCG MoU, and for EGI VOs we have its MoU, but we don't have the same for those within GridPP using our VOMS. This is vital for UK-T0. Do we need a formal MoU? They need to have an AUP, and an MoU can include a requirement to give credit in publications.

ACTION 573.4:
Create/pick an MoU and other required documents: Peter Clarke to form a group with David Kelsey, Jeremy Coles and others.

Cut-off point for GridPP4+ accounting/allocations to be announced.

REVIEW OF ACTIONS
=================
568.1 DC to investigate the Open Compute Project and revert to the PMB with information about cost-savings, risks, and capital required. Done.
571.1 Maintenance – PG to do and DB to inform Tier-2s what we are pledging on their behalf – PG needs to answer so that DB can do this. Ongoing.
571.2 JC to speak to Frederic re GridPP35 talk. Done.
571.3 DC to talk to LUX-Zeplin re GridPP35 talk. Done.
571.4 PG to talk to LSST re GridPP35 talk. Done.
571.5 On RAL, need to know who is speaking on Ceph. DK took note. Ongoing.
571.6 PMB members get their quarterly reports in by the next PMB meeting in 2 weeks. Ongoing
572.1 Regarding GridPP5: PG to liaise with institutes and check costs and corresponding RAL figures, then do calculations based on staff costings – these must be done by Friday 28.8.15 or early w/c 31.8.15 at the latest. Also to check if ATLAS propose any changes before sending (he is currently on a 3-day course for ITIC). Done.
572.2 DB to send note about consumables to STFC. Done
572.3 Regarding normalising the travel budget: DB will make a case noting all of the above and try to pare back travel costs as much as possible but taking account of the above issues. He will liaise with DK before sign-off with PMB. Done.
572.4 Re email from Sarah Verth: DB will construct a generic paragraph for the introduction for 3 cases based on the Glasgow posts with a set of headings and circulate to others. Done
572.5 Re GridPP35: DB will email UK GridPP about the costs for the hotels and confirm allocated hotels. Done

ACTIONS AS OF 09.09.15
======================
571.1 Maintenance – PG to do and DB to inform Tier-2s what we are pledging on their behalf – PG needs to answer so that DB can do this. Ongoing.
571.5 On RAL, need to know who is speaking on Ceph. DK took note. Ongoing.
571.6 PMB members get their quarterly reports in by the next PMB meeting in 2 weeks. Ongoing
573.1 Steve Lloyd to chair a new metrics group to assess how much things should be metric-driven in the new GridPP5 world.
573.2 Andrew McNab to ask LHCb data management for a projected profile of requirements during 2015 and 2016 for RAL Tier-1 planning.
573.3 Andrew McNab and David Colling to form a Tier-2 evolution working group: members in 2 weeks, terms of reference in 4 weeks, outline plan in 8 weeks.
573.4 Create/pick an MoU and other required documents: Peter Clarke to form a group with David Kelsey, Jeremy Coles and others.