GridPP PMB Meeting 704

GridPP PMB Meeting 704 (08.04.19)
=================================
Present: Dave Britton (Chair), Tony Cass, Pete Clarke, Alastair Dewhurst, Tony Doyle, Pete Gronbech, Jon Hays, Steve Lloyd, Andrew McNab, Andrew Sansum, Louisa Campbell (Minutes).

Apologies: David Colling, Dave Kelsey, Roger Jones, Gareth Roy.

1) GridPP6 Response to the Office
=================================
DB thanked members for their input to the document, which is now in good shape, and particularly thanked PC for leading and shaping the response. It will be submitted to the Office this afternoon, and the Glasgow JeS will also be submitted. The PMB approved the final version.

DB confirmed that V51 of the GridPP6 proposal has been submitted. It includes changes to the titles of 5.1 and 5.3 (WP1a and WP1b); WP1c has now also been included, with a sentence added to explain it. There is also a duplicate of Table 20, the full financial table, with updated numbers extracted from the JeS forms (<0.2% difference). The risk register is now Appendix 2. This has now been uploaded as the case for support, along with the financial tables.

2) GridPP6 Response to the Panel
================================
There is a skeleton document to which DB, RJ and JH added comments, and PC created V2 with comments. The response must be submitted by 23rd April. DB covered each question so that it can be progressed this week and fine-tuned the following week. Questions to be covered are:

  • GridPP6 applicant guidelines and description of how new technology will be used. GPUs could be covered and possibly HPC. We should note international collaboration and coding efforts tied to HW that may then have to change, highlighting that we are operating within certain constraints. This is largely in the hands of the experiments, i.e. we work with them to ensure the efficient use of existing technology; we could perhaps cover running compute at the highest possible load and multi-threaded frameworks. Particular focus should be on how sites work hard to maximise throughput. Need to discuss with experiments – e.g. IceCube running on GPUs. Our influencing role is critical and should be drawn out.

WP4 work – develop technology to make gains. Tier-1 and AD should contribute to this, covering influencing, as we have very good people in post who can identify inefficiencies and either resolve them or direct them to the appropriate people for resolution. This is an opportunity to highlight the value of liaison posts.

  • Financial savings in GridPP5 and expected savings in GridPP6 on IRIS and EU projects, continuing coordination across UKRI. For GridPP5, case studies can be used, e.g. LZ and DUNE as well as IRIS, IGO and LOFAR. For GridPP6, we can perhaps describe potential savings, e.g. at Tier-1, through efficiencies and staffing (16 down to 14 staff at Tier-1 over GridPP5 due to efficiencies and usefulness to other communities). This could be extrapolated to show how we can make similar savings in GridPP6, while making clear that we cannot make similar reductions for GridPP6. We could note that the value for money of funding GridPP6 helps other projects that have not been funded with staff, e.g. IRIS – we can run for them at a lower cost.
  • Payment from partners for over-delivery of MOU resources? We should note that LHC experiments (especially ATLAS) rely on over-pledge resources, and there is a related statement from the Computer Resource Group. There is therefore no over-delivery for the science, and no question of a refund, since the over-contribution is from resources not funded by GridPP. We are making the pledge, not over-delivering. DB will look at the blue accounting portal to check the level other countries contribute, to demonstrate shared effort. Critically, we cannot pledge what we do not own and experiments cannot rely on things that are not pledged. Contingency of future resource – all relate to quality of service. Decrease in funding for Tier-2s could have relevance. DB, RJ, AD and DC will work on this.
  • Evidence that a reduction of Tier-2 would disadvantage UK analysis capability. We are part of a group so jobs can run anywhere – we must provide our pledge and we provide resources for UK and other groups with bandwidth. Resources and adaptability should be drawn out here, e.g. Tier-2 can be used where necessary for intermediate analysis, then distributed across sites in a sensible manner, and Tier-3 can access that storage when necessary.
  • Scientific justification for requesting FEC for each post in a similar manner to CG-supported posts. This could be interpreted as asking either for a post-by-post scientific justification based on the post-holder, or for a general case on FEC for all posts. It is considered unnecessary to justify each post-holder as this is not done for CGs – TD and PC suggest a general justification based on quid-pro-quo. We should state that the post-holders are engineers who give talks at conferences and are highly skilled, and therefore qualify for FEC, and that a case has been written for each post. CGs have been affected by variable FEC-related costs in each institution, but that has been handled in the GridPP6 proposal according to the guidelines and posts have been reduced to manage that – this should be made explicit in the text, noting that universities will make substantial losses if the posts do not attract FEC.
  • Rationale for removing academic capability or experimental support instead of reducing travel, FECs or overheads. FEC is not in our gift and relates to Question 5. We must defend the travel budget as critical to the effective running of the collaboration – DK has circulated information in this regard.
  • Division between Tier-1 and Tier-2 not well explained and should be clarified, and consequences of removing Tier-1 explained. We should make clear that the balance between Tier-1 and Tier-2 is determined by the experiments, that the distinction is one of quality of service, and that the consequences of removing Tier-1 are dire (AD has text that he will insert).
  • Not explained how the UK pledges would be met for Tier-1 and Tier-2 in terms of Resources and Capital. Focus on the minimum required to meet pledges and how the compute would be ramped up with the upgrade. The full case is required to meet the pledges for the MOU and to avoid creating issues for the future. On it not being clear how the compute required relates to resources and how compute ramps up with the upgrade: we are not asking for resources for that, so it is unclear how best to respond. AD stated there is text for this in the proposal; we could note that group resources are not formally needed to deliver the pledge. This is challenging to define with precision due to uncertainty over future costs of HW. Clarification on the latter part may be required.
  • Tier-2 electricity costs. GridPP5 document outlines – this relates to the earlier points on FEC-related posts, i.e. electricity is not charged if posts are FEC’d, and if FEC is removed then running costs would be charged, so this is a leverage included on impacts deliverables.

DB asked members to contribute text where they feel able to do so; he is unable to deal with this until Thursday and Friday this week. He has created V3 and sent it to AS and AD for inclusion of their text. They will create V4 for wider circulation to other PMB members.

ACTION 704.1: ALL should discuss with experiment reps to develop response for question 1 of the GridPP6 Response to the Panel.

ACTION 704.2: AD to develop a response relating to WP4 work (question 1) of the GridPP6 Response to the Panel.

ACTION 704.3: DB, RJ, AD and DC will work on question 3 of the GridPP6 Response to the Panel.

ACTION 704.4: ALL should review and contribute to the GridPP6 Response to the Panel where appropriate.

3) Q4/18 Reports missing
========================
These are urgently required for the OSC and we are already into Q2.

  1. Tier-1 (AD) – AD confirmed he will submit this week; he is working on this with Darren.
  2. CMS (K Ellis & DC) – DC was not present.
  3. Operations (MD) – still missing.

4) Oversight Committee Inputs
=============================
Documents are due in 2 weeks and need to be prepared as a priority. DB reminded members this is a new OSC and an important opportunity for positive impressions to be made. DB will email and action specific members as usual. DB will write a report for this and for GridPP42.

ACTION 704.5: DB will write a report for the OSC.

5) AOCB
=======
GridPP42 sponsorship – AD has not yet been able to secure a sponsor: BIOS-IT cannot sponsor and Dell state they now need 1 year advance notice for future sponsorship requests. AD will be speaking with EXMAY this afternoon in this regard, but is not hugely optimistic. It is challenging for others to find sponsors due to local constraints/conflicts. DB confirmed that if no sponsors are secured then the dinner costs would need to be absorbed.

6) Standing Items
=================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
DC was not present and no report submitted.

SI-1 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
Some tidy-up of the LOCALGROUPDISK in ATLAS – more to be done later on.

SI-2 CMS Weekly Review and Plans (DC)
-------------------------------------
DC was not present and no report submitted.

SI-3 LHCb Weekly Review and Plans (PC)
--------------------------------------
No report submitted.

SI-4 Production Manager’s report (JC)
-------------------------------------
No report submitted.

SI-5 Tier-1 Manager’s Report (AD)
---------------------------------
– We are seeing high outbound packet loss over IPv6. Investigations ongoing.

– On Thursday 4th April, around half the containers (each of which runs a job inside) restarted, killing the jobs inside them. We do not yet understand the cause, although it was on a particular subset of machines.

– gdss700 (LHCb) suffered a double disk drive failure on Friday 6th. It has been out of production since. We are having problems rebuilding it as there also appear to be disks with soft failures. We are trying to recover the ~8000 files that are not mirrored on Echo and will then decommission the machine. There was an action from the OSC last time relating to this so AD may write something.

SI-6 LCG Management Board Report of Issues (DB)
-----------------------------------------------
PC attended and circulated comments. AD noted the section on security aspects – the UK has been pushing IAM rather than EGI Check-in, e.g. for IRIS.

SI-7 External Contexts (PC)
---------------------------
No report submitted.

REVIEW OF ACTIONS
=================

644.4: AD will progress capture of funds for DIRAC with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress. 08/10/18 – Leicester are producing a PO for tapes and will send to AD to produce an invoice). Ongoing.

702.1: DC to identify an LZ presentation for GridPP42. Ongoing.

702.3: GR to update Table 20 to bring numbers in line with the returned JeS forms. Ongoing.

702.5: PC to draft a set of milestones for WP4. Ongoing.

702.6: PC & DB to add some additional text to bring things together (WP1c). Ongoing.

703.1: DC to provide figures for WP2 numbers.

703.2: AD will contact Darren (Tier-1), Tim (ATLAS) and Katie (CMS) for Q4 reports.

ACTIONS AS OF 08.04.19
======================
644.4: AD will progress capture of funds for DIRAC with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress. 08/10/18 – Leicester are producing a PO for tapes and will send to AD to produce an invoice). Ongoing.

702.1: DC to identify an LZ presentation for GridPP42. Ongoing.

702.3: GR to update Table 20 to bring numbers in line with the returned JeS forms. Ongoing.

702.5: PC to draft a set of milestones for WP4. Ongoing.

702.6: PC & DB to add some additional text to bring things together (WP1c). Ongoing.

703.1: DC to provide figures for WP2 numbers. Ongoing.

703.2: AD will contact Darren (Tier-1), Tim (ATLAS) and Katie (CMS) for Q4 reports. Ongoing.

704.1: ALL should discuss with experiment reps to develop response for question 1 of the GridPP6 Response to the Panel.

704.2: AD to develop a response relating to WP4 work (question 1) of the GridPP6 Response to the Panel.

704.3: DB, RJ, AD and DC will work on question 3 of the GridPP6 Response to the Panel.

704.4: ALL should review and contribute to the GridPP6 Response to the Panel where appropriate.

704.5: DB will write a report for the OSC.