GridPP PMB Meeting 696

GridPP PMB Meeting 696 (04.02.19)
Present: Dave Britton (Chair), Tony Cass, Pete Clarke, David Colling, Alastair Dewhurst, Tony Doyle, Pete Gronbech, Jon Hays, Roger Jones, Dave Kelsey, Steve Lloyd, Andrew McNab, Gareth Roy, Andrew Sansum, Louisa Campbell (Minutes).

Apologies: Jeremy Coles.

1. GridPP6 proposal preparations
a) Plan and Status
DB sent Charlotte a spreadsheet with £5M of capital spend that GR kindly pulled together. The items were prioritised and placed into 4 categories of need, including local site priorities, and balanced between institutes. DB will send a prompt in a couple of days, noting that some items on the list will go beyond the deadline of the end of this week. DB thanked everyone for their contributions on such a tight timescale, which demonstrates our efficiency in producing such information quickly.

b) Resource planning and other VOs
DB circulated Plan V2 with slightly rearranged sections and aspects for discussion. The initial 3 sections remain the same as last week. The section on ‘Context’ still requires writing (e.g. IRIS and CMS); PC has contributed in some places. It was agreed that ‘tokens’ will be passed between relevant PMB members when they are working on a version. PC will work on the ‘Context’ section tomorrow.

Mapping effort along different axes and onto WPs affects how material is integrated into the different sections. DB has moved some content to more appropriate places: Section 4 (Project structure) now has subsections, and he will add detailed descriptions of the WPs there instead of in Section 6 (manpower). JC is looking at the tables from GridPP5 to determine how they can be usefully re-worked for the new proposal; this will inform the descriptions of the WPs.

WP1 Operational effort (T1 and T2), also running services for the UK Grid.
WP2 User Liaison & operations/support – effort to liaise with and support LHC and non-LHC work. AM, DK and RJ were actioned to provide text: AM has supplied text and RJ will write an initial description of WP2.
WP3 (JC & DB) – effort to fulfil commitments to the international community; this should come from JC’s tables with updates.
WP4 Development Work – Tier1 tasks, to which AD will have input and provide some examples. PC has provided some content on this that he can continue to work up with input from others, e.g. building a new tape store, data management (1-2 pages).
WP5 Management and Admin (DB, GR and PG) – management and admin posts etc., and the Impact section. JH will write Pathways to Impact and DB will extract some elements to here (c. 2 pages).

Section 5 Experiment Requirements/Infrastructure – GR is pulling together information for resource planning, and some inconsistencies have been worked through to provide more realistic information that reflects the realities of flat cash and the different iterations. GR circulated a spreadsheet after checking the accounting portal for significant resources. PG, PC and DC have responded, and GR asked for any further information on VOs so he can send out requests for resource information.

Section 6 Meeting WLCG and Experiment Requirements – the reorganisation is addressed in 3 sections: a brief introduction referring to strengths of FTEs; UK Tier1 service (AD has a draft of this); and the Effort Matrix, which will map effort onto WPs.
A section on the UK Tier2 service is also needed (not all staff deliver UK Tier2 service, but their effort must be accounted for). DB urgently requires guidance from ATLAS and CMS on where the posts supporting these experiments are located. AM has provided input for LHCb (4 x 0.25 FTEs at different locations); similar information from ATLAS and CMS is now urgently required so that DB can map it onto the tables he already has. DC confirmed that group leaders have discussed this and he is summarising the decisions with them to send to DB this afternoon. RJ has discussed with Computing services and will need input from others to provide a reasonable recommendation from the experiment to GridPP; DB and RJ will discuss further.

Sections 7, 8 and 9 (capital and resource funding planning; how to address de-scopes; summary & conclusions) cannot be started until the tables are established. We are now at the mid-point between starting the proposal and submission; DB confirmed a lot has been achieved, but more work is required. DB checked what was previously sent to the CB in December and again on 18 January and will send an updated strategy.

ACTION 696.1: RJ to provide ATLAS resource requirements
ACTION 696.2: RJ to provide ATLAS’ guidance for 2 FTE location at Tier-2 sites.
ACTION 696.3: RJ to draft 4c(iii) in the Plan2 document: A description of WP2.
ACTION 696.4: JC to update GridPP5 tasks tables and map to GridPP6 Workpackages.
ACTION 696.5: JC to draft 4c(iv) in the Plan2 document and work with DB on 4c(i).
ACTION 696.6: SL to refine/check Tier-2 Leverage Section.
ACTION 696.7: JH to draft pathways-to-impact document and extract 1 page for proposal.
ACTION 696.8: PC to work on Context section and suggest merger of motivation section.
ACTION 696.9: PC to coordinate development of 4c(v) WP4 description.
ACTION 696.10: AD to provide draft of Tier-1 section 6b
ACTION 696.11: AD to contribute via PC to 4c(v)
ACTION 696.12: DC to provide CMS’s guidance for 1FTE location at Tier-2 sites.
ACTION 696.13: DC to provide assistance to RJ with 4c(iii).
ACTION 696.14: DB to contact CB.
ACTION 696.15: DB to draft 4c(ii) with help from JC.
ACTION 696.16: DB to coordinate 4c(vi)
ACTION 696.17: DB to continue to develop effort matrix once Experiment site preferences are known.
ACTION 696.18: GR to continue to gather resource requirements.
ACTION 696.19: GR to liaise with PG on 4c(vi)2&3

2. Researchfish
GR reminded the PMB that Researchfish is now open.

a) Hepsysman is planned for 22 May and PG asked whether there are any major clashes with that date. As none were raised, PG will begin organising and gathering items for the agenda.
b) PC will chair next Monday’s PMB in DB’s absence, as DB will have very restricted internet access.

4. Standing Items

SI-0 Bi-Weekly Report from Technical Group (DC)
AD advised that Rucio was discussed, and James Perry and Tang gave updates on work at Edinburgh. This resulted in an action to meet and discuss the object integration aspect of James’ work to ensure there is no duplication. Minutes and slides are available.

SI-1 ATLAS Weekly Review and Plans (RJ)
The inefficiencies noted last week subsided over the weekend but are still being investigated. Discussion is ongoing about the use of lightweight sites, with Birmingham used as an exemplar. There has been success running Singularity jobs at RAL.

SI-2 CMS Weekly Review and Plans (DC)
Nothing significant to report.

SI-3 LHCb Weekly Review and Plans (PC)
Nothing significant to report.

SI-4 Production Manager’s report (JC)
JC not in attendance and no report submitted.

SI-5 Tier-1 Manager’s Report (AD)
– Tier-1 CPU usage report for January is attached.

– CPU efficiencies have improved for CMS (>80%), although they are still fluctuating a lot. ATLAS is still at 60 – 70% efficiency; Tim is investigating.

– The system drive in a disk server for LHCb failed on Thursday afternoon. This was a 14th-generation machine (dual purpose for Ceph). The operating system was put on the SSD (to leave the other disks for capacity), which is attached to the underside of the motherboard… Kash is doing open heart surgery today to install another disk.

– The disk buffer in front of our new Castor tape instance almost filled up. We don’t know the cause, but on 25th January, after several months of working perfectly, the garbage collection daemon stopped working properly and slowed to a few files an hour. We have been manually wiping files from the tape buffer to keep space clear while we understand the problem.

– While we were investigating the full buffer we found that NA62 has been writing files to the “disk” endpoint on wlcgTape. This endpoint does not get written to tape and was designed for a small number of functional test files (e.g. SAM tests, which get copied in and immediately deleted). There are 196,668 files using up 11TB of space that, as things currently stand, will never be migrated to tape (and if they are not used, they will eventually be deleted). Most of these files have been written in the last few weeks. I have started a conversation with NA62 (Dan + Dave) to urgently find out how important these files are.

– arc-ce04 has stopped working again. We are not sure whether this is related to the number of LHCb jobs submitted to this CE. Catalin has rolled out an updated version of the software to arc-ce05 for testing, which should fix the problem; at the very least it will mean that the ARC developers will look at the error. Unfortunately it is likely to break backward compatibility with some VOs. It would be nice if we could get LHCb to submit their jobs more evenly across our CEs (currently the ratio is 0:25:25:50 for arc-ce0[1-4] respectively).

– Procurement
We have been spending ~£400k in the last week: £213k extra, £44k of left-over capital and the remains of the resource budget (£225k). This is mostly with DELL. Some purchase orders have gone out, and the last of them should go out by tomorrow (Tuesday 5th February).
Capital procurement is waiting for delivery:
DELL Storage, delivery has been booked in for 4th March.
XMA CPU, expected delivery ~2 weeks from now.
XMA Storage, mid March, as confirmed by technical team.

AM asked whether the Compute Element issue for LHCb has been followed up, and made suggestions on how to proceed if necessary, making sure the developers are made aware.
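
For a rough sense of scale, the NA62 figures quoted in the Tier-1 report above (196,668 files using 11TB) imply an average file size of about 56 MB. The following is a minimal sketch of that arithmetic only, assuming decimal units (1 TB = 10^12 bytes), which the report does not specify; the function and variable names are illustrative.

```python
# Figures quoted in the Tier-1 report for the NA62 files on the "disk" endpoint.
FILE_COUNT = 196_668
TOTAL_TB = 11

# Assumption: decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes).
BYTES_PER_TB = 10**12
BYTES_PER_MB = 10**6


def average_file_size_mb(total_tb, file_count):
    """Return the mean file size in MB for a given total volume and file count."""
    return total_tb * BYTES_PER_TB / file_count / BYTES_PER_MB


avg_mb = average_file_size_mb(TOTAL_TB, FILE_COUNT)
print(f"Average file size: {avg_mb:.1f} MB")
```

If the 11TB figure is in fact binary (TiB), the average would be somewhat larger.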

SI-6 LCG Management Board Report of Issues (DB)

SI-7 External Contexts (PC)
Over the past week PC has been drawing together long-term resource requests from all of STFC, of which 2 lines are GridPP and Dune. This was discussed with Mark Thomson and DB for GridPP, with figures including all the requests. This sets out more than £100M by 2026, and Mark provided an encouraging response.

644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress. 08/10/18 – Leicester are producing a PO for tapes and will send to AD to produce an invoice). Ongoing.

ACTIONS AS OF 04.02.19
644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress. 08/10/18 – Leicester are producing a PO for tapes and will send to AD to produce an invoice). Ongoing.
696.1: RJ to provide ATLAS resource requirements
696.2: RJ to provide ATLAS’ guidance for 2 FTE location at Tier-2 sites.
696.3: RJ to draft 4c(iii) in the Plan2 document: A description of WP2.
696.4: JC to update GridPP5 tasks tables and map to GridPP6 Workpackages.
696.5: JC to draft 4c(iv) in the Plan2 document and work with DB on 4c(i).
696.6: SL to refine/check Tier-2 Leverage Section.
696.7: JH to draft pathways-to-impact document and extract 1 page for proposal.
696.8: PC to work on Context section and suggest merger of motivation section.
696.9: PC to coordinate development of 4c(v) WP4 description.
696.10: AD to provide draft of Tier-1 section 6b
696.11: AD to contribute via PC to 4c(v)
696.12: DC to provide CMS’s guidance for 1FTE location at Tier-2 sites.
696.13: DC to provide assistance to RJ with 4c(iii).
696.14: DB to contact CB.
696.15: DB to draft 4c(ii) with help from JC.
696.16: DB to coordinate 4c(vi)
696.17: DB to continue to develop effort matrix once Experiment site preferences are known.
696.18: GR to continue to gather resource requirements.
696.19: GR to liaise with PG on 4c(vi)2&3