GridPP PMB Meeting 668

GridPP PMB Meeting 668 (14.05.18)
Present: Dave Britton (Chair), Tony Cass, Jeremy Coles, Alastair Dewhurst, Tony Doyle, Pete Gronbech, Roger Jones, Steve Lloyd, Andrew Sansum, Louisa Campbell (Minutes).

Apologies: Pete Clarke, David Colling, Dave Kelsey, Andrew McNab.

1. Quarterly Reports
AD is working with Tim, who will produce the quarterly report for ATLAS and Tier-1, and RJ will input the narrative and figures for formal submission. The Tier-1 report will be submitted to PG; it is complete with the exception of manpower figures, which will be finalised shortly.
LHCb and CMS are still awaited, but AM and DC are not in attendance today to provide updates.
Security & Operations will soon be submitted by DK and JC.

2. Tier-2 H/W Survey
Interim report on survey and spend: PG attached spreadsheets listing sites and their spending preferences for FY18-19. Most sites are not concerned about when the spend takes place, so this should be manageable. PG will report on this once he receives input from the experiments and from the outstanding actions on RJ and DC on how disk should be allocated. RJ noted this will require discussions with others, potentially in June, but this could be brought forward if necessary. DB will consider timescales: if several big ATLAS sites need to spend this FY, this should be provisionally planned very soon. If some sites are already procuring we should seek to assist them.
An additional document gives a profile of site capacities over the coming few years; the numbers still need to be tweaked and finalised. PG will now consider the integral and compare it with experimental requirements. He will then consider the gap between profile and need, to ensure the spending profile matches requirements. PG will continue with this work on his return from HEPiX.


4. Standing Items

SI-0 Bi-Weekly Report from Technical Group (DC)
There was a meeting last Friday, which discussed the nature of future meetings. DC sent a summary of this to the tb-support list over the weekend.
The PMB should have a view on the content of the meetings and what is required and this will be discussed at a future PMB.

SI-1 ATLAS Weekly Review and Plans (RJ)
The soft error rate in Echo (a retry then succeeds) is running at around 10% and may relate to CMS. This is being investigated at Tier-1. They are running slightly short on tape. Four sites are experiencing FTS timeouts: the file transfers successfully but times out on wash-up. This has been reported; it relates to CentOS 7 and Linux 7 and is not thought to be related to the previously reported CMS FTS issue. We are succeeding for everyone except CMS and it may therefore be an issue relating to CMS, but there is no difference in failure rates between RAL and others. There is no ticket to respond to; the comment derives from DC, and there has been no change since AD’s email to the PMB last week. Andrew Lahiff had made changes to how tape requests are processed at RAL and this has been fixed; there has also been some recent increase in tape requirements. Most of the failures relate to timeouts. AD is following up with Chris Brew and DB will follow up with DC.

SI-2 CMS Weekly Review and Plans (DC)
The main CMS issue is the FTS problem that won’t go away. DC hadn’t raised it before last week as he had assumed it was a local difficulty that would be easily fixed. This turns out not to be the case: it has gone on for 3 weeks now and has had knock-on effects elsewhere (other T1s changing settings and taking transfers that should have been managed by RAL, etc).

SI-3 LHCb Weekly Review and Plans (PC)
PC not in attendance and no report submitted.

SI-4 Production Manager’s report (JC)
No urgent items for discussion from me this week, but here are some points that may be of interest:

1. The Spring HEPiX takes place this week in Wisconsin. The agenda is at The focus is on site reports, networking, security/identity management, benchmarking and batch updates, storage group progress, tools and clouds.
2. A number of GridPP sites are experiencing PUT DONE errors (failed) for ATLAS transfers. The storage group is investigating. Current analysis suggests sites affected are using SL7/CentOS7.
3. As per an EGI request we are starting our annual review of site GOCDB information.
4. There was a WLCG GDB last week: Topics included use of Dynafed, a framework for lightweight sites (Andrew McNab gave a WG update), updates from CNAF and OSG and an overview from David Crooks on next steps for security trust frameworks.
5. LSST users are seeking to create a CVMFS repository for use with UK work. This is a good opportunity to support both another community (LSST) and common infrastructure (UKT0-IRIS).
6. There is an ongoing need to follow-up with sites that have nodes (flagged by Pakiti) needing updates due to Singularity vulnerabilities.

7. Last week the UK reached a relative high for the number of open GGUS tickets (49). This is due to ongoing IPv6 and perfSONAR tickets on top of more regular issues. There are no “team” tickets beyond green.

8. The next UK HEPSYSMAN meeting is scheduled for 18th-20th June at RAL:

AD enquired about UKT0 and Tier-1 offering resources to DUNE. JC advised there are a few changes, but most of the collected information is on a wiki page on the GridPP website. They are currently an incubator project that may grow in the future. Things with UKT0 (IRIS infrastructure) will now change quite quickly, and funding spent over the coming 4 years will quickly change how these aspects are positioned, but this will continue in the current way in future. PC and AM are attempting to get DUNE computing more integrated in the UK; AM is now acting as our liaison in this regard. They are attempting to get more visibility and more sites enabled by August, though we may not have IRIS (UKT0) infrastructure in place by then.

SI-5 Tier-1 Manager’s Report (AD)
– Very quiet week operationally.

– Despite Dave Colling’s complaints to the PMB, we have not received a ticket from CMS regarding our FTS service. We have no evidence of there being any problems with the service.

– Both ATLAS and CMS are seeing some XrootD failures from the gateways on the WNs (for Echo). This is fixed by restarting the gateways (it appears to be some flavour of memory leak). Investigation in progress.

– More hardware is being added to Echo: the remainder of the 16 generation and the start of the 17 generation.

– There has been a sudden increase in write rates by CMS. This was caused by the failure of a script that Andrew Lahiff had set up to automatically approve tape requests, which left a backlog of ~3 months of tape requests. We believe the backlog is now cleared.

DB enquired why this took a few months to recognise/fix and whether it is still relevant to mention. These were write-to-disk, providing an extra copy; highly critical data does not need extra approval. A script had been written to approve requests automatically, so the failure was not noticed for a while. A liaison needs to be assigned to understand the CMS box to ensure there is no repeat in the future. The CMS liaison place is stuck at the visa approval phase, which may be problematic under current guidelines, so alternative options are being considered to resolve this.

– We are running low on tapes (~500 tapes left = 4.25PB). We started a tape procurement on April 9th. This has been delayed due to an additional GDPR step; however, we expect delivery at the start of June. I have asked both ATLAS and CMS if they need to clean up data and, as it happens, they both have around 2.5PB which is scheduled for deletion. I don’t believe there is anything to worry about now, but if the CMS rate increase (see above) had persisted we could have had a problem.

SI-6 LCG Management Board Report of Issues (DB)
There has been no MB.

SI-7 External Contexts (PC)
UKT0 has decided the new name will be IRIS (innovation and research e-infrastructure for STFC). UKT0 will have IRIS in brackets after it and in the longer term will adopt the name IRIS fully. There is now a UKT0 delivery board (equivalent of the PMB) to ensure the project delivers, and other structures are being set up, e.g. a review board. STFC may also be asked for funding for project leadership and project management to capitalise some effort; we will put in a case for that in the coming week or so.

644.4: AS will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress). Ongoing.
663.2: PG will canvass sites to ascertain when they want to spend money and determine how disk will be phased out. Ongoing.
663.3: RJ and DC will advise how the experiments want disk divided for the start of Run 3 (Alice and LHCb are resolved). Ongoing.
663.4: PC will publish GridPP inputs to the Balance of Programmes Review on GridPP website. Done.
663.8: JC will examine GridPP staff roles/service/areas of expertise. Ongoing.
665.1: AD will raise issues relating to (VENDOR) delivery of h/w with Lindsay and Martin. Done.
665.2: AD will produce Procurement schedule for the coming FY to build in an additional month to buffer any delays in the future. Ongoing.
665.3: DB will follow up with RJ on the Atlas post. Done.
667.1: PG will clarify with STFC exactly what is required for the OC feedback with respect to the Capital reporting. Ongoing.
667.2: Need to do h/w planning before next OC to provide OC with details of shortfall in funds. Ongoing.

ACTIONS AS OF 14.05.18
644.4: AS will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress). Ongoing.
663.2: PG will canvass sites to ascertain when they want to spend money and determine how disk will be phased out. Ongoing.
663.3: RJ and DC will advise how the experiments want disk divided for the start of Run 3 (Alice and LHCb are resolved). Ongoing.
663.8: JC will examine GridPP staff roles/service/areas of expertise. Ongoing.
665.2: AD will produce Procurement schedule for the coming FY to build in an additional month to buffer any delays in the future. Ongoing.
667.1: PG will clarify with STFC exactly what is required for the OC feedback with respect to the Capital reporting. Ongoing.
667.2: Need to do h/w planning before next OC to provide OC with details of shortfall in funds. Ongoing.