GridPP PMB Meeting 661

GridPP PMB Meeting 661 (19.02.18)
=================================
Present: Pete Gronbech (Chair), Tony Cass, Pete Clarke, Jeremy Coles, Roger Jones, Dave Kelsey, Steve Lloyd, Gareth Smith, Louisa Campbell (Minutes).

Apologies: Dave Britton, David Colling, Tony Doyle, Andrew McNab, Andrew Sansum.

1. OSC Documents
================
PG shared the OSC document (version 0.6 on agenda). Drafts of all contributions have now been received. There is text in all sections, but DB needs to rework the Intro and there are some outstanding questions that should be addressed.
Page 3 – red comments from DB will be addressed, AS has provided the Tape document and is producing a document on the staff plan. There is also a question on whether UKT0 should be covered – PC will address this. DK will provide brief information on the EOSC hub project for inclusion.
GridPP5 status – all stats are up to date and Quarterly reports info has been incorporated. In the last few quarters there were amber responses from CMS at the Tier1 and it may be useful to include information on that – GS has provided this to the PMB. PG has discussed red and amber metrics.
Wider context – has been submitted.
Risk Register – has been agreed.
Tier1 report – AS has submitted.
Deployment status – JC has submitted.
Atlas – has been submitted.
CMS – has been submitted.
LHCb – has been submitted.
Other VOs – has been submitted.

2. GridPP40 Agenda Development
==============================
PC and PG discussed this and it is developing well but more talks may be offered. DB has asked James Adams to give a presentation on HNSciCloud – this will perhaps be slotted into session 6 beside Alexander Dibbo’s talk on the Cloud at RAL. SKA may not give a talk – PC is awaiting a response from them. There was a suggestion for moving the UKT0 talk to an alternative later slot – perhaps to the end of session 2. Possibly reschedule to DC (CMS Efficiency), then STFC (PC) then UKT0 (AM).
Session 3 is fully developed.
Session 4 is full – PC will provide a summary and Duncan will talk.
Session 5 is fully developed.
Session 6 – Containers and the WLCG working group are covered, and two further slots remain available.

3. Quarterly Report Summary
===========================
The contents were discussed, including CMS efficiency. PG asked if there was any connection with the number of jobs running – GS confirmed this was the case and that it is at the raised level. With the imminent start of the use of Echo this will be reviewed – GS tracks this and produces a weekly report. CMS issues should be covered in the report.

4. 2017 Review of Delivery to Experiments
=========================================
This has been done annually by PG, though not always formally. It considers the LHC experiments and other VOs and summarises the metrics. It also looks at the accounting and includes some comparisons between what was delivered and what was pledged. E.g. Tier1 CMS was below the expected figure, though it is better than last year. Atlas is also low – there is not a clear reason for this but GS will look at this and make a comment if appropriate.

The share of our resources going to other VOs has gone up from last year (100M PBs over 6 hours), demonstrating that we are now supporting more VOs, and at a higher level than they were expecting.

5. AOCB
=======
None

6. Standing Items
=================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
No technical meeting and nothing to report.

SI-1 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
Slight teething problems with the transfer, covered in the Tier1 Manager’s report.

SI-2 CMS Weekly Review and Plans (DC)
-------------------------------------
DC was not in attendance, no report submitted.

SI-3 LHCb Weekly Review and Plans (PC)
--------------------------------------
AM was not in attendance, no report submitted.

SI-4 Production Manager’s report (JC)
-------------------------------------
JC was not in attendance, no report submitted.

SI-5 Tier-1 Manager’s Report (GS)
---------------------------------
A report for the Tier1 covering the last two weeks.

Batch:
In my report of two weeks ago I stated that we had disproportionately large numbers of Alice jobs running on the batch farm – and we were running insufficient Atlas work. It was found that the Condor configuration led to some failed Atlas jobs not being cleaned up – hindering the running of new ones. This has been corrected – things look better and we are keeping a watch on it.

Castor:
There was an incident with our disk servers that deal with tape recalls/migrations on our ‘GEN’ instance. A network misconfiguration was introduced after they had been physically moved to a new rack location on Thursday (8th Feb). As a result, tape transfers (for the GEN instance) were not running the following weekend, with this issue being resolved on Monday morning.

Echo:
A high failure rate for both ATLAS and CMS for GridFTP transfer writes into Echo was reported on Friday (9th Feb). After investigation it was determined that, while the network (IP) ports were not exhausted, the GridFTP server believed they were. Further investigation also revealed that GridFTP requires a contiguous set of ports, which appears to cause the available ports to run out much more quickly than might be expected. So, for example, a 10-stream transfer would require 50000-50010 to be free. The available port range was increased (initially on one gateway node – then extended to all five). This appears to have resolved the problem.
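The effect of a contiguous-port requirement can be illustrated with a small model. The sketch below is not GridFTP's actual allocation logic; the function name, the 40-port range and the fragmentation pattern are all hypothetical, chosen only to show how a request for a block of consecutive ports (here 11, matching the inclusive 50000-50010 example) can fail while most ports in the range are still free, and how widening the range makes a block available again.

```python
from typing import Optional, Set


def find_contiguous_block(free_ports: Set[int], low: int, high: int, size: int) -> Optional[int]:
    """Return the first port of a run of `size` consecutive free ports in [low, high], or None."""
    run_start = None
    run_length = 0
    for port in range(low, high + 1):
        if port in free_ports:
            if run_length == 0:
                run_start = port
            run_length += 1
            if run_length == size:
                return run_start
        else:
            # An occupied port breaks the current run.
            run_length = 0
    return None


# Hypothetical configured range of 40 ports (illustrative, not the real gateway settings).
LOW, HIGH = 50000, 50039
free = set(range(LOW, HIGH + 1))

# Simulated fragmentation: every fourth port is held by some other transfer.
for port in range(LOW + 3, HIGH + 1, 4):
    free.discard(port)

free_count = len(free)                               # ports still free in the range
block = find_contiguous_block(free, LOW, HIGH, 11)   # an 11-port block such as 50000-50010
small = find_contiguous_block(free, LOW, HIGH, 3)    # a 3-port block still fits

# Widening the range by 20 unused ports makes an 11-port block available again.
NEW_HIGH = 50059
free_widened = free | set(range(HIGH + 1, NEW_HIGH + 1))
widened = find_contiguous_block(free_widened, LOW, NEW_HIGH, 11)

print(f"free ports: {free_count}, 11-port block: {block}, 3-port block: {small}")
print(f"after widening the range, 11-port block starts at: {widened}")
```

In this model three quarters of the ports are free, yet no 11-port block exists until the range is extended, which is consistent with the behaviour described above.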

Networking:
On Wednesday (21st) the IPv6 connections, both to the RAL Core network and the ‘bypass’ route, will be moved to share the IPv4 connections. We have been running with separate 10Gbit physical links for IPv6, as this aided investigation of problems. However, the time has come to move the IPv6 traffic onto the 40Gbit connections. This is being done ahead of a change to enable IPv6 (dual stack) on the Echo Gateways – i.e. providing IPv6 access to Echo.

Infrastructure:
– We await further updates regarding an ongoing problem with one of the BMS (Building Management Systems) in the R89 machine room. This has an intermittent fault.
– Work is going on to install cooling such that one row of racks in the R89 machine room will have water-cooled doors, opening up the possibility of higher-capacity racks.

Capacity Purchasing:
– The orders are due for delivery: storage in mid-March, and CPU the week after that.

SI-6 LCG Management Board Report of Issues (DB)
-----------------------------------------------
DB was not present, no report was submitted.

SI-7 External Contexts (PC)
---------------------------
PC, Dave Corney and Andrew submitted a business case to BAES for £4M per annum and will attend a meeting to defend this next week. This helps with computing for STFC and relieves the pressure on other things. If this succeeds, PC will determine if it is possible to convert some funds to staff effort for software for non-GridPP experiments, i.e. not purely Particle Physics.

Computing Review Panel update – PC summarised the concerns over no Particle Physicist being on the panel (now resolved) and the questionnaire that was circulated. PC called the Chair of the review panel, who asked for input on amending the questionnaire to use more appropriate language. PC suggested that he and DB will complete the questionnaire and add extra information in an appendix. PC will write a letter to the Chair thanking him for the opportunity to contribute more fully. The questionnaire was not sent to any other Particle Physics experiments except for Dune – it was sent to several in the Astronomy community, though this may be a positive outcome. PC and DB will request experiment support and input once a draft has been developed – RJ will mention to Atlas and DC will mention to CMS that this is being developed and that their input will be requested after the draft is available. PC concentrated mainly on the infrastructure element, but may include mention of software staff.

REVIEW OF ACTIONS
=================
644.3: AS to put together a starting plan for staff ramp-down. (Update: a draft will be produced in January). Ongoing.
644.4: AS will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY?) Ongoing.
OSC documents MUST be done and submitted to PG this week.
649.2: PC will write Wider Context of OSC documents. Ongoing.
649.4: GS and AS will write the Tier1 Status section of OSC documents. Done.
649.5: JC will write Deployment Status section of OSC documents with input from PG. Done.
649.6: RJ, DC and AM will write LHC section of User Reports in OSC documents. Done.
649.7: JC will write Other Experiments section of User Reports in OSC documents with input from DC and PG. Done.
655.3: PG to consider the agenda and date for Tier1 review and include disaster recovery plans. (UPDATE: appropriate dates are being considered with AS). Ongoing.
656.1: DK will report before the end of February on any actions GridPP should take to comply with GDPR. Ongoing.
656.2: DC will report on CPU efficiencies. Ongoing.
657.2: DC to report on the CMS taskforce. Ongoing.

ACTIONS AS OF 19.02.18
======================
644.3: AS to put together a starting plan for staff ramp-down. (Update: a draft will be produced in January). Ongoing.
644.4: AS will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY?) Ongoing.
649.2: PC will write Wider Context of OSC documents. Ongoing.
655.3: PG to consider the agenda and date for Tier1 review and include disaster recovery plans. (UPDATE: appropriate dates are being considered with AS). Ongoing.
656.1: DK will report before the end of February on any actions GridPP should take to comply with GDPR. Ongoing.
656.2: DC will report on CPU efficiencies. Ongoing.
657.2: DC to report on the CMS taskforce. Ongoing.