GridPP PMB Meeting 667 (30.04.18)
=================================
Present: Dave Britton (Chair), Tony Cass, Pete Clarke, Jeremy Coles, David Colling, Tony Doyle, Pete Gronbech (Minutes), Roger Jones, Dave Kelsey, Steve Lloyd.

Apologies: Alastair Dewhurst, Andrew McNab, Andrew Sansum.

1. OC Feedback
==============
1. Capital funding: there was ambiguity over whether we should include additional capital when reporting against our original budget.

PG Action: Clarify with STFC what is required.

2. Noted that operation against metrics is fine.

3. Our estimate of the capital shortfall for FY19 to meet April 2019: they seem to recognise that we are in a chronic shortfall due to exchange rates etc. They are supportive and will help if possible.

There is an action on us to do h/w planning for the rest of the project before the next OC.
DC notes LZ will be ramping up, although it is small with respect to the LHC.

4. CASTOR replacement.
They support our current direction of travel, looking at commercial solutions.

5. Oracle tape plan.
They are happy with our plan to use the T10KD generation to the end of the project. We have noted that we will end the project with a very full tape store, which will mean something big is required at the start of GridPP6.

6. T1 staff effort: approved to carry forward 1.5 FTE. TM approved it on the basis of the OSC recommendation. The money cannot yet be released, but the carry-forward can go ahead.

7. Staff turnover rate: our request was to aim to staff at 110% in order to try to average out at 100%. TM was not comfortable with this, although the OC is supportive of possible ways forward.

8. We have to spend the T2 money according to the h/w profile.

9. CDT comments noted.

10. UKT0 development noted.

11. Noted the T1 manager change.

12. Next meeting ~Nov 2018.

Overall we don't see anything particularly negative; the feedback is generally supportive.

ACTION 667.1 PG to clarify with STFC exactly what is required for capital reporting in response to the OC feedback.
ACTION 667.2 Hardware planning to be done before the next OC, to provide the OC with details of the shortfall in funds.

2. F2F Agenda
=============
Potential date of 6th June.
TC might not make it. DK is at CERN. JC has a meeting in Cambridge.
DC is free.
It is not a critical meeting, but a chance to sit around a table, look at roles in GridPP5 and start to push ideas forward.

3. GDPR
=======
DK spoke on GDPR at the WLCG workshop in Naples, and also at the MB.
DB added that he had spoken to someone who had been involved in the legislation and who was less sanguine about it.
There will be early fines to fund the enforcement process, making it self-funding.
He thought there could be shots fired at academia to make examples.

DK We have email and institute data. Academia is taking it seriously.
We hope that the code of conduct is approved; if it is refused, academia will need to think about what it does next.

DB just wants to make sure our head is below the parapet. We need a privacy statement for all services.

DB When we open registration for GridPP41, do we need a privacy statement?
DK thinks it must be yes. It should describe what we do with the data and when we will delete it.
TC thinks Louisa is aware of this for local reasons.

Jens: Certificate renewal … is there a policy in place?

The use of passports for critical ID purposes was noted.
PG What about lists of attendees at past meetings?
DB Could you be prosecuted over data collected 10 years ago?
TC Yes; normally data can only be kept for 6 months.
DB We have 660+ sets of PMB minutes, with lists of attendees, public on the web.

Action: noted that GridPP is not a legal entity, but it could be embarrassed.

4. AOCB
=======
None

5. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
Nothing significant to report.

SI-1 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
Nothing significant to report.

SI-2 CMS Weekly Review and Plans (DC)
-------------------------------------
Nothing significant to report.

SI-3 LHCb Weekly Review and Plans (PC)
--------------------------------------
Nothing significant to report.

SI-4 Production Manager’s report (JC)
-------------------------------------
1. I circulated the WLCG T2 A/R (availability : reliability) figures for March, which are as follows (a sketch of how the two figures relate follows the site responses below):

ALICE http://wlcg-sam.cern.ch/reports/2018/201803/wlcg/WLCG_All_Sites_ALICE_Mar2018.pdf

All okay

ATLAS http://wlcg-sam.cern.ch/reports/2018/201803/wlcg/WLCG_All_Sites_ATLAS_Mar2018.pdf

QMUL: 84%:84%
RHUL: 73%:73%
Manchester: 89%:100%

CMS http://wlcg-sam.cern.ch/reports/2018/201803/wlcg/WLCG_All_Sites_CMS_Mar2018.pdf

Bristol: 67%:67%

LHCb http://wlcg-sam.cern.ch/reports/2018/201803/wlcg/WLCG_All_Sites_LHCB_Mar2018.pdf

QMUL: 87%:87%
RHUL: 74%:74%
Manchester: 89%:100%

The sites responded as follows:

· Manchester was in downtime for a few days to reorganise their machine room.

· RHUL had some problems with their network and a CE.

· QMUL had a lot of issues with a particular type of ATLAS workload that kept causing a specific batch of hardware to crash and use up the local disk space, or to reboot the node.

· Bristol TBC.
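
As an aside on how a site can show different paired figures (e.g. Manchester's 89%:100%), below is a minimal sketch of the usual WLCG-style A/R arithmetic, in which reliability excludes declared scheduled downtime. The definitions and numbers are our own illustration, not taken from the report.

    # Illustrative WLCG-style availability/reliability calculation.
    # Definitions as commonly used in WLCG A/R reporting (an assumption
    # here, not quoted from the March report):
    #   availability = time_ok / total_time
    #   reliability  = time_ok / (total_time - scheduled_downtime)

    def availability(time_ok, total_time):
        # Fraction of the whole period the site was passing tests.
        return time_ok / total_time

    def reliability(time_ok, total_time, scheduled_downtime):
        # As availability, but declared (scheduled) downtime is removed
        # from the denominator, so planned interventions are not penalised.
        return time_ok / (total_time - scheduled_downtime)

    if __name__ == "__main__":
        total = 31 * 24.0   # hours in March
        sched = 3.5 * 24.0  # hypothetical: ~3.5 days' declared downtime
        ok = total - sched  # site passed tests whenever not in downtime
        print("availability: %.0f%%" % (100 * availability(ok, total)))       # ~89%
        print("reliability:  %.0f%%" % (100 * reliability(ok, total, sched))) # 100%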

2. Other VOs: we are actively following up with LSST on an issue with a VM still being paired to the gridpp-cernvm context. GalDyn have expressed an interest in running some new orbit simulations; they hinted that previous attempts ran into certificate issues, but this may have been due to the lack of a local RA. Lancaster are stepping in to help out.

3. Other activities: WebDAV endpoints are being put into GOCDB. Singularity enablement for CMS is almost complete; we are just checking on the Bristol status (a minimal smoke-test sketch is shown below). Finally, the CentOS7 rollout is seeing steady progress across sites.
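
On the Singularity enablement, the underlying check is simply that an unprivileged job can start a container on a worker node. A minimal smoke-test sketch follows, assuming the 'singularity' binary is on the PATH; the image path is a hypothetical placeholder, not the actual CMS validation image.

    #!/usr/bin/env python3
    # Minimal Singularity smoke test for a worker node (illustrative sketch).
    import subprocess
    import sys

    # Hypothetical unpacked image directory; a real site would point at a
    # CVMFS-distributed image area instead.
    IMAGE = "/cvmfs/example.org/images/centos7"

    def singularity_ok(image):
        # Return True if an unprivileged 'singularity exec' succeeds.
        try:
            result = subprocess.run(
                ["singularity", "exec", image, "echo", "container OK"],
                capture_output=True, text=True, timeout=120,
            )
        except (OSError, subprocess.TimeoutExpired) as exc:
            print("singularity could not be run: %s" % exc)
            return False
        print(result.stdout.strip())
        return result.returncode == 0

    if __name__ == "__main__":
        sys.exit(0 if singularity_ok(IMAGE) else 1)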

SI-5 Tier-1 Manager’s Report (AD)
---------------------------------
GS supplied the following:
Here is a brief comment on the last week, specifically on the new RAL firewall and some problems that followed.

The main point was the scheduled RAL firewall replacement last Wednesday (25/4/18). This appeared to be completed successfully. However, during that afternoon the connection between the Tier1 network and the RAL core was failing intermittently, and the problem seemed to get worse as the afternoon progressed. Due to disruptions to our connectivity both on and off site, caused (we think) by running on the new secondary firewall, we flipped the Tier1 connection to go through our 'standby' router. The change occurred at approximately 17:00 on the Wednesday and normal service was subsequently restored. However, it was only realised on the Thursday morning that, although our IPv4 traffic had been running well overnight, our IPv6 traffic was not. This was fixed first thing Thursday morning (by also moving it to the link through our standby router). We have remained in this configuration since.

There are two firewalls at RAL. We are currently running on the (new) secondary one. This coming Wednesday (2nd May) the Networking Team will perform a switch back to the (new) primary firewall to check that the problems that had been seen are now resolved.
AD added:
– The Rucio workshop was a success. An instance was set up and we started moving SKA (well, LOFAR) data around; a sketch of the kind of rule-based transfer involved follows after this list. I will provide more details at a UKT0 meeting.

– We have started the procurement of some more tape media. At the current rate of consumption we have only ~2 months left, so we will be buying around £100k worth to prevent us from running out. This will be under the threshold for going out to full tender, and Tim is confident it will be delivered in a timely manner. We will need to do a larger procurement later in the year.

– The primary OPN link went down on Tuesday morning (links two and three stayed up). This was fixed by Wednesday around 3pm; JANET fixed it by rebooting something.

– Wednesday's firewall replacement caused problems. We had tickets from both LHCb (couldn't access srm-lhcb) and CMS (FTS transfers failing). After the change, we saw frequent periods of a couple of minutes where machines had no outbound connectivity. At 17:00 Martin flipped the master-slave router pair. This appeared to fix one problem but caused the IPv6 traffic to go down; this was fixed by Thursday morning.

– We have restarted all the XRootD gateways on the WNs to fix a problem with CMS SAM tests.

– A couple of CASTOR disk servers called out during the week but were quickly fixed.
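
For context on the Rucio item above: data movement in Rucio is driven by replication rules rather than explicit copies. Below is a minimal sketch using the Rucio Python client, assuming an already-configured client environment; the scope, dataset name and RSE are hypothetical placeholders, not the actual LOFAR setup.

    # Illustrative rule-based data movement with the Rucio Python client.
    from rucio.client import Client

    client = Client()

    # Hypothetical DID (data identifier) and destination storage element (RSE).
    dids = [{"scope": "lofar_test", "name": "example_dataset"}]

    # Ask Rucio to keep one replica of the dataset at the destination RSE;
    # Rucio then schedules whatever transfers are needed to satisfy the rule.
    rule_ids = client.add_replication_rule(
        dids=dids,
        copies=1,
        rse_expression="RAL-ECHO-TEST",  # hypothetical RSE name
        lifetime=86400,                  # seconds; the rule expires after a day
    )
    print("created rules:", rule_ids)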

SI-6 LCG Management Board Report of Issues (DB)
-----------------------------------------------
Nothing to report.

SI-7 External Contexts (PC)
---------------------------
Nothing to report.

REVIEW OF ACTIONS
=================
663.2: PG will canvass sites to ascertain when they want to spend money and determine how disk will be phased out. Ongoing.
663.3: RJ and DC will advise how the experiments want disk divided for the start of Run 3 (ALICE and LHCb are resolved. Update: the algorithm needs discussion and thought; feedback from sites on what storage is where will help). Ongoing.
663.4: PC will publish our input to Balance of Programmes Review on GridPP website (PG to complete). Ongoing.
663.8: JC will examine GridPP staff roles/service/areas of expertise. (Update: PG supplied the list of roles but JC has not yet worked further on this). Ongoing.
663.9: AM will share baseline of interfaces he will draw up for UKT0 participating sites before a F2F in June. (Update: AM advised the two documents in the last two actions now exist but are still being discussed with UKT0). Done
663.10: AM will share list of interfaces which experiments need to be able to participate in the UKT0 service. (Update: AM advised the two documents in the last two actions now exist but are still being discussed with UKT0). Done.
665.1: AD will raise issues relating to (VENDOR) delivery of h/w with Lindsay and Martin.
665.2: AD will produce a procurement schedule for the coming FY, building in an additional month to buffer any delays in the future.
665.3: DB will follow up with RJ on the ATLAS post.
666.1: DB to set up Doodle poll about F2F in June. Done

ACTIONS AS OF 30.04.18
======================
663.2: PG will canvass sites to ascertain when they want to spend money and determine how disk will be phased out. Ongoing.
663.3: RJ and DC will advise how the experiments want disk divided for the start of Run 3 (ALICE and LHCb are resolved. Update: the algorithm needs discussion and thought; feedback from sites on what storage is where will help). Ongoing.
663.4: PC will publish our input to Balance of Programmes Review on GridPP website (PG to complete). Ongoing.
663.8: JC will examine GridPP staff roles/service/areas of expertise. (Update: PG supplied the list of roles but JC has not yet worked further on this). Ongoing.
665.1: AD will raise issues relating to (VENDOR) delivery of h/w with Lindsay and Martin.
665.2: AD will produce a procurement schedule for the coming FY, building in an additional month to buffer any delays in the future.
665.3: DB will follow up with RJ on the ATLAS post.
667.1: PG to clarify with STFC exactly what is required for capital reporting in response to the OC feedback.
667.2: Hardware planning to be done before the next OC, to provide the OC with details of the shortfall in funds.