GridPP PMB Meeting 634 (22/05/17)
=================================
Present: Dave Britton(Chair), Tony Cass, Pete Clarke, Jeremy Coles, David Colling, Tony Doyle, Pete Gronbech, Steve Lloyd, Andrew McNab, Andrew Sansum, Gareth Smith, Louisa Campbell (Minutes).

Apologies: Roger Jones, Dave Kelsey.

1. Quarterly Report
====================
PG confirmed all reports have now been received and he will update the project map (graphic) and stats then summarise for PMB and assess metrics for insertion into the OC report.

2. OC Docs
==========
PG summarised the status of each section in the draft he had circulated:
Introduction – DB has completed the introduction which highlights the various features of the report. There was some discussion on the fine detail of aspects included in this section.
Wider Context – DB will finish off this aspect. PG will work this up now that he has received the quarterly reports.
Risk Register – See below.
Tier1 – AS provided text for this section and will add some information under the management section, then edit it. He will include some high-level information on procurement and possibly mention the impact of exchange rates driven by issues around Brexit and the forthcoming General Election. He will also mention the Tier1 Review.
Deployment Status – JC has provided most of the text for inclusion which will be edited.
ATLAS – RJ has submitted draft text for inclusion.
CMS – PG has adjusted the formatting of this section to align with others. DC is working on the relevant text – he will mention the task force he is heading up across CMS.
LHCb – AM has provided text for this section.
Other VOs – JC and DC have provided text for this section, JC will review the text and refine it before DB reads through.
Impact and Dissemination – SL provided text for this section. The Dissemination Officer post currently being advertised is included.
Financial section – PG is currently putting this, the risk register and the project map together. He normally collects actual spend from all Tier2 sites and has received all but one. The information is held in a spreadsheet and PG will include accompanying notes.

3. Risk Register
================
Ceph risks were discussed separately at the F2F in Sussex.
a) Castor – recent issues may slightly elevate the risk to 7 in line with Tier1 risks, but the forthcoming upgrade will improve the situation and reduce that risk. The action required is to continue migrating off Castor.
b) Ceph – likelihood could be reduced to 3, noting that the project is making good progress and the risk is reduced.
c) Tier1 – the cooling upgrades and UPS upgrades are done – no change.
d) Failure of Tier1 to meet commitments – should increase to amber.
e) Loss of custodial data at Tier1 – new Echo storage system is now live so may very slightly increase risk due to potential data loss with a new system. The risk should be increased very slightly but not to amber.
f) Substantial loss of h/w to fire – no change.
g) Disaster at Tier1 – no change.
h) RAL – no change.
i) Failure to deploy/operate h/w at Tier1 sites – risk for deployment should not change.
j) Insufficient network bandwidth – this should be slightly decreased to green.
k) Over contention of resources – no change.
l) Difficulty with STFC budgets with resource – remains high for the forthcoming financial year, although this is slightly more secure and should probably reduce slightly. Reduce to amber.
m) Technology shifts in WLCG – no change.
n) Loss of experienced personnel at Tier2 – e.g. Brexit. No change.
o) Insufficient funding to meet h/w commitments at Tier2 – no change.
p) Middleware at Tier2 unable to cope with demand (e.g. problems with multicore) – no change.
q) Experiment software runs insufficiently on the Grid – the last quarter at RAL confirms this is accurate and this has been an issue so mitigation processes have been followed. Increase risk slightly.
r) Reputational risk due to security problem – this is very topical at the moment and the impact is potentially slightly higher; raising it also demonstrates how seriously we take security and the steps we take to mitigate (in real terms and publicity terms). Increase to amber for impact; however, reduce likelihood.
s) Non-availability of Tier1 and Tier2 service due to security vulnerability (i.e. risk through security incident). The likelihood is increased if things have to be taken down to avoid compromise – some issues that could arise are challenging to fix. Increase risk.
t) Insufficient effort to support VOs or users – no change.
u) Mismatch between budget and h/w costs. This remains high, particularly in light of recent issues – no change.
v) Funding for central services that GridPP relies on underfunded (e.g. NGI, Dirac). No change.
w) Breakdown of core operation structure (e.g. EGI infrastructure) – no change.
x) Insufficient travel funds – no change.
y) GridPP resources prove insufficient for actual requirements. This is potentially increased and may be impacted by exchange rates – this should increase to 9 or 10 (not red for impact as this is dependent on deficit).
z) Critical middleware no longer supported – no change.
aa) Unplanned infrastructure costs – no change.
bb) EGI does not continue due to UK no longer continuing to be a member – no change.
cc) Financial uncertainty – no change.
dd) Conflicting opinions about GridPP – no change.
ee) Failure to achieve further integration. We continue to develop new communities – no change.

4. AOCB
=======
a) There was some discussion on appropriate dates for forthcoming meetings given the complexities of availability: a bank holiday next week, many members away the following week, DB at a symposium in Bath the week after, then the WLCG workshop in Manchester (PG, JC and AM will also attend), annual leave, etc. For example, the OC documents must be submitted on Wednesday but many members are unavailable on Monday 29th May. The following was agreed:

Next PMB Meeting agreed Tuesday 30 May 1pm, then:

Monday 12th June (DB cannot attend – PG will chair)

Tuesday 20th June for members available at WLCG in Manchester

Monday 26th June

Summer (alternate weeks): 3rd, 17th and 31st July

Monday 14th August

Louisa has circulated potential meeting dates in between

F2F 13th September at Lancaster

b) Updates for the OC documents – these should be emailed to PG and he will cut and paste into the main document.

5. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
Nothing to report

SI-1 Dissemination Report (SL)
------------------------------
Nothing to report

SI-2 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
Nothing to report.

SI-3 CMS Weekly Review and Plans (DC)
-------------------------------------
Nothing to report.

SI-4 LHCb Weekly Review and Plans (PC)
--------------------------------------
AM noted an unusual comment in the operations meeting re SOM at RAL providing the wrong transfer URL – AM is investigating.

SI-5 Production Manager’s report (JC)
-------------------------------------
A potential issue with downtime notifications is currently under discussion. CHEP papers are currently being considered.

SI-6 Tier-1 Manager’s Report (GS)
---------------------------------
Infrastructure:
---------------
– Following the UPS failure, a replacement was installed at the end of the week before last. On Tuesday last week (16th) the new UPS was put in service. A load test (of UPS and generator) was then successfully carried out on Wednesday morning (17th).

Castor:
-------
– The LHCb Castor instance has performed OK since it was upgraded on the 11th May. However, it had been hoped to run a load test against it while the Castor Team were at CERN last week, but that did not happen. The upgrades of the ATLAS and CMS instances are scheduled for Tuesday and Thursday of this week (GEN to follow).

Networking:
-----------
– As reported last week, central networking are investigating a problem with the site firewall that seems to affect some data flows – in particular it has been affecting videoconferencing. It is not clear if this could be having any effect on our services. I do not have any further information on this.
– The cable for the third 10Gbit link to CERN for the OPN is now in place. Some work is needed on the OPN router to configure this – this will need to be done in conjunction with CERN.

SI-7 LCG Management Board Report of Issues (DB)
-----------------------------------------------
DB was unable to attend – AM attended and advised there was nothing of significance raised. DB summarised the agenda. Hannah gave the talk on the endorsement the PMB provided to DK last week. There was a push on the WLCG workshop in Manchester which is now looking very well attended.

SI-8 External Contexts (PC)
---------------------------
Nothing of significance to report. Charlotte circulated a questionnaire regarding EGI and its future role re the cloud, etc. PG submitted a response based around discussions with other members and AS will respond to specific questions.

REVIEW OF ACTIONS
=================
630.2: DB and PG will continue to work on metrics and funding strategies at the macro level. Ongoing.
630.3: DB will tweak his metrics and funding model based on CPU. Ongoing.
631.2: ALL to work on OC documents for submission by end May. Ongoing.
632.1: DB will work on the Introduction of OC doc. Ongoing.
632.2: DB and PC will work on Wider Context section of OC doc. Ongoing.
632.3: PG will work on PI5 status and Risk Register of OC doc. Ongoing.
632.4: AS will work on Tier-1 section of OC doc. Ongoing.
632.5: JC will work on Deployment Status of OC doc. Ongoing.
632.6: RJ will work on ATLAS section of OC doc. Ongoing.
632.7: DC will work on CMS section of OC doc. Ongoing.
632.8: AM will work on LHCb section of OC doc. Ongoing.
632.9: JC and DC will work on Other VOs section of OC doc. Ongoing.
632.10: SL will work on Impact and Dissemination section of OC doc. Ongoing.
633.1: AS will put together a proposal for HAG. Ongoing.

ACTIONS AS OF 22/05/17
======================
630.2: DB and PG will continue to work on metrics and funding strategies at the macro level. Ongoing.
630.3: DB will tweak his metrics and funding model based on CPU. Ongoing.
631.2: ALL to work on OC documents for submission by end May. Ongoing.
632.1: DB will work on the Introduction of OC doc. Ongoing.
632.2: DB and PC will work on Wider Context section of OC doc. Ongoing.
632.3: PG will work on PI5 status and Risk Register of OC doc. Ongoing.
632.4: AS will work on Tier-1 section of OC doc. Ongoing.
632.5: JC will work on Deployment Status of OC doc. Ongoing.
632.6: RJ will work on ATLAS section of OC doc. Ongoing.
632.7: DC will work on CMS section of OC doc. Ongoing.
632.8: AM will work on LHCb section of OC doc. Ongoing.
632.9: JC and DC will work on Other VOs section of OC doc. Ongoing.
632.10: SL will work on Impact and Dissemination section of OC doc. Ongoing.
633.1: AS will put together a proposal for HAG. Ongoing.