GridPP PMB Meeting 684 (29/10/18)
=================================
Present: Dave Britton (Chair), Jeremy Coles, Alastair Dewhurst, Pete Gronbech, Roger Jones, Dave Kelsey, Steve Lloyd, Gareth Roy, Louisa Campbell (Minutes).

Apologies: Tony Cass, Pete Clarke, David Colling, Tony Doyle, Andrew McNab, Andrew Sansum.

1. Risk Register
================
PG uploaded the latest version of the risk register for discussion.

Castor storage – this had previously been elevated and it was suggested that it should now be reduced. AD confirmed the risk has significantly decreased, in line with his text for the OSC document. It should go down to 2 (likelihood) and 8 or 9 (impact) (DB confirmed the difference between residual and inherent risks).

Tier-1 project fails for 1 week or more – no change.

Failure of Tier-1 to meet SLA – refers to availability/reliability (not h/w) – perhaps reword to ‘fails to meet MOU service level of commitment’. Delivering pledges through h/w as an additional risk (4b).

Significant loss of custodial data on Tier-1. No change at the top level, but custodial data is on CEPH (tape) and the inherent risk may have risen slightly due to Oracle tapes and the end of Oracle life, which could potentially lead to support taking longer. DB noted a recent situation where NA62 deleted 150TB from EOS (Monte Carlo for the past 4 years); a copy of everything at RAL is on tape and can be copied over. The retrieval rate is over 98.5% and it was questioned whether this is a reasonable success rate. AD suggested this should be a 100% success rate, as the loss of more than one tape may be unacceptable.

Substantial loss/damage of h/w at Tier-2 exceeding £2M. Remain the same.

Significant disaster at Tier-1 – ie long outage. Remain the same.

Failure to recruit key technical staff at RAL – was previously high (recruitment and retention issues must be resolved differently with e.g. more young people employed and trained) – inherent risk remains the same, but the mitigation is helping (move away from Oracle may help in the longer term). Remain the same.

Failure to deploy or operate h/w at GridPP sites (Tier-1 and Tier-2) – relates to difficulty of procurement etc. Remain the same.

Insufficient bandwidth – last time the risk was raised but this is improving at RAL. Should be reduced for Tier-1. Should reduce likelihood by 1 point with comments.

Competition for resources – continuing to see increase in demand but now also have IRIS deployed in GridPP which mitigates matters. Remains the same.

Difficulty with STFC budgets. Remains the same.

Technology shift – WLCG technology not supported at Grid. Remains the same.

Loss of experienced personnel at Tier-2 (should be discussed at the next OSC). Remains the same.

Insufficient capacity to meet commitment at Tier-2. Remains the same.

Software at Tier-2 sites. Remains the same.

Reputational risk due to serious security problem. Remains the same.

19 – remains the same.

Insufficient efforts to support VOs or users. Last time this was raised. Remains the same.

Mismatch between budget and h/w costs. This year we will hit the pledge – likelihood has slightly decreased but risk is still high due to high exchange rates re Brexit. Remains the same (but likelihood reduced by 1).

Funding for software reduced. EOSC hub funding ends in 2020 so this remains the same.

Breakdown of core operations and structures (similar to above). Remains the same.

Insufficient travel funds. The fall in the value of the pound sterling could have an impact. Residual should perhaps rise by 1 or 2.

GridPP resources prove insufficient for actual requirements – we aim to meet 100% target (comments to be changed). Remains the same.

Middleware no longer supported – LFC impact to non-LHC VOs and other questions re EOS future, Castor, etc so it was suggested the risk should perhaps increase slightly. Inherent at amber and residual remains in green.

Infrastructure costs. Remains the same.

EGI does not continue or UK does not continue to be a member (ie Brexit impact). Likelihood slightly increased and Impact also should increase slightly.

Uncertainty – raise next time for GridPP6 when there will be uncertainty as to what is being funded in GridPP6. Remains the same.

Conflicting opinions among GridPP stakeholders. Remains the same.

Further integration with PPAN community. Likelihood reduced due to IRIS (reduce to 4 in green).

ACTION 684.1: PG will contact the owners of risks who were absent from today’s PMB to confirm they are satisfied with the decisions taken on the risk register.

2. Update on CMS Disk request
=============================
AD provided an update after his recent visit to CERN. The only guarantee about giving space back was that other Tier-1 sites will be given space back and data moved across. We need not feel obliged to solve problems of CMS beyond our remit. The impression is that there are some positive things going on, but it is not certain we will be given the space back as this is reliant upon other people responding in a timely manner. It was proposed that, if possible, we give next year’s pledge early and await any further requests. This was agreed.

3. Coordinating Rucio Work
===========================
Relates to an email from AD – given the number of issues around Rucio at RAL, Imperial and Edinburgh, we should have some way of coordinating Rucio work within the UK. DB proposed the Technical Group should act as the forum for coordinating this work, as it is chaired by DC who is one of the key Rucio people. It was agreed this seems appropriate.

ACTION 684.2: DB will write to AD to suggest the Technical Group take on the coordinating of Rucio.

4. OSC document status
======================
PG discussed contributions received. SL has made some comments, but other contributions have not been received, although there have been some issues with uploading documents. DB cannot write the overview until all the contributions are submitted.

DB has completed the beginning sections but has some questions. For example, in accounting and delivery by Tier-1 over 2018, the contribution from CERN has doubled (27 increased to high 40s). He asked for reasons and AD will raise this with ATLAS, but it probably relates to high level trigger farms; DB may discuss with John Gordon. There has always been a Switzerland contribution to the Tier-1 pie chart, but it is not clear why this has recently increased significantly. The UK contribution is therefore significantly less than that of other EU participants, e.g. KIT, so we should address this.

Wider context – PC will undertake this week.

GridPP5 status – PG will undertake.

Risk Register – PG will update after today’s decisions.

Tier-1 status – AD has started a draft; the higher priority at the moment is to get tenders submitted, after which he can undertake the OSC section. He will have a draft prepared later this week and provide it to DB to write the Overview section (deadline 2 weeks). PG noted the Tier-1 section has historically been quite lengthy, but this may no longer be necessary.

Deployment status – JC was not able to join today for more than a short period due to connection issues.

LHCb – To be done.

ATLAS – To be done.

Security – To be done.

DK will provide PG with final outcome figures this week.

ACTION 684.3: DB will contact the authors of the OSC documents asking them to complete their sections this week.

5. AOCB
=======
None.

6. Standing Items
=================

SI-0 Bi-Weekly Report from Technical Group [DC]
———————————————–
DC not present and no report submitted.

SI-1 ATLAS Weekly Review and Plans [RJ]
—————————————
RAL 0 tape access to new stage not disruptive. There is an ongoing exchange at Birmingham about storage. RJ agrees with AD’s concerns that they are making extra work by having the 2 servers, but the suggestion of using the Manchester disk could be an issue that needs investigation. This will be discussed in the storage group at 10.00am on Wednesday, and RJ or AD hopes to attend.

SI-2 CMS Weekly Review and Plans [DC]
————————————-
DC not present and no report submitted.

SI-3 LHCb Weekly Review and Plans [AM]
————————————-
AM not present and no report submitted.

SI-4 Production Manager’s weekly report [JC]
——————————————–
JC not present and no report submitted.

SI-5 Tier-1 Manager’s weekly report [AD]
—————————————–
– Availability of the CMS-AAA service is improving. Machines have had their memory doubled, which has stopped them going into swap. The other problem is a bug which is fixed in a later version of XRootD than the one we are running. We will need to recompile XRootD, create an RPM and deploy it on the boxes.

– Patching of machines against CVE-2018-14634 is in progress. The WNs that were most at risk of being exploited by this vulnerability were done on 22nd – 23rd October. We still need to reboot around 500 other machines (total 1500).

– OPs tests are finally passing for Echo, resolving a GGUS ticket that was open for nearly 2 years!

– Ceph-sn973 developed a hardware fault and was removed from production on Saturday 27th. No degradation of service / unavailability of data (as expected).

– There has been a ~two week delay to the procurement of the disk and CPU. This is because SBS were dealing with a legal challenge to a bid that was similar to our own (specifying that we award the contract to two separate vendors). The tenders should be submitted tomorrow. This delay should not cause any issues on delivery before April this year. The amount of time to test and deploy all the hardware before April 1st is tight.

SI-6 LCG Management Board Report of Issues [DB]
———————————————–
No MB.

SI-7 External Contexts (eg NGI/EGI)
———————————–
Nothing to report.

REVIEW OF ACTIONS
=================
644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress. 08/10/18 – Leicester are producing a PO for tapes and will send to AD to produce an invoice). Ongoing.
667.2: PG will do h/w planning before next OC to provide OC with details of shortfall in funds. (Update: PG will check the OSC minutes for details and cover with GR). Ongoing.
675.1: DC to sign off report on Tier-1 LHC usage. Done.
678.2: DK to finalise the Security, Trust and Identity background document by mid October. (Update: DK and David Crooks have been working on this and it is nearly complete) Ongoing.
678.3: AD to finalise the Tier1 background document, including tape strategy by end September. (Update: Almost complete and will circulate current iteration for comment). Ongoing.
678.5: JC to finalise the Storage background document by end September. (Update: 17 October meeting with Tony Medland – DB and PC will attend. This is almost complete and awaiting a few minor elements to be worked in – GR will upload into Googledocs for info). Ongoing.
680.2: JC will follow up GDPR implications relating to VOMS with DK. Ongoing.
683.1: DB will write to Mark at Birmingham on ALICE to consider the best solution for EOS proposals at Birmingham. Done.
683.2: DB will write to AS confirming GridPP are happy to continue providing resources using processes already in place (quarterly resource meeting) wherever possible, though are unable to agree for the resource allocation committee to directly allocate resources. Done.
683.3: AD and DC will provide the PMB with information subject to provision of 650 PTB with a plan of how and when it will be returned. Done.
683.4: DB will write to AS advising GridPP strongly support the SCD Strategy for Tape Storage and see this as a way of containing future costs. Done.

ACTIONS AS OF 29.10.18
======================
644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress. 08/10/18 – Leicester are producing a PO for tapes and will send to AD to produce an invoice). Ongoing.
667.2: PG will do h/w planning before next OC to provide OC with details of shortfall in funds. (Update: PG will check the OSC minutes for details and cover with GR). Ongoing.
678.2: DK to finalise the Security, Trust and Identity background document by mid October. (Update: DK and David Crooks have been working on this and it is nearly complete) Ongoing.
678.3: AD to finalise the Tier1 background document, including tape strategy by end September. (Update: Almost complete and will circulate current iteration for comment). Ongoing.
678.5: JC to finalise the Storage background document by end September. (Update: 17 October meeting with Tony Medland – DB and PC will attend. This is almost complete and awaiting a few minor elements to be worked in – GR will upload into Googledocs for info). Ongoing.
680.2: JC will follow up GDPR implications relating to VOMS with DK. Ongoing.
684.1: PG will contact the owners of risks who were absent from today’s PMB to confirm they are satisfied with the decisions taken on the risk register.
684.2: DB will write to AD to suggest the Technical Group take on the coordinating of Rucio.
684.3: DB will contact the authors of the OSC documents asking them to complete their sections this week.