GridPP PMB Meeting 656 (15.01.18)
=================================
Present: Dave Britton (Chair), Pete Clarke, Jeremy Coles, David Colling, Pete Gronbech, Roger Jones, Dave Kelsey, Steve Lloyd, Andrew McNab, Gareth Smith, Louisa Campbell (Minutes).

Apologies: Tony Cass, Tony Doyle, Andrew Sansum.

1. Status of DMP
================
The Data Management Plan has been created by PG based on previous documents and circulated to the PPGP for comment/amendment. A reminder was issued but no response has been received as yet, so the document has been circulated to the PMB with the caveat that it has not had any final amendments. Any suggested amendments received will be circulated to the PIs.

2. Security Update – Meltdown and Spectre
=========================================
DK updated the PMB on the Meltdown and Spectre set of vulnerabilities, which became public after New Year. We have a good track record here, as RAL first noticed the issue and highlighted it to the UK security team. A warning was circulated on 03.01.18, with updates the following week from the security officer on 11.01.18. TC circulated information from meetings at CERN last week. The underlying fault is that speculative execution bypasses protection, so that private information can be accessed by an unprivileged user. Initial advice was that worker machines should be patched first, then others. There are three variants: Spectre variants 1 and 2, and Meltdown (sometimes referred to as variant 3). Various updates have been circulated and patches are now available, though for Spectre variant 2 a microcode update is also required. Performance degradation has been measured in some sectors, but the UK is fairly well on with deployment. Performance figures will be discussed at the WLCG management meeting tomorrow, but only a small percentage drop seems to be observed. HEPSPEC results show similarly small impacts, though it is not clear whether particular experiments experienced more challenges.
It is not thought that GridPP was exposed, but in the longer term we need to factor this into resource considerations, and we would respond quickly if that assessment changes. Other communities have reported around 30% degradation; virtual machines seem more sensitive. We should measure HEPSPEC performance and monitor it after updates. It appears we are in good shape in terms of addressing this appropriately and in good time, with all recommended mitigating steps being taken. There is good sharing between our sites, which brings in a great deal of information; HEPSYSMAN is meeting tomorrow in Glasgow. The PMB commended the timely response within our community and the excellent ongoing communication. It is to be hoped that the performance degradation resulting from patching is not significant, but this will be monitored.

3. OC docs (Risk register to be discussed on 22.1.18)
=====================================================
PG circulated an email noting the actions required for preparing the OC documents and would like to discuss a complete first draft at the PMB on 29.01.18. This will allow a second draft to be produced by 02.02.18 and a final draft by the end of February. The Risk Register will be discussed on 22.01.18. Sections have previously been allocated to individual PMB members and are being worked on.
Q317 reports have now been received and PG will circulate them as soon as possible. He has requested that Q4 reports be submitted by the end of January or very early February.

4. ResearchFish
===============
ResearchFish was discussed once more. The review period runs from 5 February to 15 March and will involve PG updating the entries; PC can circulate information if required. This will require some updates from various members of the PMB.
There are two potential routes: either PG can wait for the experiments to upload all of their information and then link our grants to those entries, or we can mine DOIs for experiments with a connection to us.

5. GDPR
=======
GDPR (the General Data Protection Regulation) is new EU legislation coming into force; it carries significant fines for violations. DK will consider this from GridPP’s perspective to ensure we are compliant, as he is already doing for EGI and WLCG. DK will report by the end of February on any actions GridPP needs to take, mostly relating to Accounting and the CA.
ACTION 656.1: DK will report before the end of February on any actions GridPP should take to comply with GDPR.

6. CPU Efficiencies
===================
GS circulated information last week for December: ATLAS has recovered but is still lower than usual, and CMS is around 50%. DC confirmed work is urgently being undertaken on this. There are multiple causes and he will report to the PMB once this detailed review is complete.
RJ updated on ATLAS and advised that a bug had been introduced to the system which in some cases sent wrong numbers to jobs, causing inefficiencies; this has now been fixed.
Efficiencies for December show that LHCb is around 96% and NA62 is at 98%; NA62 runs Monte Carlo and, before submitting jobs, checks the responsiveness of sites and then makes a choice based on current performance. AM suggested checking the experiments’ rates, which would be useful to demonstrate our effectiveness.
ACTION 656.2: DC will report on CPU efficiencies.

7. Tier-1 procurement status
============================
AS was not present, but GS provided a very brief update. AS circulated an email at the end of December suggesting that disk pricing was good and might liberate funds for CPU or tape. An update is required on the trajectory of this, and GS will circulate it. If funding is released, the PMB should consider how best to spend it.

ACTION 656.3: GS will discuss Tier-1 procurement with Laura and Martin and report to the PMB.

8. GridPP40
===========
GridPP40 will take place in Pitlochry (Atholl Palace). Dell have expressed interest as a sponsor, having sponsored GridPP36 at Pitlochry in April 2016. The PMB agreed that Dell should be approached to secure their sponsorship. PG will consider themes to be covered in the meeting; PMB members were asked to suggest topics as soon as possible. PC reminded the PMB that an open invitation has been sent out to SKA and other external contacts, and that they should be invited to attend and speak if they have updates. DB noted David Salmon has expressed an interest in attending. Registration will open at the beginning of February.

ACTION 656.4: DB will contact external contacts to invite them to attend and/or contribute to GridPP40.

9. AOCB
=======
a) Availability and Reliability
Availability and reliability figures are circulated monthly for the Tier1 and Tier2s. At the bottom of the Tier1 figures there is a note to update Cath, who recently advised that this is no longer necessary as RAL was the only Tier1 actioning it. GS will continue to include these figures in the Tier1 manager report.

10. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
———————————————–
The first meeting of 2018 will take place on Friday. DC will use this as an opportunity to invite contributions for GridPP40.

SI-1 ATLAS Weekly Review and Plans (RJ)
—————————————
RJ updated on a major file loss at Rutherford before Christmas; a small number of the lost files were unique to RAL. It is not clear whether this was a Castor issue, but it may hasten the move away from Castor. Glasgow lost a disk server, which has now been recovered. There were some disk changes: Birmingham want to change, and there were some delays in moving ECDF storage to ATLAS, particularly on the ATLAS production site. Regarding the Castor file loss, the disk server lost to ATLAS failed because of a hardware problem caused by a faulty disk (discussed in the Tier1 Manager’s report).

SI-2 CMS Weekly Review and Plans (DC)
————————————-
CPU efficiencies have been discussed and are being worked on urgently.

SI-3 LHCb Weekly Review and Plans (PC)
————————————–
AM noted the loss of availability on 23 December, which will disappear from the reports now that it has been sorted. There was an issue for RAL over the weekend relating to LHCb merging jobs and Castor, which has also been resolved.

SI-4 Production Manager’s report (JC)
————————————-
1. “Spectre” and “Meltdown”

These are vulnerabilities in the design of the chip hardware and cannot be fully resolved by patching operating systems. However, patches are available which mitigate the problems.

“Meltdown” breaks down the boundary that prevents user applications from accessing privileged system memory. This vulnerability is confirmed to exist in all Intel processors since 1995, except for Intel Itanium and Intel Atom before 2013. It currently has one variant: rogue data cache load.

“Spectre” is similar but allows an attacker to use a CPU’s cache side channel to read arbitrary memory from a running process. Unlike Meltdown, Spectre is confirmed to affect Intel, AMD and ARM processors. It has two current variants: bounds check bypass and branch target injection.

There is an EGI wiki page on this topic for updates: https://wiki.egi.eu/wiki/SVG:Meltdown_and_Spectre_Vulnerabilities and a very good CERN security summary https://security.web.cern.ch/security/advisories/spectre-meltdown/spectre-meltdown.shtml.

The rest of this summary refers to the status at UK sites (and should not appear in the PMB minutes):

There are firmware releases coming out from the vendors. The releases do not appear to be following a clear pattern. Sites are picking them up. Some chips like Westmere are covered whilst others (e.g. E5620) are not.

Microcodes are being released ahead of RPMs. Sites are looking at these, but it is quite a manual and intensive process to keep checking and updating. Issues were seen when some sites simply rebooted hardware rather than power cycling it. Scripts are being shared to cross-check for the vulnerabilities once a patch is installed (see the sketch below).
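
As an illustration of the kind of cross-check mentioned above, the sketch below reads the kernel's own report of its mitigation status. This is a minimal example, not one of the scripts actually being shared, and it assumes a Linux kernel recent enough to expose the /sys/devices/system/cpu/vulnerabilities interface introduced alongside the mitigation patches.

    #!/usr/bin/env python3
    # Minimal sketch: print the kernel's view of its Spectre/Meltdown mitigation status.
    # Assumes a patched kernel exposing /sys/devices/system/cpu/vulnerabilities;
    # older, unpatched kernels simply do not have these files.
    from pathlib import Path

    VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")
    VULNS = ["spectre_v1", "spectre_v2", "meltdown"]

    def mitigation_status(name: str) -> str:
        entry = VULN_DIR / name
        if not entry.exists():
            return "interface not present (kernel predates the reporting patches)"
        return entry.read_text().strip()  # e.g. "Mitigation: PTI" or "Vulnerable"

    if __name__ == "__main__":
        for vuln in VULNS:
            print(f"{vuln:12s} {mitigation_status(vuln)}")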

So far we do not have concrete figures for the performance impact of the new firmware, in part because the fixes have come out in increments. HS06 testing is taking place on each of the generations installed at several sites and currently indicates “no significant degradation” in performance for WNs in HS06 (early on there were suggestions of 1%-3% depending on generation). The benchmarking working group are carrying out more systematic checks. Globally the suggestion is that performance impacts are less than feared: https://www.theregister.co.uk/2018/01/09/meltdown_spectre_slowdown/

I do not yet have a clear picture of results for IO-heavy processes in our community. Some distributed parallel file systems have been reported as slowing by between 10% and 40% for some IO-intensive applications. Nothing approaching those figures has yet been reported for the WLCG community. The situation is being monitored.

2. December’s Tier-2 A&R reports are now available

ALICE: http://wlcg-sam.cern.ch/reports/2017/201712/wlcg/WLCG_All_Sites_ALICE_Dec2017.pdf
All okay.

ATLAS: http://wlcg-sam.cern.ch/reports/2017/201712/wlcg/WLCG_All_Sites_ATLAS_Dec2017.pdf
RHUL: 82%:82%
Sheffield: 30%:34%

CMS: http://wlcg-sam.cern.ch/reports/2017/201712/wlcg/WLCG_All_Sites_CMS_Dec2017.pdf
All okay

LHCb: http://wlcg-sam.cern.ch/reports/2017/201712/wlcg/WLCG_All_Sites_LHCB_Dec2017.pdf
RHUL: 82%:82%
Sheffield: 70%:70%

Sheffield reported that the issue they encountered followed the lcg-CA 1.88-1 patch. Not all users were successfully mapped and this led to SAM failures.

3. There was a GDB at CERN last week: https://indico.cern.ch/event/651349/. The update by Graeme Stewart on the HSF Community White Paper may be of particular interest: https://indico.cern.ch/event/651349/contributions/2830237/attachments/1580497/2497360/cwp-gdb-january-2018.pdf.

4. A focus of the upcoming WLCG operations coordination meeting is the use and future of SAM tests.

5. There is an EGI operations meeting today. https://indico.egi.eu/indico/event/3245/. Most topical is the discussion on WMS decommissioning.

6. There is a HEPSYSMAN meeting this week in Glasgow: https://indico.cern.ch/event/686369/.

SI-5 Tier-1 Manager’s Report (GS)
———————————
· Network connectivity issues on Stack 9 in the UPS room overnight Thursday/Friday 21/22 December affected some internal systems (mainly monitoring). A member of staff attended overnight; a faulty transceiver was found to be the cause and was replaced. External services were largely unaffected.
· Operations over the Christmas and New Year holiday were generally stable although not completely quiet for the oncall team. There were some Castor disk server failures and staff did attend site over the holiday to replace failed disk drives.
· In the first week back in the New Year disk server gdss745 (AtlasDataDisk – D1T0) failed with loss of all data on the server. The problem was triggered by a failed drive. However, errors seen on other disk drives while this one was rebuilding led to loss of the RAID6 array. There were around 960,000 files on the server – around half of which were unique. A post mortem investigation of this incident will be carried out.
· The termination of the WMS service had been announced – with a drain (i.e. not accepting new jobs) planned to start on 1st February. However, as the WMS is not being used by VOs at the moment and security patches need to be applied urgently the drain was brought forward and was started last Wednesday (10th Jan).
· Some problems this last weekend (13/14 Jan): LHCb are running merging jobs and this caused problems for Castor. There was also a problem for Castor Atlas – the SRMs stopped working for some hours. They started working again on their own and the problem is not understood. There was also a problem with CMS AAA redirection which is not yet understood either.
· The ATLAS quota on Echo was increased by 500 TB to 4.6 PB on 2nd January.
· Ongoing patching for Spectre and Meltdown.

The availability figures for the Tier1 for December 2017 were:
Alice: 100%
Atlas: 99% (RAL-LCG2-ECHO “site” reporting 100%)
CMS: 99%
LHCb: 98%
OPS: 99.8%

Detailed causes of loss of availability:
LHCb: Problem on 23rd Dec. Monitoring problem at LHCb’s end – the lack of availability was seen across other Tier1s too. LHCb are expected to fix this up.
Atlas & CMS: Blips on the 4th and 6th (the latter due to maintenance).
4th: A problem in the RAL core network affected Atlas and CMS availabilities in the morning.
6th: The upgrade of the Tier-1 non-LHCb SRMs to CASTOR 2.1.16-18 affected Atlas and CMS availabilities.
OPS: 22nd Dec. The daily figure was 92.4%. A network problem in the early hours mainly affected monitoring.

I sent round the CPU and disk reports. For completeness, here are the main parts:

Global CPU efficiency (CPU time / wall time) was slightly down in December at 83.5%, compared with 83.6% in November. Of the 210970 HEP-SPEC06 months of available wall time, 206555 HEP-SPEC06 months were used (97.9% occupancy).

Experiment summary:

Experiment    CPU Time    Wall Time       Wait    Efficiency (%)
              (all in HEP-SPEC06 months)
ALICE         43841.92     48511.49     4669.57      90.37
ATLAS         41701.61     53389.48    11687.87      78.11
CMS           14542.06     28482.56    13940.50      51.06
LHCb          57329.17     59856.82     2527.66      95.78

LHC Total    157414.76    190240.35    32825.59      82.75
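
As an illustrative cross-check of the table above (a sketch only, using the figures as published): the efficiency column is CPU time divided by wall time, and the wait column is the difference between them, all in HEP-SPEC06 months.

    # Illustrative cross-check of the December CPU figures (HEP-SPEC06 months).
    # Efficiency (%) = 100 * CPU time / wall time; wait = wall time - CPU time.
    rows = {
        "ALICE": (43841.92, 48511.49),
        "ATLAS": (41701.61, 53389.48),
        "CMS":   (14542.06, 28482.56),
        "LHCb":  (57329.17, 59856.82),
    }

    total_cpu = sum(cpu for cpu, _ in rows.values())
    total_wall = sum(wall for _, wall in rows.values())

    for name, (cpu, wall) in rows.items():
        print(f"{name:9s} wait={wall - cpu:9.2f}  eff={100 * cpu / wall:5.2f}%")

    # The totals reproduce the 82.75% LHC Total efficiency quoted in the table.
    print(f"LHC Total wait={total_wall - total_cpu:9.2f}  eff={100 * total_cpu / total_wall:5.2f}%")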

Here is some information about disk deployment at RAL for December 2017. Units are TB (10^12 B).

Experiment    Allocation    Deployed (CASTOR)    Deployed (Echo)
ALICE              505.0              544.5                  -
ATLAS             7950.0             5256.9             3100.0
CMS               4304.0             3207.6             2500.0
LHCb              6162.0             5294.6             1500.0

LHC Total        18921.0            14303.5             7100.0

Non-LHC            845.0             1354.2                  -

Total            19766.0            15657.7             7100.0

SI-6 LCG Management Board Report of Issues (DB)
———————————————–
There is an MB meeting tomorrow and DB will report next week.

SI-7 External Contexts (PC)
———————————
PC advised that he, DC and Jeremy Yates attended a meeting at BEIS HQ last week to meet vendors, including Google and Alibaba Cloud. This was useful as the vendors were very open about the need for a more open model. One issue raised repeatedly related to large-scale use of resources, which has not so far come from the UK; PC noted we could do this so long as we were given credit. An estimated c. 3 FTEs for one year would facilitate this, and there is willingness if we can find the staff. DC confirmed Google is keen to work with us on this, with 15% of egress usually being free and perhaps more. It could be done with fewer staff if students were involved, and this has been raised with the CMS taskforce, though at GridPP level more manpower would be required. Oxford had a pilot project recently with Google which proved more technically challenging than expected, though these are issues Google is currently working on.

REVIEW OF ACTIONS
=================
644.3: AS put together a starting plan for staff ramp-down. (Update: a draft will be produced in January). Ongoing.
644.4: AS will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY?) Ongoing.
647.2: DB will circulate link for Data Management Plan once agreed. (Update: PPGP responded to PG advising they would provide comments; this is a departure from previous practice.) Done.
649.1: DB will write Introduction of OS documents. Ongoing.
649.2: PC will write Wider Context of OS documents. Ongoing.
649.3: PG will schedule a discussion of the Risk Register at a PMB meeting in December then update this in the OS documents. Ongoing.
649.4: GS and AS will write the Tier1 Status section of OS documents. Ongoing.
649.5: JC will write Deployment Status section of OS documents with input from PG. Ongoing.
649.6: RJ, DC and AM will write LHC section of User Reports in OS documents. Ongoing.
649.7: JC will write Other Experiments section of User Reports in OS documents with input from DC and PG. Ongoing.
654.1: PG will telephone Mark Sutton to arrange for him to have access to GridPP. Ongoing.
654.2: RJ will investigate ATLAS CPU efficiencies and report to PMB. Done.
654.3: AS will prepare a report summarising and updating the current staffing situation. Done.
655.1: DC will discuss migration from WMS with T2K. Ongoing.
655.2: AS to prepare a report on failure of the generator to come up after a recent issue. Ongoing.
655.3: PG to consider the agenda and date for Tier1 review and include disaster recovery plans. Ongoing.

ACTIONS AS OF 15.01.18
======================
644.3: AS put together a starting plan for staff ramp-down. (Update: a draft will be produced in January). Ongoing.
644.4: AS will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY?) Ongoing.
649.1: DB will write Introduction of OS documents. Ongoing.
649.2: PC will write Wider Context of OS documents. Ongoing.
649.3: PG will schedule a discussion of the Risk Register at a PMB meeting in December then update this in the OS documents. Ongoing.
649.4: GS and AS will write the Tier1 Status section of OS documents. Ongoing.
649.5: JC will write Deployment Status section of OS documents with input from PG. Ongoing.
649.6: RJ, DC and AM will write LHC section of User Reports in OS documents. Ongoing.
649.7: JC will write Other Experiments section of User Reports in OS documents with input from DC and PG. Ongoing.
654.1: PG will telephone Mark Sutton to arrange for him to have access to GridPP. Ongoing.
655.1: DC will discuss migration from WMS with T2K. Ongoing.
655.2: AS to prepare a report on failure of the generator to come up after a recent issue. Ongoing.
655.3: PG to consider the agenda and date for Tier1 review and include disaster recovery plans. Ongoing.
656.1: DK will report before the end of February on any actions GridPP should take to comply with GDPR.
656.2: DC will report on CPU efficiencies.
656.3: GS will discuss Tier-1 procurement with Laura and Martin and report to the PMB.
656.4: DB will contact external contacts to invite them to attend and/or contribute to GridPP40.