GridPP PMB Meeting 633

GridPP PMB Meeting 633 (15/05/17)
=================================
Present: Dave Britton(Chair), Tony Cass, Pete Clarke, Jeremy Coles, David Colling, Tony Doyle, Roger Jones, Steve Lloyd, Andrew McNab, Andrew Sansum, Gareth Smith, Louisa Campbell (Minutes).

Apologies: Dave Kelsey, Pete Gronbech.

1. Quarterly Reports
====================
PG confirmed that all Q416 reports have been received and a summary has been circulated. For Q117 he is still waiting for the Tier-1 and Ops reports, which are required to prepare the OC documents. JC's report depends on other inputs and is almost ready for submission. AS has the finance and staff data prepared for inclusion, and GS is working on the metrics for the Tier-1 report, which will be submitted shortly.

2. OC Docs
==========
There has been very limited input to the OC document so far and PG issued a reminder that this is a matter of urgency. DB reiterated that this needs to be a high priority and that members should provide text for their relevant sections in the next few days.

3. AOCB
=======
a) GridPP Endorsement of WISE SCI at TNC17 (Linz)
DK circulated an email requesting that GridPP endorse the WISE SCI Version 2 document. DB summarised that it sets out a process that collaborating infrastructures should adhere to, notwithstanding that individual collaborations will have additional security measures. It seeks to assess readiness levels in meeting security requirements, e.g. having a team member allocated to deal with security; each team should go through the items on the list and self-assess (operational security, incident response, participant responsibilities, etc.). DK is seeking the endorsement, together with a supporting quote, for a TNC17 conference session, as the document sets a bar for infrastructures to collaborate. There will be a ceremony and a presentation from WISE at the conference, which DK will attend as the sole GridPP representative. DB has drafted a quote and circulated it for comment. There were no objections and DB will circulate the endorsement and proposed statement for DK to progress.

b) HW costs
AS summarised that he and PG have assessed HW purchase planning to set approximate costs for Tier-1 procurement, based on previous experience and on the discrepancies between Tier-1 prices, i.e. what was planned versus what was paid. Most of the discrepancy seems related to the HW configuration selected: memory and GB-per-core provision are the main drivers. It is likely the CPU price can be reduced to around £7 per HEP-SPEC06 next time. However, there should be an open discussion within the Tier-1 to decide on resource requirements and how much resource should be allocated to each core, since this is the big price driver. The main interest of the Hardware Advisory Group (HAG) in the past was to look at procurement, but since this is no longer feasible due to procurement rules it may be best to revert to the previous system and focus the HAG on technical definition. For disk the situation is much more complex. AS has re-jigged the price points and can progress. DB agreed that a HAG should be instigated earlier in the process, i.e. now. The tender process on this occasion may have impacted matters, but this is manageable. It was suggested that the internal Tier-1 discussions should be integrated into the HAG discussions, with the staff involved. (A rough illustration of the cost arithmetic follows the action below.)
ACTION 633.1: AS will put together a proposal for HAG.
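For illustration only, a minimal sketch of the cost arithmetic discussed above. Only the ~£7 per HEP-SPEC06 base figure comes from the discussion; the HEP-SPEC06-per-core score, memory price and GB-per-core options are hypothetical placeholders, not Tier-1 procurement figures.

# Hypothetical sketch of how memory per core moves the overall CPU price.
# Only the ~GBP 7 per HEP-SPEC06 base figure comes from the minutes above;
# every other number is a placeholder assumption.
GBP_PER_HEPSPEC06 = 7.0      # indicative base CPU price point
HEPSPEC06_PER_CORE = 10.0    # assumed benchmark score per core
GBP_PER_GB_RAM = 5.0         # assumed cost per GB of memory

def estimated_cpu_cost(target_hepspec06, gb_per_core):
    """Very rough CPU procurement cost: a base price per HEP-SPEC06 plus
    the memory provisioned per core, the main price driver noted above."""
    cores = target_hepspec06 / HEPSPEC06_PER_CORE
    return (target_hepspec06 * GBP_PER_HEPSPEC06
            + cores * gb_per_core * GBP_PER_GB_RAM)

# Compare two memory configurations for a notional 50k HEP-SPEC06 purchase.
for gb_per_core in (2, 4):
    print(f"{gb_per_core} GB/core: ~GBP {estimated_cpu_cost(50_000, gb_per_core):,.0f}")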

4. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
———————————————–
There was a meeting last week but this was straightforward and covered the standing items with nothing significant to report.

SI-1 Dissemination Report (SL)
——————————
SL reported that the dissemination post has now been advertised.

SI-2 ATLAS Weekly Review and Plans (RJ)
—————————————
Nothing of significance to report.

SI-3 CMS Weekly Review and Plans (DC)
————————————-
Nothing of significance to report. Work on understanding the ongoing difficulties continues, and it may be that the evolution group should take this forward.

SI-4 LHCb Weekly Review and Plans (PC)
————————————–
AM had nothing operational to report. There is a computing review this week which may result in something to report.

SI-5 Production Manager’s report (JC)
————————————-
1. We have the Tier-2 availability/reliability (A/R) figures for April 2017 (figures below are availability : reliability). The results are:

ALICE: http://wlcg-sam.cern.ch/reports/2017/201704/wlcg/WLCG_All_Sites_ALICE_Apr2017.pdf
  All okay.

ATLAS: http://wlcg-sam.cern.ch/reports/2017/201704/wlcg/WLCG_All_Sites_ATLAS_Apr2017.pdf
  QMUL: 82% : 83%
  Birmingham: 86% : 86%
  RALPP: 86% : 86%

CMS: http://wlcg-sam.cern.ch/reports/2017/201704/wlcg/WLCG_All_Sites_CMS_Apr2017.pdf
  RALPP: 78% : 78%

LHCb: http://wlcg-sam.cern.ch/reports/2017/201704/wlcg/WLCG_All_Sites_LHCB_Apr2017.pdf
  QMUL: 65% : 66%
  RALPP: 72% : 72%

Site responses:
  QMUL: No response.
  Birmingham: Power issues over Easter.
  RALPP: Firewall bypass network link went down the day before the Easter break and remained down until the day after Easter.

2. There was a GDB last week (https://indico.cern.ch/event/578986/). Topics covered included an overview of HEPiX and an update on the Data Management Steering Group (will report at the WLCG workshop). The afternoon sessions were on data federations.

3. The GOCDB information review is now complete (https://www.gridpp.ac.uk/wiki/2017_GOCDB_REVIEW). GridPP sites are up-to-date.

4. A review of GridPP site storage approaches and enablement has started (overview at https://www.gridpp.ac.uk/wiki/Storage_site_status).

5. perfSONAR version 4 was released on 17th April. The initial updates/installs have had varying success on our sites.

6. The EGI Conference 2017 (and Indigo summit) took place last week. It is worth skimming the agenda to get a sense of EGI directions (https://indico.egi.eu/indico/event/3249/timetable/#all.detailed).

SI-6 Tier-1 Manager’s Report (GS)
———————————
Infrastructure:
• As previously reported, on Friday 28th April there was a problem with the UPS for building R89. The UPS switched itself into “bypass” mode – which effectively means we have no UPS. We have run in this way since then. On Thursday and Friday last week the old UPS was removed and a replacement unit put in its place. The new one is being commissioned now and will be turned on shortly.

Castor:
• On Tuesday of last week (9th May) the central Castor components were updated to version 2.1.16. Then, on Thursday, the LHCb instance along with the LHCb SRMs was upgraded to version 2.1.16. (The LHCb SRMs had previously been upgraded and then downgraded again as a result of problems encountered.) So far this has worked well, although the load from LHCb has not been particularly high. The Castor team are at a face-to-face meeting at CERN this week and it is likely a stress test will be carried out in conjunction with LHCb.

Networking:
• Central networking are investigating a problem with the site firewall that seems to affect some data flows; in particular it has been affecting videoconferencing. It is not clear if this could be having any effect on our services. Our main data flows do not go through the firewall, although data requests made by the worker nodes do.

Availabilities for April 2017:

These figures were:
Alice: 100%
Atlas: 89%
CMS: 90%
LHCb: 92%
OPS: 98%
(I have again included the OPS availability figures although these are not in the WLCG reports.)

Overall these are clearly a very poor set of results. To pick out some specific points:
• Atlas: Very poor on 30/4, when the AtlasDataDisk became full and tests failed. It is not clear if this is our fault, owing to a slower file deletion rate in Castor compared to other sites. However, there were intermittent test failures through much of the month. There was a further specific problem on 12th April with incorrect certificate ownership on some disk servers. I note we did upgrade the Atlas SRM, as for LHCb, but did not downgrade it again (unlike LHCb).
• CMS: Very bad around 8-10 April: a problem with a hypervisor caused problems for the CEs and Argus, which affected the CMS CE (glexec) tests. Also, there was a problem with the Castor transfermanager that was not resolved quickly. The final quarter of the month saw a lot of intermittent SRM test failures. These were not investigated fully as effort has been focused on the Castor upgrade.
• LHCb: Intermittent SRM test failures throughout the month. We were aware of LHCb problems with Castor during their stripping/merging campaign and again effort was concentrated on the Castor upgrade as the longer-term solution.

Job Efficiencies – April 2017

Firstly, here is Andrew Lahiff's report for the month as far as the LHC experiments go:

++++++++++++++++++++++++++++++
Global CPU efficiency (CPU time / wall time) was down in April at 67.0%, compared with 72.5% in March. Of 227937 HEP-SPEC06 months available wall time, 214934 HEP-SPEC06 months were used (94.3% occupancy). Experiment summary:

Experiment    CPU Time     Wall Time      Wait      % Efficiency
              (all times in HEP-SPEC06 months)
ALICE          29006.30     42771.08    13764.78      67.82
ATLAS          73308.50    109347.02    36038.52      67.04
CMS            21758.12     37109.66    15351.55      58.63
LHCb           13312.56     17913.99     4601.42      74.31

LHC Total     137385.48    207141.75    69756.27      66.32
++++++++++++++++++++++++++++++
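The derived columns above follow directly from CPU time / wall time. As a quick self-contained check (a minimal Python sketch using only the figures quoted in the report):

# Recompute the derived columns from the reported figures
# (all values in HEP-SPEC06 months, taken from the table above).
usage = {
    "ALICE": (29006.30, 42771.08),
    "ATLAS": (73308.50, 109347.02),
    "CMS":   (21758.12, 37109.66),
    "LHCb":  (13312.56, 17913.99),
}

total_cpu = total_wall = 0.0
for exp, (cpu, wall) in usage.items():
    total_cpu += cpu
    total_wall += wall
    print(f"{exp}: wait = {wall - cpu:.2f}, efficiency = {100 * cpu / wall:.2f}%")

print(f"LHC total efficiency = {100 * total_cpu / total_wall:.2f}%")  # ~66.32%

# Occupancy quoted in the report: used wall time / available wall time.
print(f"Occupancy = {100 * 214934 / 227937:.1f}%")                    # ~94.3%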

Factors possibly affecting efficiencies:
• Castor: We know that we have had ongoing problems here. In particular for LHCb. We note that LHCb efficiencies have recovered since the stripping/merging campaign finished.
• CMS: Several changes have been made by CMS: lazy downloads turned off, and a move from rfcp (rfio) to gfal2 (gridFTP) for data uploads. A comparison shows that our CMS efficiencies for April were in the pack of the other Tier-1s.
• Atlas: Here we are around the bottom of the pack (some plots attached).
• We have been making changes to the batch system itself. Worker nodes have been progressively moved from running SL6 to running SL7 with the batch jobs in SL6 containers.
• We cannot exclude an effect from the site firewall problem referred to above.

(DC noted for CMS that issues are experienced when jobs running at RAL read data remotely, but not when data is read locally; regarding lazy downloads, Andrew Lahiff has advised that this discrepancy has been reduced after a recent update.)
DB thanked GS for the input and requested the situation be closely monitored in the near future, particularly availability which should see improvements post-Castor upgrade.

SI-7 LCG Management Board Report of Issues (DB)
———————————————–
There has been no meeting and nothing to report.

SI-8 External Contexts (PC)
———————————
PC attended the Scientific Forum and advised there was little of note apart from three talks. Ian Byrne discussed cooperation between CERN and SKA; SKA members from Manchester will soon visit CERN to liaise with Particle Physics representatives.

REVIEW OF ACTIONS
=================
NEW ACTION: LC will check 16-18th April for GridPP40 at Durham (beginning of the week; CHEP is later in the week, and we will wait to see when the IOP meeting is). (UPDATE: Durham cannot accommodate these dates; all potential dates have been exhausted and an alternative venue should be considered.) Done.
630.2: DB and PG will continue to work on metrics and funding strategies at the macro level. Ongoing.
630.3: DB will tweak his metrics and funding model based on CPU. Ongoing.
631.1: PG will create a summary spreadsheet of the 2016 Experiment Review figures to extract important figures for the OC. Done.
631.2: ALL to work on OC documents for submission by end May. Ongoing.
632.1: DB will work on the Introduction of OC doc. Ongoing.
632.2: DB and PC will work on Wider Context section of OC doc. Ongoing.
632.3: PG will work on PI5 status and Risk Register of OC doc. Ongoing.
632.4: AS will work on Tier-1 section of OC doc. Ongoing.
632.5: JC will work on Deployment Status of OC doc. Ongoing.
632.6: RJ will work on ATLAS section of OC doc. Ongoing.
632.7: DC will work on CMS section of OC doc. Ongoing.
632.8: AM will work on LHCb section of OC doc. Ongoing.
632.9: JC and DC will work on Other VOs section of OC doc. Ongoing.
632.10: SL will work on Impact and Dissemination section of OC doc. Ongoing.
632.11: GS will report next week on low efficiencies experienced at RAL in April. Done.

ACTIONS AS OF 15/05/17
======================
630.2: DB and PG will continue to work on metrics and funding strategies at the macro level. Ongoing.
630.3: DB will tweak his metrics and funding model based on CPU. Ongoing.
631.1: PG will create a summary spreadsheet of the 2016 Experiment Review figures to extract important figures for the OC. Done.
631.2: ALL to work on OC documents for submission by end May. Ongoing.
632.1: DB will work on the Introduction of OC doc. Ongoing.
632.2: DB and PC will work on Wider Context section of OC doc. Ongoing.
632.3: PG will work on PI5 status and Risk Register of OC doc. Ongoing.
632.4: AS will work on Tier-1 section of OC doc. Ongoing.
632.5: JC will work on Deployment Status of OC doc. Ongoing.
632.6: RJ will work on ATLAS section of OC doc. Ongoing.
632.7: DC will work on CMS section of OC doc. Ongoing.
632.8: AM will work on LHCb section of OC doc. Ongoing.
632.9: JC and DC will work on Other VOs section of OC doc. Ongoing.
632.10: SL will work on Impact and Dissemination section of OC doc. Ongoing.
633.1: AS will put together a proposal for HAG.