GridPP PMB Meeting 636 (12.06.17)
=================================
Present: Pete Gronbech (Chair), Tony Cass, David Colling, Roger Jones, Dave Kelsey, Gareth Smith, Louisa Campbell (Minutes).

Apologies: Dave Britton, Pete Clarke, Jeremy Coles, Tony Doyle, Steve Lloyd, Andrew McNab, Andrew Sansum.

1. OC Talk
====================
PG has added a finance page to DB’s proposed talk. DK noted that the travel budget was slightly over, but this resulted from c. £5K of invoices accrued in April, which may leave next year’s budget running slightly under. PG may insert an explanatory sentence on this. The resource/capital split is down to changes: we were asked to spend £390K, and the underspend on staffing left the net spend close to what was originally planned. Hardware grants at the Tier-2s were spent.

2. AOCB
=======
None.

3. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
———————————————–
No report submitted.

SI-1 Dissemination Report (SL)
——————————
No report submitted.

SI-2 ATLAS Weekly Review and Plans (RJ)
—————————————
Nothing of significance to report.

SI-3 CMS Weekly Review and Plans (DC)
————————————-
Nothing of significance to report.

SI-4 LHCb Weekly Review and Plans (PC)
————————————–
No report submitted.

SI-5 Production Manager’s report (JC)
————————————-
No report submitted.

SI-6 Tier-1 Manager’s Report (GS)
———————————
Castor:
———
– The upgrade of Castor to version 2.1.16 was completed successfully. One problem emerged for LHCb where the SRM returned a TURL that would not work for xroot access owing to an incorrect hostname. This was resolved by installing an xroot manager on the (LHCb) stager.

Echo:
—–
– Atlas carried out a large deletion test of files in Echo around the end of May. Overall the results were pleasing with the bulk of the files deleted successfully and in a reasonable timescale. However, some (one to two thousand) files failed to delete. Most of these were cleaned up manually leaving a handful for debugging purposes. However, the system subsequently also deleted these remaining files without manual intervention.

Batch and Services:
-------------------
– All CEs now migrated to use the load balancers in front of the argus service.
– A start has been made enabling XRootD gateways on worker nodes for Echo access. This will be ramped up to one batch of worker nodes.
– Batch access has been enabled for LIGO and the MICE pilot role.

Networking:
—————
– We are tracking the ongoing problem with the site firewall that affects data flows.
– Implementation of the third 10Gbit link for the OPN to CERN is ongoing. We are hoping to do this on Wednesday (14th June).

Availabilities for May 2017:
———————————
Alice: 100%
Atlas: 90%
CMS: 84%
LHCb: 99%

– For Atlas: The availability was badly affected by disk areas being full at the start of the month (e.g. zero percent availability on the 1st of May). This can be regarded as being Atlas’ problem, although the rate at which Castor has been able to delete files has also played a role. In the second half of the month availability was good.
– For CMS: There was an ongoing rate of SRM test failures through the month, which is the main cause of the very poor availability. This was exacerbated by a couple of other problems: one weekend when there were specific problems with the CMS Castor instance, and a day when the CE “glexec” tests largely failed (linked to argus problems).

Efficiencies Comment:
—————————-
Here is Andrew Lahiff’s Job Efficiency Report:
+++++
Global CPU efficiency (CPU time / wall time) was up in May at 77.3%, compared with 67.0% in April. Of 223885 HEP-SPEC06 months available wall time, 212546 HEP-SPEC06 months were used (94.9% occupancy). Experiment summary:

Experiment     CPU Time   Wall Time       Wait   Efficiency (%)
               (CPU Time, Wall Time and Wait in HEP-SPEC06 months)
ALICE          13709.33    19911.03    6201.70            68.85
ATLAS          84993.16    96536.33   11543.16            88.04
CMS            20244.83    39687.68   19442.85            51.01
LHCb           32482.93    41787.65    9304.72            77.73

LHC Total     151430.25   197922.69   46492.44            76.51
+++++
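
For reference, the efficiency and occupancy figures quoted above follow directly from the HEP-SPEC06 month values (efficiency = CPU time / wall time; the Wait column appears to be wall time minus CPU time; occupancy = used / available wall time). The short Python sketch below simply re-derives them from the numbers in the table; it is illustrative only and not part of the report.

# Sanity check of the May 2017 efficiency figures, using the
# HEP-SPEC06 month values quoted in the table above.
usage = {
    # experiment: (CPU time, wall time) in HEP-SPEC06 months
    "ALICE": (13709.33, 19911.03),
    "ATLAS": (84993.16, 96536.33),
    "CMS":   (20244.83, 39687.68),
    "LHCb":  (32482.93, 41787.65),
}

for exp, (cpu, wall) in usage.items():
    # Efficiency = CPU time / wall time; "wait" is the unused wall time.
    print(f"{exp:5s}  efficiency {100 * cpu / wall:5.2f}%  wait {wall - cpu:9.2f}")

total_cpu = sum(cpu for cpu, _ in usage.values())
total_wall = sum(wall for _, wall in usage.values())
print(f"LHC total efficiency {100 * total_cpu / total_wall:.2f}%")   # ~76.5%

# Farm occupancy covers all VOs, not just the LHC experiments.
available_wall = 223885.0   # HEP-SPEC06 months of wall time available in May
used_wall = 212546.0        # HEP-SPEC06 months of wall time used
print(f"Occupancy {100 * used_wall / available_wall:.1f}%")          # ~94.9%
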
Some additional notes on CPU efficiencies:
* Since the Castor 2.1.16 update (23rd May) Atlas has seen good efficiencies of above 90%.
* Since the problem where the SRM returned incorrect information for xroot transfers was resolved for LHCb at the end of May, their efficiency has been up in the high nineties.
* CMS job efficiencies remain a concern (around 50%).

DC will be investigating the data and will report more fully in due course. He noted several ongoing issues relating to scheduling and will contact RAL. Some interesting numbers are emerging, and RAL and the Tier-1 team are looking at differences between these and the CMS figures. DC reiterated preliminary figures highlighting differences in efficiency between on-site jobs and remote jobs. Work is ongoing in this regard.

SI-7 LCG Management Board Report of Issues (DB)
———————————————–
No report submitted.

SI-8 External Contexts (PC)
———————————
No report submitted.

REVIEW OF ACTIONS
=================
630.2: DB and PG will continue to work on metrics and funding strategies at the macro level. Ongoing.
630.3: DB will tweak his metrics and funding model based on CPU. Ongoing.
633.1: AS will put together a proposal for HAG. Ongoing.
635.1: PG will make all the final suggested changes and submit to OC documents. Done.

ACTIONS AS OF 12.06.17
======================
630.2: DB and PG will continue to work on metrics and funding strategies at the macro level. Ongoing.
630.3: DB will tweak his metrics and funding model based on CPU. Ongoing.
633.1: AS will put together a proposal for HAG. Ongoing.