GridPP PMB Meeting 618

GridPP PMB Meeting 618 (19.12.16)
=================================
Present: Dave Britton (Chair), Tony Cass, Pete Clarke, Jeremy Coles, David Colling, Tony Doyle, Pete Gronbech, Roger Jones, Dave Kelsey, Steve Lloyd, Andrew McNab, Andrew Sansum, Gareth Smith, Louisa Campbell (Minutes).

Apologies:

1. Tier-1 Procurement
=====================
AS provided an update on the Tier-1 Procurement.

2. AOCB
=======
a) PDG Gender Survey
The PDG report on gender equality in HPC was noted with interest.

3. Standing Items
=================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
Nothing of significance to report.

SI-1 Dissemination Report (SL)
------------------------------
## GridPP Engagement Officer Notes for PMB

### GridPP Case Studies

These – including a bonus case study from EGI [1] – are being written up as a public document with doc ID GridPP-ENG-001-CaseStudies – link, DOI, etc. to follow.

[1] https://www.egi.eu/use-cases/research-stories/cetaceans/

SI-2 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
Nothing of significance to report.

SI-3 CMS Weekly Review and Plans (DC)
-------------------------------------
Nothing of significance to report.

SI-4 LHCb Weekly Review and Plans (PC)
--------------------------------------
Nothing of significance to report.

SI-5 Production Manager’s report (JC)
-------------------------------------
Operations updates/news for the last week:

1. The DNS record for planet.gridpp.ac.uk has moved to the new VM host.
2. As requested at the EGI OMB, we are putting forward one or two sites to support the (DPM/dCache) accounting pilot (run by John Gordon).
3. We have received notification of the November T2 R/A figures.
ALICE (http://wlcg-sam.cern.ch/reports/2016/201611/wlcg/WLCG_All_Sites_ALICE_Nov2016.pdf)
– All okay

ATLAS (http://wlcg-sam.cern.ch/reports/2016/201611/wlcg/WLCG_All_Sites_ATLAS_Nov2016.pdf)
– RHUL 89%:91%
– Glasgow 85%:85%

CMS (http://wlcg-sam.cern.ch/reports/2016/201611/wlcg/WLCG_All_Sites_CMS_Nov2016.pdf)
– All okay

LHCb (http://wlcg-sam.cern.ch/reports/2016/201611/wlcg/WLCG_All_Sites_LHCB_Nov2016.pdf)
– Liverpool 33%:33%
– Glasgow 28%:28%

Explanations:

RHUL: There was a DPM database move scheduled during the month.

LHCb results were poor (for Liverpool and Glasgow) due to SRM tests which were false positives. This has been corrected and re-computations are being requested.

Glasgow (ATLAS): 1-2 November – downtime due to a power failure. 7 November – DPM pool node disk042 caused issues with the DPM headnode. 17-18 November – DPM pool nodes disk042/disk070 caused issues with the DPM headnode, freezing SRM.

4. The EGI Security Policy Group has produced a revised draft version of the top-level Security Policy bringing the document up to date in terms of terminology and with the current set of security policy documents. This is being reviewed.

5. A DUNE user submitted 100,000 jobs last week, which revealed that the DIRAC instance at IC could only handle 41,000. A problem was also seen running the DUNE work on CernVM, as the software tries to check the kernel version before starting. These issues are being followed up.

6. Preliminary results from the WLCG lightweight sites survey are available (https://indico.cern.ch/event/540424/contributions/2194899/subcontributions/212150/attachments/1381450/2100252/LW-sites-161201-v11.pdf), but the survey remains open and all sites are encouraged to complete it. (The UK response so far has been good.) The conclusion thus far is that an early area to target is increased use of shared repositories (e.g. for OpenStack, Docker and Puppet images).

For those wanting a more fine-grained view, minutes from the last weekly ops meeting can be found here: https://indico.cern.ch/event/593440/attachments/1383735/2109502/OpsMinutes-06-12-2016.pdf

SI-6 Tier-1 Manager’s Report (GS)
---------------------------------
Here is a brief report from the Tier1.

Castor:
– Firmware updates on the RAID cards in the ClusterVision ’13 batch of disk servers were successfully carried out last Wednesday.
– The Castor 2.1.15 update is scheduled for January. The dates are announced via the GOC DB.
– There were load problems on the CMS Tape instance last week. This affected availability tests.

Tape:
– Migration of LHCb data from ‘C’ to ‘D’ tapes ongoing. Now a little over 50% done. Around 440 out of the 1000 tapes still to do.

Services:
– There have been changes to the FTS services (test and production) to fix a certificate problem that was triggered by a recent FTS update.

Christmas on-call plans follow the usual pattern. These have been announced in the Tier1 BLOG:

RAL Tier1 – Plans for Christmas & New Year Holiday 2016/17

SI-7 LCG Management Board Report of Issues (DB)
-----------------------------------------------
No report.

SI-8 External Contexts (PC)
---------------------------
Nothing of significance to report.

REVIEW OF ACTIONS
=================
610.1: AS/GS Produce suggestions for one or more metrics that will summarise the Tier-1 network availability/performance. Ongoing.
612.3: PG will determine which small sites can undertake procurement this FY. (Update: DB and RJ both had notices from the JES system confirming their applications have been approved. PG will discuss with each PI to establish figures and remind them that imminent action must be taken). Ongoing.
616.2: AS will update the PMB on Tier-1 procurement by next week. Ongoing.
616.3: AS & GS to undertake a sanity check on Janet. (UPDATE: Routing was checked to ensure Tier-1 traffic did not go out over Janet rather than OPN. STF site transfers were also checked and will be summarized soon; most data is non-UK destined. The federated access and flow-level information still needs to be examined and understood at a VO level. A large test debug flow was picked up and eliminated). Ongoing.
616.4: DB and SL will discuss how best to progress replacement of TW’s role. Ongoing.
617.1: ALL to review and comment on the Tier-2 Evolution document this week to agree a final version next week.
617.2: DC will append a statement to the Tier-2 Evolution document on CMS requirements.
617.3: RJ will establish a priority order for resources to address issues arising.
617.4: JC will document at which sites and during which periods CPU is idle and could be used elsewhere, and will summarise in an email to the PMB.
617.5: PG will discuss with Ulrich requirements for GANGA going forward and report back to the PMB.
617.6: TC will discuss with Romain whether to submit an abstract for CyberUK 2017.
617.7: SL will look into possible saturation at 10% level for LHCb jobs and determine if more resources should be allocated.

ACTIONS AS OF 19.12.16
======================
610.1: AS/GS Produce suggestions for one or more metrics that will summarise the Tier-1 network availability/performance. Ongoing.
612.3: PG will determine which small sites can undertake procurement this FY. (Update: DB and RJ both had notices from the JES system confirming their applications have been approved. PG will discuss with each PI to establish figures and remind them that imminent action must be taken). Ongoing.
616.2: AS will update the PMB on Tier-1 procurement by next week. Ongoing.
616.3: AS & GS to undertake a sanity check on Janet. (UPDATE: Routing was checked to ensure Tier-1 traffic did not go out over Janet rather than OPN. STF site transfers were also checked and will be summarized soon; most data is non-UK destined. The federated access and flow-level information still needs to be examined and understood at a VO level. A large test debug flow was picked up and eliminated). Ongoing.
616.4: DB and SL will discuss how best to progress replacement of TW’s role. Ongoing.
617.1: ALL to review and comment on the Tier-2 Evolution document this week to agree a final version next week.
617.2: DC will append a statement to the Tier-2 Evolution document on CMS requirements.
617.3: RJ will establish a priority order for resources to address issues arising.
617.4: JC will document at which sites and during which periods CPU is idle and could be used elsewhere, and will summarise in an email to the PMB.
617.5: PG will discuss with Ulrich requirements for GANGA going forward and report back to the PMB.
617.6: TC will discuss with Romain whether to submit an abstract for CyberUK 2017.
617.7: SL will look into possible saturation at 10% level for LHCb jobs and determine if more resources should be allocated.