GridPP PMB Meeting 620 (16.01.17)
=================================
Present: Dave Britton(Chair), Tony Cass, Pete Clarke, Jeremy Coles, David Colling, Pete Gronbech, Steve Lloyd, Andrew McNab, Andrew Sansum, Gareth Smith, Louisa Campbell (Minutes).

Apologies: Tony Doyle, Roger Jones, Dave Kelsey

1. Tier-1 Review
================
Agenda suggestions – AS suggests finishing around 4pm to allow return travel. Topics: strategic aspects of GridPP5; departmental aspects; staffing plans and issues; the challenge of evolution into the second half of the project. DB noted it should be remembered this is a Tier-1 review: the amount of context should be minimal and it should focus on Tier-1 performance, including the Tier-1 manager role and the production team (after years of stability this has recently undergone big changes) – who are the critical people? There is concern that a lot of work and development depends on Alastair and Andrew; given AL’s change in role this highlights a potential vulnerability. Staffing matters and some core development projects should be covered before lunch. CEPH: there are a number of CEPH topics and other important subjects to be discussed. DB reiterated that the review should not get bogged down in technical details; what is vitally important is the timeline and the risks that remain in the project.
Cloud: a lot has been done by other projects – AS would like Ian Collier to make that clear.

IPv6 is important, there is an April deadline – this could be bundled in with the Networking section as a lot of the network iterations are related to IPv6 and lots of people involved (e.g. DK, Alastair, Martin…).
Workload management: Andrew Lahiff is doing a lot in this area, but it is not clear what the long-term vision is here, or how we plug into other parts of SCD. Gareth will summarise the last 12 months: what has been going on, the operational issues, capacity planning and hardware status. DB noted that, in some ways, this is something we do collectively; how the Tier-1 implements the plan is more relevant.
Operational issues around the CASTOR/CEPH move; hardware generations – how they are panning out. Rob – Castor status and plans.
Short term operations, performance, plans for upgrades, consolidation, tape planning.
CERN’s move away from CASTOR to CTA.
Catalin can discuss grid services for WLCG: CVMFS, LFC, Conditions, Frontier. Can we run all these services with a reduced staff count?
CVMFS is growing in operational use and effort required.

We are losing 3 FTEs; AS will do some modelling to show what this means and how we address it.
LFC is basically a T2K thing; it has an Oracle database behind it.
Where to handle security.
Slide or two on config management.
Quattor has been used for many years; it is now called Aquilon.
There is a lot to cover but we must have time for discussion and questions.

2. Support for MoBrain
======================
We should formalise our offer to MoBrain – they are keen to engage with wider communities in ways that may lead to funding in due course, and we want to be a good citizen in the EGI forum. There is some slight risk of over-contention, but very little effort is required to continue the commitment of opportunistic access to ~50 cores. It is therefore strategically good to continue.

3. Custodial tape for the SoLid Experiment
==========================================
DC had raised the issue of tape storage for the SoLid experiment – up to 1PB over 5 years. Their computing model is in development and they lack custodial tape storage. Various sites are involved: IC, Oxford, Bristol etc. They run data minimisation in Brussels and it would be really useful to store data at the Tier-1. What is the likely ramp-up over the next 3 years? Data taking starts at the end of this year; by the end of GridPP5 they will have ~0.5PB.

We do not know what the situation will be beyond 2020 and would not look to charge at this initial stage. We could warn SoLid that other groups such as DiRAC are paying for space – £30K/year for 5PB was quoted for DiRAC. The SoLid experiment is supported in the UK to the extent of a few fellowships, so it is similar to NA62.

4. UKT0
=======
Little progress to report. There have been several emails from BEIS asking for different things, and Charlotte Jamieson requested a list of things on which we could spend money in 2017.
PC suggested we need money to pump-prime UKT0. We could ask for £1.5M, with some (~50%) going to R89 to help provide infrastructure for others – e.g. our promise to LSST could be diverted to RAL.
The other half (£750K) could go to other sites to support non-PP work, e.g. Cambridge and Edinburgh for data centres for astronomy.
AS and David Corney could add up ALC? We must make sure we give a consistent story.
We could, for example, buy DiRAC (and others) tape with the money.

5. GridPP communications in event of an incident
================================================
There was some discussion on how to handle a PR issue in the event of an incident. There was previously an action on DK to provide some text, and there was discussion on who would speak for the project, including: STFC reputational concerns; whether an emergency PMB would be required; whether we should go through the STFC comms people; the fact that if the breach were at a particular university then that university would also have a team responsible; and that the response would depend on whose reputation is on the line – university, GridPP or STFC.
As of 15th July 2015 the latest version of the text was: “All security issues including security incidents are handled by GridPP’s security team according to an internationally agreed set of procedures. As such GridPP cannot respond to your query at this time but will, as needed, prepare a statement in due course.”
Action 620.1: DB to discuss communication in the event of an incident with DK.

6. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
———————————————–
Nothing significant to report.

SI-1 Dissemination Report (SL)
——————————
Nothing significant to report. SL has discussed with Jon the possibility of splitting the role into 2 x 0.5 FTEs that could be combined if the appropriate person applied; it was recognised that finding someone with Tom’s skillset will be challenging.
Action 620.2: SL will write job descriptions (2 x 0.5 FTEs) to replace Tom.

SI-2 ATLAS Weekly Review and Plans (RJ)
—————————————
Nothing significant to report.

SI-3 CMS Weekly Review and Plans (DC)
————————————-
Nothing significant to report.

SI-4 LHCb Weekly Review and Plans (PC)
————————————–
Nothing significant to report.

SI-5 Production Manager’s report (JC)
————————————-
1. The WLCG Tier-2 reliability/availability results for December 2016 are as follows:

ALICE: http://wlcg-sam.cern.ch/reports/2016/201612/wlcg/WLCG_All_Sites_ALICE_Dec2016.pdf
All okay.

ATLAS: http://wlcg-sam.cern.ch/reports/2016/201612/wlcg/WLCG_All_Sites_ATLAS_Dec2016.pdf
Oxford: 73%:73%
RALPP: 80%:80%

CMS: http://wlcg-sam.cern.ch/reports/2016/201612/wlcg/WLCG_All_Sites_CMS_Dec2016.pdf
RALPP: 89%:89%

LHCb: http://wlcg-sam.cern.ch/reports/2016/201612/wlcg/WLCG_All_Sites_LHCB_Dec2016.pdf
RALPP: 82%:82%

RALPP responded:

a) High load on the SRM from CMS causing issues for all VOs. This is ongoing and being worked on, but is not enough by itself to put us below the threshold.
b) The annual certificate renewal was undertaken in December, but the Argus services were not restarted; they therefore did not pick up the changed certificate, and the certificates expired just after we left for Christmas. It took a few days to work out what was going on and fix it.

2. A WLCG operations coordination meeting will take place on 26th January. There are two focus topics: a) follow-up on the downtimes proposal and b) tape usage performance analysis.

3. A summary of the WLCG networking pre-GDB that took place last week can be found here: https://indico.cern.ch/event/578982/contributions/2418711/attachments/1394050/2124515/Report_on_the_Pre-GDB_on_Networking.pdf.

4. There was a short GDB last week (http://indico.cern.ch/event/578982/). Topics included an update on the HNSciCloud work (http://indico.cern.ch/event/578982/contributions/2418697/attachments/1393822/2124112/2017-01-11-GDB-HNSciCloud.pdf), a primer from Alessandra on the WLCG workshop to be held in Manchester on 19th-21st June (http://indico.cern.ch/event/578982/contributions/2423302/attachments/1393913/2124295/20170111_GDB_manchester.pdf) and an IPv6 update from Dave (http://indico.cern.ch/event/578982/contributions/2418700/attachments/1393936/2124385/Kelsey11jan17.pdf).

5. Since just before Christmas Oxford has been running as a CMS diskless Tier-2 using disk at RALPP. Tests as of last week suggested everything was running correctly and there were no issues with disk server loads.

6. At the ops meeting last week Andrew M reported that there has been a recent VAC release that can handle differently sized VMs. Two weeks ago VACmon went live; this reports monitoring information in a similar way to Ganglia.

SI-6 Tier-1 Manager’s Report (GS)
———————————
Castor:
– There have been some issues with the LHCb Castor instance: for a period (a day or so) there was a high rate of failures copying from castor-disk to castor-tape. There is also an ongoing problem accessing some specific files.
– The Castor 2.1.15 update: the first step, the update of the Nameserver component, was successfully carried out last Tuesday. The “repack” instance is being updated today, and then the LHCb stager this Wednesday (18th).
– We continue to see load on the CMS Castor instance that has led to intermittent failures of the SAM tests of the SRMs, which in turn has led to poor availability for CMS.

Services:
– We are intervening today on a faulty connection used for the iSCSI link between some Windows hypervisors and their disk subsystem.
– The site-bdii services are being put behind a pair of load balancers.
– Changes are being made to the publishing of CPU resources from the CEs to try to correct the totals.

SI-7 LCG Management Board Report of Issues (DB)
———————————————–
There has been no MB meeting and the one scheduled for 17.01.17 is cancelled.

SI-8 External Contexts (PC)
———————————
No report.

REVIEW OF ACTIONS
=================
610.1: AS/GS Produce suggestions for one or more metrics that will summarise the Tier-1 network availability/performance. Ongoing.
616.3: DB and SL will discuss how best to progress replacement of TW’s role. Ongoing.
617.3: RJ will establish a priority order for resources to address issues arising. Done.
617.4: JC will document what sites and periods CPU is idle and could be used elsewhere and will summarise in an email to the PMB. Done.
617.5: PG will discuss with Ulrich requirements for GANGA going forward and report back to the PMB. Done.
617.6: TC will discuss with Romain to consider submitting an abstract for CyberUK 2017. Done.
617.7: SL will look into possible saturation at the 10% level for LHCb jobs and determine if more resources should be allocated. Done.
619.1: DB will update Tony Medland on the planned HW spend at £554K. Done.
619.2: DC will respond to PG this week on procurement plans for this financial year. Done.
619.3: PG will respond to Alastair Dewhurst’s email on Tier-1 pledges. Done.

ACTIONS AS OF 16.01.17
======================
610.1: AS/GS Produce suggestions for one or more metrics that will summarise the Tier-1 network availability/performance. Ongoing.
616.3: DB and SL will discuss how best to progress replacement of TW’s role. Ongoing.
620.1: DB to discuss communication in the event of an incident with DK.
620.2: SL will write job descriptions (2 x 0.5 FTEs) to replace Tom.