GridPP PMB Meeting 615

GridPP PMB Meeting 615 (28.11.16)
=================================
Present: Dave Britton (Chair), Jeremy Coles, David Colling, Tony Doyle, Pete Gronbech (Minutes), Roger Jones, Steve Lloyd, Andrew McNab, Andrew Sansum, Gareth Smith.

Apologies: Tony Cass, Dave Kelsey, Pete Clarke

1. OSC
======
The OSC meeting went fine; DB gave the talk.
There was not too much on the Project Management side to report as we are at the beginning of the project and there were more important things to discuss. There were several interruptions and questions; the Capital/Resource split is the largest issue. They impressed on TM the difficulties we had last year and now with Brexit and increased resource requests from LHC. They also noted our 92% level funding for hardware and the extra demands related to EU-T0, which all combine to make it very difficult. It is not clear what the exchange rate will do, so in some ways it could be better to buy now in case the exchange rate gets worse, despite losing out on Moore’s law. It is possible there may be capital opportunities coming along following the Autumn Statement.

2. Pledges
==========
We must enter the WLCG pledge by Wednesday. There is some freedom to alter the level if required. AS expressed some concern about the procurements, which could be delayed by SBS. It has just been discovered that a SCARF purchase, which was thought to be well on the way, has not yet gone out; late delivery does not mean no delivery, but there are financial implications. DC suggested the experiments will not be too worried until the LHC starts up (in ~June?). It is thought the pledge should be made regardless. The Committee made several good comments and noted all the things we note. (See PG notes)

3. Q Reps
=========
Almost complete.

4. Tier-2 HW grants & Status
============================
It is believed DIRAC will be paying us for the 5PB of tape from FY17, of the order of £30k per year. Anthony Davenport attended the meeting; he is Sarah Verth’s counterpart on the DIRAC side of things. The problem statement of having to deal with Capital/Resource was reiterated, and it was made clear to us that we had always managed to deal with it before and they expected us to manage again. Re Autumn Statement: BEIS received some money but no clear guidance on where to spend it, and they need to decide what to fund.

Our intention is to pledge 60% of the increase between the early and late LHC experiment requests.
This has delayed procurement so there may be a delayed deployment.
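
As a rough illustration of the pledge arithmetic above (a sketch only: it assumes the pledge is the early request plus 60% of the rise to the late request, which is one reading of the statement; the function name and the example figures are hypothetical and are not taken from any experiment request):

```python
def pledge_level(early_request: float, late_request: float, fraction: float = 0.6) -> float:
    """Early request plus a fraction of the increase to the late request.

    Assumption: this is one reading of "60% of the increase"; it is not
    a formula stated in the minutes.
    """
    increase = late_request - early_request
    return early_request + fraction * increase


# Hypothetical figures in arbitrary units, for illustration only.
print(pledge_level(early_request=100.0, late_request=120.0))
```

If the early and late requests are equal, the sketch returns the early request unchanged.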

RJ noted smaller sites are encouraged to deploy CPU only, with a small disk cache. It was 400TB about a year ago. Medium sites may go towards federated storage but we want to look at this more carefully. If a site has a lot of CPU then obviously ATLAS wants it to have storage. These are ATLAS statements, but there should be a GridPP policy. DB suggested we should examine all the sites, and if we only want the large sites to have storage and the other sites to go CPU only then we should say so. RJ and DB will pick this up within ATLAS. RJ will forward Simone’s statement to DC to disseminate to tb-support.

LSST case study: the press office at Manchester has issued a press release, which has been picked up by some technical outlets.

A discussion about the future dissemination metrics took place.

ACTION 615.1: RJ and DB to discuss policy on CPU and storage with ATLAS.

ACTION 615.2: RJ will forward Simone’s statement to DC to disseminate to tb-support.

5. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
Nothing of significance to report.

SI-1 Dissemination Report (SL)
------------------------------
## GridPP Engagement Officer Notes for PMB

### GridPP in the news – LSST case study

Thanks to the Press Office at the University of Manchester, the GridPP/LSST Press Release has been released [1] and picked up by a number of physics and tech outlets [2, 3, 4].

[1] http://www.manchester.ac.uk/discover/news/the-dark-universe/

[2] https://astronomynow.com/2016/11/26/grid-computing-to-tackle-the-mystery-of-the-dark-universe/

[3] http://phys.org/news/2016-11-brilliance-tackles-mystery-dark-universe.html

[4] http://www.nanowerk.com/news2/space/newsid=45181.php

SI-2 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
RJ noted an issue with pilot factories, which was detected this morning and is being worked on. There were some issues with reprocessing: the RAL FTS got stuck while Andrew Lahiff was away, so A Dewhurst has been trained up to be able to deal with this.

There was some minor file loss at Glasgow.

RALPP PanDA database config issue, now fixed.

ECHO at RAL is passing HammerCloud tests.

Two more power cuts at Glasgow over the weekend – DB is in meetings about building new computer room.

SI-3 CMS Weekly Review and Plans (DC)
-------------------------------------
Nothing significant to report from CMS.

AL will report on running CMS work in containers on various cloud providers.

SI-4 LHCb Weekly Review and Plans (PC)
--------------------------------------
AM will attempt to ensure there are sufficient MC jobs to run over Christmas. AM will be doing a talk at RCUK tomorrow, mainly on SKA but using some LHCb tools. Also run jobs at datacentres.

SI-5 Production Manager’s report (JC)
-------------------------------------
Operations related updates for your consumption:

1. Sites are taking a keen interest in CentOS7 and many are already starting to transition. Several GridPP sites are further supporting the WLCG work in this area, particularly ECDF and Brunel. On a related note, another review of SL5 status in GridPP is being undertaken to catch hidden nodes (e.g. storage headnodes). EGI requested sites to move a while back, but the next major deadline is when SL5 support ends in March.
2. Regional monitoring of VOs will continue to be done with Nagios until March 2017 as no ARGO solution exists – a request has been escalated within EGI by us as this functionality is not high-priority in the ARGO development plans at the moment. The variable activity of small/other VOs requires us to look at the monitoring here again more closely.
3. A widening of the request for support of the weekly (ops) sites meeting raises the question about the core operations model. Effort is currently being used in areas such as GridPP DIRAC support, ticket reviewing, WN/UI tarball maintaining, regional on-duty activities, OS rollout/testing, networking and security (including coordination centres).
4. NGI_UK has 24 tickets at the time of writing. The trend is downwards, having been above 30 for many months. The UK remains one of the better performing regions as judged by EGI metrics (see attachment from the EGI Operations Management Board last week).
5. NGI_UK based core services (APEL/GOCDB) also have good overall performance over the last year. (See third attachment showing recent issues across EGI).

JC mentioned a last check to ensure all hidden nodes are no longer running SL5. JC raised the lack of support for VO Nagios beyond SL5 at EGI level. Kashif will run LSST VO monitoring on the VO Nagios until March 2017. Core Ops: as the number of support staff decreases it will be difficult to be fully involved in all the core tasks. DB suggested that a look at the core activities could be a theme for the next GridPP meeting.

SI-6 Tier-1 Manager’s Report (GS)
---------------------------------
Castor:
– There was a problem on Tuesday/Wednesday of last week with the SRMs for the Castor ‘GEN’ (Alice plus non-LHC VOs) instance. Updated host certificates on these nodes did not contain the various alias DNS names used to access the service, blocking access via this route. Owing to lack of staff present, GGUS tickets from both SNO+ and MICE were not picked up on Tuesday; the problem was fixed on Wednesday morning.
– Final four of the twelve additional servers (each 120TB) have been added to LHCb.
– As reported before, the testing of Castor 2.1.15 is largely complete. Owing to staff availability, this update will be carried out in the New Year, with the intention of completing it by the end of January.
– We are looking to merge smaller disk pools into larger ones for both LHCb and ATLAS. We expect to do this in December, subject to confirmation from the two VOs before going ahead.

Tape:
– Migration of LHCb data from ‘C’ to ‘D’ tapes ongoing. Approaching the 30% mark with just over 700 out of the 1000 tapes still to do.
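
The progress figure quoted above can be cross-checked with a short calculation. This is a minimal sketch; the function name is hypothetical and the count of 700 is a rounding of "just over 700" from the report:

```python
def migration_progress(total_tapes: int, remaining_tapes: int) -> float:
    """Return the fraction of tapes already migrated (0.0 to 1.0)."""
    if total_tapes <= 0:
        raise ValueError("total_tapes must be positive")
    if not 0 <= remaining_tapes <= total_tapes:
        raise ValueError("remaining_tapes must be between 0 and total_tapes")
    return (total_tapes - remaining_tapes) / total_tapes


# Figures from the report above: roughly 700 of 1000 tapes still to do.
progress = migration_progress(total_tapes=1000, remaining_tapes=700)
print(f"{progress:.0%} migrated")
```

With slightly more than 700 tapes remaining the result falls just below 30%, consistent with "approaching the 30% mark".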

Services:
– Early last week ATLAS reported a backlog of transfers stuck in our FTS service, which was resolved by Alastair.

SI-7 LCG Management Board Report of Issues (DB)
-----------------------------------------------
AM reported on the change to the accounting system, which is not controversial and is being made for data protection reasons. John Gordon was happy with the proposed changes.

SI-8 External Contexts (PC)
---------------------------
Nothing of significance to report.

REVIEW OF ACTIONS
=================
605.1: DK will investigate costs and timescales of upgrading the OPN Link to 30 and report back to PMB. Ongoing.
610.1: AS/GS Produce suggestions for one or more metrics that will summarise the Tier-1 network availability/performance. Ongoing.
612.3: PG will determine which small sites can undertake procurement this FY. Ongoing.
613.1: AS will undertake a post mortem on CMS issues at Tier-1. (UPDATE: AS has pulled a lot of information together and will speak more to AL, he has good information from Chris Brew and will speak to Rob Appleyard about CASTOR). Ongoing.
613.3: PG will create slides containing tables and notes for information with resource, capital and financials. Done.
613.4: PG will write to sites to provide instructions and request they commence procurement of minimum £10K for capital. Done.
613.5: ALL submit Q3 reports to PG. (Update: most now submitted, others will be submitted this week) Ongoing.
614.1: DB will finalise OSC slides and forward to AS for figures update then circulate to PMB. Done.
614.2: DC will liaise with various Operational team members on the CMS issue at Tier-1 to ascertain when the issue was identified and by what mechanism it was escalated. Ongoing.
614.3: PG will follow up with a further reminder to Durham, Brunel and Bristol on submission of Tier-2 HW grants and email STFC to check if all is in order. Done.
614.4: DB will enter pledged amounts into REBUS then check before formal submission. Ongoing.

614.5: PG will advertise the pledges summary on the website to the Ops team and ask them to highlight any issues with the proposed numbers for each site. (Update: PG to send latest to SL email CB and TB support). Ongoing.

614.6: AS will email DC regarding pledges and availability. Done.

614.7: JC will contact Ian Collier to discuss potential date clashes between HEPSYSMAN and WLCG meetings. Ongoing.

ACTIONS AS OF 28.11.16
======================
605.1: DK will investigate costs and timescales of upgrading the OPN Link to 30 and report back to PMB. Ongoing.
610.1: AS/GS Produce suggestions for one or more metrics that will summarise the Tier-1 network availability/performance. Ongoing.
612.3: PG will determine which small sites can undertake procurement this FY. Ongoing.
613.1: AS will undertake a post mortem on CMS issues at Tier-1. (UPDATE: AS has pulled a lot of information together and will speak more to AL, he has good information from Chris Brew and will speak to Rob Appleyard about CASTOR). Ongoing.
613.5: ALL submit Q3 reports to PG. (Update: most now submitted, others will be submitted this week) Ongoing.
614.2: DC will liaise with various Operational team members on the CMS issue at Tier-1 to ascertain when the issue was identified and by what mechanism it was escalated. Ongoing.
614.4: DB will enter pledged amounts into REBUS then check before formal submission. Ongoing.

614.5: PG will advertise the pledges summary on the website to the Ops team and ask them to highlight any issues with the proposed numbers for each site. (Update: PG to send latest to SL email CB and TB support). Ongoing.

614.7: JC will contact Ian Collier to discuss potential date clashes between HEPSYSMAN and WLCG meetings. Ongoing.