GridPP PMB Meeting 687 (19.11.18)
=================================
Present: Dave Britton (Chair), Tony Cass, Pete Clarke, David Colling, Jeremy Coles, Alastair Dewhurst, Tony Doyle, Pete Gronbech, Roger Jones, Dave Kelsey, Steve Lloyd, Andrew McNab, Gareth Roy, Andrew Sansum, Louisa Campbell (Minutes).

Apologies:

1. Tier-1 Purchase Update
=========================
AD summarised his recent email on Tier-1 purchase and procurement issues. It referred to a 6-week delay to tenders caused by a legal challenge over transparency that STFC has been dealing with; this has recently been resolved and has resulted in updated tendering guidelines. Procurement is further affected by recent US tariffs on China, with a knock-on impact of up to 25%. At the moment the Disk tender is in and live, and AD believes the contract should be awarded before Christmas for delivery in mid-February; this may be affected by the tariffs, but delivery will be scheduled with leeway to mitigate any potential delays. The CPU tender has not been submitted, since these are customarily subject to a 6-week delay, which should be avoided if possible. Two options are available:
a) Direct award (XMA at top of direct award pile – a quote has been requested for the same equipment as last year which should soon be available); or
b) Mini-tender (remove all reference to benchmarking and specify types of machine, e.g. memory, disk, CPU) which allows more suppliers to bid.
The PMB has worked hard to avoid issues that could lead to procurement challenges and delays, but these matters are outside our control. It was agreed that direct award to XMA is the preferred route: the single tranche is not an issue since it is the same equipment as last year, which avoids issues relating to technology, compatibility, etc. This is a small procurement, and consideration should be given to whether two tranches are preferable in future. A further concern about the mini-tender route and removing benchmarking relates to CPU technology and carries two risks: unknown price increases and, despite a mini-tender being potentially quicker, risks to the supply chain due to tariffs. AD will progress option a) with all haste.

2. CRICK visit
=============
GridPP has been invited to speak about Grid computing on Wednesday 21st November; attendees are DB, PC, AS, AM, DC and Ian Collier. The agenda is to assess how they will undertake scientific computing in the future beyond the CRICK institute scale. This will provide opportunities for overviews and for discussion of potential future collaboration. GridPP should talk for approximately 45-60 minutes with multiple contributors: DB (intro), then PC (UK context and how GridPP fits into IRIS, STFC, etc.), AS (Tier-1 and SCD as key elements of GridPP and how they contribute to this and other projects), Ian Collier (WLCG) and AM (how GridPP is distributed in the UK). DB will report back to the PMB after the meeting.

3. AOCB
=======
a) LSST:DESC
Strategies to bring other projects on board and offer assistance were discussed. In recent talks for IRIS, EUCLID and LSST advised that they were unable to use the Grid for various reasons. This should be addressed through a policy ensuring that, if users are invited to use the Grid, we can deliver their requirements. GridPP could consider a batch service outwith the ‘normal’, or alternatives. We should make clear what can be offered and invite participation, based on previous experience. There was some discussion on support of Ganga and on differences between GridPP and SCD relating to support of particle physics going forward. It was agreed that the website user guides require updating regarding Ganga, and that there should be a technical post-mortem on how we (SCD and GridPP) could have met LSST requirements and reporting structures. JC raised this with the PMB in July and PC will discuss with George at LSST.

DC confirmed his team is keen to work with Dirac on something less high-profile to consider alternative solutions. It was suggested that, if a time-critical exercise is planned on GridPP, weekly meetings should be arranged with the relevant PMB member to monitor progress.

ACTION 687.1: PC will report to PMB after discussions with LSST.

b) OSC documents – PG circulated final versions to the PMB.

5. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
Rucio-based meeting: AD is currently preparing minutes. This was a very useful meeting for various reasons; there are several projects at an early stage that will be tracked.

SI-1 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
The CentOS migration has taken place at RAL. Harvester has migrated, and related job-scheduling issues are ongoing. The Tier-1 report deals with these and other aspects.

SI-2 CMS Weekly Review and Plans (DC)
-------------------------------------
DC mentioned issues with the Global pool at CERN; it has moved to Fermilab, which has had a slight impact, but there is nothing UK-specific to cover. DB asked about the request for additional disk space. This has been actioned in the last week (CMS wrote to Echo, which elevated them above last year’s pledge); DC will discuss at meetings on Thursday.

SI-3 LHCb Weekly Review and Plans (PC)
--------------------------------------
PG noted a request to Tier-3 sites (very local batch clusters) for additional CPU for Monte Carlo, but this should not be actioned as we are already exploiting resources for this, which AM summarised.

SI-4 Production Manager’s report (JC)
-------------------------------------
T2K: delays where LHC became read-only are being addressed (now working). AD confirmed the hope is to migrate 90% of the files this week and the remainder next week, after which T2K (the largest user) will be off the LHC.
EGI operations management board meeting: the EGI community is discussing the future of PDRI and whether to continue supporting it. There is an indicative timeline of mid-2019.
Tier-2 availability/reliability: October figures were very good, with no site falling below 95%. Plugin to Rucio: DB suggested someone in the UK should consider this. There was an update on HEPIX network functions, and a DERMA update talk caused some discussion. Authentication/authorisation was updated and the ATLAS data carousel was also updated.

SI-5 Tier-1 Manager’s Report (AD)
---------------------------------
– The ClusterVision 17 disk servers were occasionally rebooting. This has been fixed with a firmware patch that was rolled out last week. The hardware continues to be weighted up, and we exceeded 1Tb/s aggregated throughput while this was happening (see plot).

– On Thursday 15th November, all non-LHC VOs were migrated to the new consolidated Castor tape instance.

– CMS AAA issues are ongoing, although we may have finally turned a corner. We found that certain CMS requests do not specify the Ceph pool to use (in this case it should be “cms”). As we did not have a default pool set, these requests failed. We also found that a bug meant any subsequent requests were then sent to the (non-set) default pool, causing all transfers to fail; recovering from this required a restart of the XRootD process. For the CMS AAA service we have added a default pool, together with monitoring of the logs that will restart the process should it detect that transfers are using a non-set pool. SAM test results were at 96% over the weekend.

– We had a meeting on Thursday 15th November to discuss the fact that ATLAS jobs won’t work at RAL via Singularity. The Tier-1 will not be changing the settings to allow ATLAS jobs in their current state to work. We believe it is an unacceptable security risk (and is very inefficient as well). Other major sites (OSG and CERN) have come to the same conclusion. CMS are using Singularity in production and do not have this issue. AD will request that Tim Adye gives a talk at the next relevant ATLAS meeting. We will also be giving a talk at the December GDB (probably David Crooks), which will try to establish general site agreement that the ATLAS use case should not be allowed. Resolution should be found through face-to-face discussion and email. Regarding the move from Docker to Singularity, it should be recognised that ATLAS operate slightly differently for various reasons.
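
The log monitoring described for the CMS AAA service above could be sketched roughly as follows. This is only an illustration and not the Tier-1's actual monitoring: the log format (a `pool=<name>` field, empty when no pool is set), the function names and the `restart` callback are all assumptions made for the example, since the minutes do not give the real XRootD log messages or restart mechanism.

```python
import re
from typing import Callable, Iterable

# Hypothetical log format: each transfer line carries "pool=<name>".
# An empty name stands in for the non-set default pool.
POOL_FIELD = re.compile(r"\bpool=(\S*)")


def uses_unset_pool(line: str) -> bool:
    """Return True if the line records a transfer against an empty pool name."""
    match = POOL_FIELD.search(line)
    return match is not None and match.group(1) == ""


def monitor(lines: Iterable[str], restart: Callable[[], None]) -> int:
    """Scan log lines and call restart() each time an unset pool is seen.

    Returns the number of restarts triggered.
    """
    restarts = 0
    for line in lines:
        if uses_unset_pool(line):
            restart()
            restarts += 1
    return restarts
```

In a real deployment `restart` would wrap whatever mechanism the site uses to restart the XRootD process, and some rate limiting would probably be wanted so that a burst of bad lines does not trigger repeated restarts.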

SI-6 LCG Management Board Report of Issues (DB)
-----------------------------------------------
Nothing to report.

SI-7 External Contexts (PC)
---------------------------
Nothing to report.

REVIEW OF ACTIONS
=================
644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress. 08/10/18 – Leicester are producing a PO for tapes and will send to AD to produce an invoice). Ongoing.
667.2: PG will do h/w planning before next OC to provide OC with details of shortfall in funds. (Update: PG will check the OSC minutes for details and cover with GR). Ongoing.
678.2: DK to finalise the Security, Trust and Identity background document by mid October. (Update: DK and David Crooks have been working on this and it is nearly complete) Done.
678.3: AD to finalise the Tier1 background document, including tape strategy by end September. (Update: Almost complete and will circulate current iteration for comment). Ongoing.
678.5: JC to finalise the Storage background document by end September. (Update: 17 October meeting with Tony Medland & DB and PC will attend. This is almost complete and awaiting a few minor elements to be worked in; GR will upload into Google Docs for info). Ongoing.

ACTIONS AS OF 19.11.18
======================

644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress. 08/10/18 – Leicester are producing a PO for tapes and will send to AD to produce an invoice). Ongoing.
667.2: PG will do h/w planning before next OC to provide OC with details of shortfall in funds. (Update: PG will check the OSC minutes for details and cover with GR). Ongoing.
678.3: AD to finalise the Tier1 background document, including tape strategy by end September. (Update: Almost complete and will circulate current iteration for comment). Ongoing.
678.5: JC to finalise the Storage background document by end September. (Update: 17 October meeting with Tony Medland & DB and PC will attend. This is almost complete and awaiting a few minor elements to be worked in; GR will upload into Google Docs for info). Ongoing.
687.1: PC will report to PMB after discussions with LSST.