GridPP PMB Meeting 642

GridPP PMB Meeting 642 (14/08/17)
=================================
Present: Dave Britton (Chair), Tony Doyle, Roger Jones, Steve Lloyd, Gareth Smith, Louisa Campbell (Minutes).

Apologies: Pete Clarke, Pete Gronbech, Dave Kelsey, Andrew McNab,
Tony Cass, Jeremy Coles, David Colling, Andrew Sansum.

1. Update on the National e-Infrastructure/UK-T0 Bid (Status)
=============================================================
There have been no recent updates in this regard.

2. GridPP39 – Agenda
====================
AM suggests a Singularity talk – GS confirmed there is no one at RAL running Singularity; AM can update. CMS are interested in this. Singularity is run by Fermilab and is a very lightweight way of running containers effectively, tailored towards HEP. It is a topical subject that will be covered at ACAT next week.
DB suggests Anthony Davenport to give the first talk to engage him and hear his thoughts on GridPP – re the vision of STFC, of infrastructure and its relevance to GridPP.
SL suggests GPUs and how to couch them. Phil Clarke, Edinburgh, did some work on this a while ago; Manchester may have some; QMU have some VOs – Dan could perhaps provide some insights.
PC suggests we may give a standing invitation to Dirac and SKA representatives (e.g. Rosie and Paul). This seems sensible and workable so long as open dialogue remains central to the meeting. There is a new director of Dirac (Mark Wilkinson) and this seems an opportune moment to extend an invitation to him.
DB invited suggestions, e.g. Tier-1 topics. GS suggested that Echo and others may offer good topics as there is now operational experience to discuss. Echo talks have been given a few times, but we still need to know the status – decisions need to be taken on Tier-1 talks that can be covered here. GS and AS will discuss what might usefully be covered.
Atlas – RJ will give some thought to useful talks. The Network Forward Look document has been worked on and put to the side temporarily. It may be useful to summarise this and ensure it converges soon after the UKT0 submission at the end of August.
GS suggests LHC status – this has been up and running for a while now and it would be useful to understand the current situation. RJ may look into giving a short talk or asking someone to give a presentation from the Physics perspective.
LC to check whether David Salmond and Duncan Rand are on UKHEP and, if not, invite them to join.
DB is keen to provide as much input as possible so that PG can work up an appropriate agenda and asked for these to be sent on asap.
ACTION 642.1: DB will extend an invitation to the new Dirac director and a representative of SKA to attend GridPP39.
ACTION 642.2: AS & GS will discuss Tier-1 talks that could be included in agenda for GridPP39.
ACTION 642.3: LC will check whether David Salmond and Duncan Rand are on UKHEP and, if not, invite them to join.
ACTION 642.4: RJ will consider who to give a talk on LHC and CERN current status.

3. AOCB
=======
a) Date of next PMB – Monday 4th September at 1pm appears to be the most suitable next date given availability. Thereafter there will be a F2F on 13th September, followed by a meeting on 25th September and then a return to normal weekly meetings.

4. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
No report submitted.

SI-1 Dissemination Report (SL)
------------------------------
STFC were happy with the report; it is now with the administrators, who have not yet submitted any comment. Imperial may need to initiate something in this regard. There is some question on whether VAT is payable – DC is currently checking this.

SI-2 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
Nothing significant to report.

SI-3 CMS Weekly Review and Plans (DC)
-------------------------------------
No report submitted.

SI-4 LHCb Weekly Review and Plans (PC)
--------------------------------------
No report submitted.

SI-5 Production Manager’s report (JC)
-------------------------------------
No report submitted.

SI-6 Tier-1 Manager’s Report (GS)
---------------------------------
Infrastructure:
---------------
– Power work was carried out in building R26 (the Atlas building) over the weekend of 29/30 July as planned. This had no impact on our operational services due to the mitigations put in place. (These were the migration of VMs to hypervisors in building R89 and an alternative power supply being put in place for the Castor standby databases.)

– There was a successful UPS/Generator load test this morning (9th Aug). These are done quarterly and this was the first regular test since the building UPS was replaced. (It had been tested shortly after installation).

Castor:
-------
– There was a problem with the Atlas Castor SRMs in the early hours of Saturday 5th August. For reasons not understood there was an increase in the query rate to the SRMs from Atlas work. This overwhelmed the SRMs. After some work by both the database and Castor on-call staff an outage was declared for Atlas in the GOC DB. Once the load had reduced the SRMs were able to recover and the services returned to normal. It is possible the problem was related to the small number of (old) disk servers in the AtlasScratch pool causing poor performance for Castor. On the 7th August this pool was merged into the larger AtlasDataDisk pool and this may reduce the chance of this problem recurring.

– There was a problem with Atlas Castor during the afternoon / early evening of Thursday 10th August. The symptoms appeared different from those of the previous problem (above). Atlas Castor was restarted and the problem went away during the evening. However, the cause is not understood.

– We continue to see poor CMS availability with sporadic test failures – usually timeouts.

Much of Atlas has moved to Echo, and DB questioned whether this may take pressure off Castor. There is no apparent easing off as yet; GS will look into this and may include it in the GridPP39 agenda. Echo storage is being used in addition to existing storage.

Services:
---------
– All squid nodes are now IPv4/IPv6 dual stack.
– “Test” FTS3 instance (used by Atlas) updated to 3.6.10 (emergency update – as this was the first server to reach 2 billion transfers. Due to an internal 32-bit integer being used it completely stopped working at this point.)
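
As a rough illustration of the kind of failure described in the FTS3 item above, the sketch below shows what happens when a counter stored in a signed 32-bit integer is incremented past its maximum of 2,147,483,647 (roughly the "2 billion" mentioned). The function names are hypothetical, and the minutes do not say which FTS3 field was affected or whether it was signed, so this is an assumption-laden sketch rather than a description of the actual bug.

```python
import ctypes

# Largest value a signed 32-bit integer can hold (assumed signed for this sketch).
INT32_MAX = 2**31 - 1  # 2,147,483,647


def increment_int32(counter: int) -> int:
    """Add one to a counter as if it were stored in a signed 32-bit field."""
    # ctypes.c_int32 performs no overflow checking; it keeps only the low 32 bits,
    # which mimics how a fixed-width field would wrap around.
    return ctypes.c_int32(counter + 1).value


def increment_int64(counter: int) -> int:
    """Same increment, but with a 64-bit field, which has ample headroom here."""
    return ctypes.c_int64(counter + 1).value


if __name__ == "__main__":
    print("32-bit:", increment_int32(INT32_MAX))  # wraps to a negative number
    print("64-bit:", increment_int64(INT32_MAX))  # continues counting normally
```

A counter that suddenly becomes negative (or is rejected by a database column) is one plausible way for a service to stop working at this threshold; widening the field to 64 bits is the usual remedy.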

Echo:
-----
The planned increase in the number of placement groups in the Echo Ceph Atlas pool has been completed. The remaining third of the 2015 storage purchases has been placed into Echo, and the process of moving data so that use is made of this hardware has started.

Networking:
-----------
– There was a network break during the morning of Wednesday 2nd August, which unfortunately coincided with staff being at a divisional meeting. There had been a problem with one of the RAL core network stacks on the 25th July. We had set our router pair (the Extreme X670s) not to flip back to use the link to this failing stack. However, during work to resolve the problem on the failed core stack, our second link to another core stack went down – it appears our routers thought there was a network loop. This caused the Extreme X670 router pair to try switching back to the other connection. The upshot was a complete break in Tier1 connectivity to the core for around an hour. All network systems have since been fully restored and the fail-over configuration returned to its normal state. There was some delay in re-establishing IPv6 connectivity.
– Usage of the OPN has been high – with traffic exceeding the previous 20 Gbit/s limit for some hours overnight last Tuesday – Wednesday. (See attached plot). On Wednesday (16th) the route taken by one of the links will be changed to improve resiliency. (Technically: the LHC-2 10GE circuit from RAL to CERN will move from Harbour Exchange to Powergate.)

Hardware:
---------
– For the last purchase of capacity hardware:
  – CPU testing is basically done. CPU benchmarking is done for various scenarios (SL7, SL7 with SL6 containers, SL7 with SL7 containers). Power benchmarking remains to be done before the systems are ready for production use.
  – Disk server testing has been OK. The re-configuring of the systems to use the appropriate multipath access to the disks has now been done and configured in Quattor. The servers still need testing with this multipath configuration in place.

DB questioned benchmarking (whether you need to benchmark all machines or just a sample) – GS is looking at this and will advise.

Availability figures for July 2017 for the RAL Tier1
----------------------------------------------------
Alice: 100%
Atlas: 95%
CMS: 91%
LHCb: 99%
OPS: 99.2%
(I have again included the OPS availability figures although these are not in the WLCG reports.)
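
To give a feel for what these percentages mean in practice, the following sketch converts each monthly availability figure into an approximate number of unavailable hours in July (31 days, 744 hours). This is only an indicative calculation: it assumes availability is a simple fraction of wall-clock time, whereas the reported figures are derived from test results, and the helper name is hypothetical.

```python
# July has 31 days, so 744 hours in total.
HOURS_IN_JULY = 31 * 24

# Availability figures as reported in these minutes (percent).
availability_pct = {
    "Alice": 100.0,
    "Atlas": 95.0,
    "CMS": 91.0,
    "LHCb": 99.0,
    "OPS": 99.2,
}


def unavailable_hours(pct: float, period_hours: float = HOURS_IN_JULY) -> float:
    """Approximate hours of unavailability implied by an availability percentage."""
    return (100.0 - pct) / 100.0 * period_hours


if __name__ == "__main__":
    for vo, pct in availability_pct.items():
        print(f"{vo}: {pct}% available, roughly {unavailable_hours(pct):.1f} hours unavailable")
```

On this simple reading, the CMS figure of 91% corresponds to a little under three days of failed availability over the month.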

Comments:
All: Network problem on 25th July affected all VOs (including OPS).
Atlas – There have been a few instances of problems with the Atlas SRMs, with a few days showing notably bad availability (e.g. 24th July, when availability was only 26%).
CMS – Grumbly all through the month, with sporadic failures (timeouts in Castor responses). In addition there were particular problems on Sunday 16th, when on-call staff intervened. The back-end database reported locking sessions and hot-spotting of files was seen. Then for a few days around the 21st July CMSDisk became full – causing further test failures.

It is not clear what is causing this grumbliness. It shows as failed tests; perhaps Castor is sometimes slow to respond to CMS.

SI-7 LCG Management Board Report of Issues (DB)
-----------------------------------------------
There has been no MB meeting.

SI-8 External Contexts (PC)
---------------------------
Nothing to report.

REVIEW OF ACTIONS
=================
638.1: PC will update the text on the Network Forward Look document for the forthcoming 2 years. Ongoing.
638.2: AS will check when equipment is due to become obsolete and investigate legal and manpower of donation to the African Data Centre for Bioinformatics and Medical Research. (Update: AS is looking into how this may impact Global Challenge Research Fund – GCRF – which would involve a cross-Council bid) Ongoing.
641.1: AS will email Tony Medland asking whether a decision can be advised before September on the availability of STFC funding before the procurement submission deadline in August. Ongoing.

ACTIONS AS OF 14.08.17
======================

638.1: PC will update the text on the Network Forward Look document for the forthcoming 2 years. Ongoing.
638.2: AS will check when equipment is due to become obsolete and investigate legal and manpower of donation to the African Data Centre for Bioinformatics and Medical Research. (Update: AS is looking into how this may impact Global Challenge Research Fund – GCRF – which would involve a cross-Council bid) Ongoing.
641.1: AS will email Tony Medland asking whether a decision can be advised before September on the availability of STFC funding before the procurement submission deadline in August. Ongoing.
642.1: DB will extend an invitation to the new Dirac director and a representative of SKA to attend GridPP39.
642.2: AS & GS will discuss Tier-1 talks that could be included in agenda for GridPP39.
642.3: LC will check whether David Salmond and Duncan Rand are on UKHEP and, if not, invite them to join.
642.4: RJ will consider who to give a talk on LHC and CERN current status.