GridPP PMB Meeting 641

GridPP PMB Meeting 641 (31.07.17)
=================================
Present: Dave Britton (Chair), Tony Cass, Jeremy Coles, David Colling, Tony Doyle, Pete Gronbech, Steve Lloyd, Andrew McNab, Andrew Sansum, Louisa Campbell (Minutes).

Apologies: Pete Clarke, Gareth Smith, Roger Jones, Dave Kelsey.

1. NEI Survey
=============
PG made an entry for the whole GridPP project this morning. Many questions do not fit our model, so the format is not ideal but should suffice. AS noted that they also submitted a response and a set of spreadsheets for services including Tier1, APEL and GOK.

2. National e-Infrastructure Bid (Status)
==========================================
AS noted that £1.5M of STFC Capital is available and a sub-group of the UKT0 consortium has met in London to discuss submitting a collaborative grant application; there is good appetite for the various projects to work together. This will be challenging as the proposal has to be written for a common UKT0 project by about 22 August and both AS and PC are on holiday. AS is writing up the Cloud infrastructure for Disk and CPU, as well as possible consultancy on OpenStack and software for remote submission. Several sites are included, but this is for hardware and there is no resource to run the kit, so we must build on top of existing research and development projects – i.e. it can be dropped at RAL into Cloud and object store, but that may be more challenging at other sites (Manchester can operate in a similar way). Cambridge do not want hardware money but plan to contract software development; Edinburgh are currently considering potential involvement. Thus, the shape and parameters of the potential project are coming together well and another group call will be made this afternoon to firm up aspects. The grant starts December 2017 and ends March 2018 – probably to deliver hardware and possibly for consultation on software. AS reiterated that this is an exceptional opportunity for collaboration: if we can capture hardware and spread it out, it can lead to future collaborations and a longer-term community vision and activity.

3. Capital call for Enabling e-infrastructure (Status)
==========================================================
There is also other activity under Susan Morrell to submit a request for large capital investment in time for the autumn statement – the NEI survey may be contributing to that in case this new money becomes available across RCUK.

4. New Steve Lloyd test page
============================
SL advised that his tests were based on running Atlas jobs, but Atlas now runs more monitoring than he does. SL has therefore revamped his tests to focus on monitoring Dirac (link is on the Agenda). All feedback or suggestions for what should or should not be included are welcome. DB asked for clarification on the percentage figures in blue – SL is sending jobs and Dirac decides where they run, so the percentage is a ranking of where jobs are running. The figures reflect the amount of spare capacity. DC noted it was not previously clear how Dirac distributed jobs, and SL will put together a graph tracking this over the last few months. SL noted a wider and more even distribution of jobs than previously. DC suggested it would be useful to see how that matches what is used at Imperial.
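As an illustration only, the percentage figures described above can be read as each site's share of the submitted test jobs. The sketch below is not SL's code: the function name `job_share` and the site labels in the usage note are hypothetical, and it assumes the only input is a list recording which site each job ran at.

```python
from collections import Counter


def job_share(sites):
    """Return (site, percentage) pairs, largest share first.

    `sites` holds one entry per job, naming the site where it ran.
    """
    counts = Counter(sites)
    total = sum(counts.values())
    # most_common() orders sites by job count, highest first, which
    # gives the ranking of where jobs are running.
    return [(site, 100.0 * n / total) for site, n in counts.most_common()]
```

For example, `job_share(["SiteA", "SiteA", "SiteB", "SiteA", "SiteC", "SiteB", "SiteA", "SiteB"])` ranks the placeholder site SiteA first with half of the jobs.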

5. Tier-1 Procurement status/questions
======================================
Last week the PMB discussed whether we should spend money previously allocated to Tape on CPU. Looking at the ramp-up curve for tape usage over the last 12 months and extrapolating, there is no indication the experiments will use the amount of storage we pledged in 2016. Following the growth curve, the amount of tape available should suffice for the remainder of this year. If we look at formal resource requests, we would need to consider the purchase of c. £317K of extra tape. Disk and CPU tenders are going out soon, so we need to state the upper spend limit, and a decision must be taken on whether to incorporate the money currently set aside for tape. There is a balance to be struck: previously unused tape could be used next year, but with LTO8 consideration must be given to whether it is preferable to spend on CPU now and leave money for tape later. AS also needs to email Tony Medland to confirm the capital available for procurement before September because of the procurement deadlines.
DC noted that CMS is running short of tape and the extra that RAL was able to make available early was extremely useful; if Atlas are under-using theirs it may help here. Overall usage is coming out low, which means we would not need to buy tape early, but if CMS requires it early then monies would need to be used to procure it – perhaps it would be sensible to use up half this year to allocate to CMS in the short term. There was some discussion on the potential for rescheduling and longer-term timescales for planning. Running the older generations of disk and CPU is causing some challenges – old kit needs regular upgrading and patching and is not running very efficiently. The resource allocation meeting will be on 30 August.
ACTION 641.1: AS will email Tony Medland asking whether a decision can be advised before September on the availability of funding before the procurement submission deadline in August.
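The extrapolation of the tape ramp-up curve discussed in this item can be sketched as a straight-line fit to monthly usage. This is a minimal illustration and not the method actually used by the Tier-1: the function names are hypothetical, no real usage figures are included, and a linear trend is an assumption.

```python
def fit_line(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept). Needs at least two points.
    """
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def months_until_full(monthly_usage, capacity):
    """Months from the latest sample until the fitted trend reaches capacity.

    Returns None if usage is flat or falling, and 0.0 if the fitted
    trend is already at or above capacity.
    """
    slope, intercept = fit_line(monthly_usage)
    if slope <= 0:
        return None
    current = intercept + slope * (len(monthly_usage) - 1)
    if current >= capacity:
        return 0.0
    return (capacity - current) / slope
```

Given twelve monthly samples and the installed tape capacity in the same units, `months_until_full` gives a rough indication of whether existing tape lasts through the year; a result above the number of months remaining would support deferring the tape purchase.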

6. AOCB
=======
a) Thanks from LZ – Data Challenge. DB received a note of thanks from LZ with particular thanks to Daniela, Simon and Elena for their work on this.

7. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
Disbanded until September, nothing to report.

SI-1 Dissemination Report (SL)
------------------------------
Nothing of significance to report.

SI-2 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
RJ not in attendance, no report submitted.

SI-3 CMS Weekly Review and Plans (DC)
-------------------------------------
Nothing to report.

SI-4 LHCb Weekly Review and Plans (PC)
--------------------------------------
A couple of Queen Mary storage tickets have been ongoing for some time; they are now being dealt with since being renewed last week after it had expired. A StoRM configuration update impacted this – this works for other VOs.

SI-5 Production Manager’s report (JC)
-------------------------------------
It is a relatively quiet summer period for operations therefore only a few informative items this week:

1) Ian Neilson has now left GridPP. The security team will cover the security area including items such as the on-duty tasks. We thank Ian for all his valuable contributions.

2) ATLAS want to change Big Panda monitoring over to https, but this means only those with CERN SSO can access it. We can make arrangements but there has already been some short-term inconvenience.

3) RHUL has been added as another VAC site.

4) EGI WMS support will end in December 2017. We now only have one “light” user on the WMS as we started moving other regional VOs to DIRAC last year.

5) “LondonGrid” as a VO has been decommissioned. On Tuesday we will review the status of the other “Tier-2” VOs.

We are likely to have alternate weeks of ops meetings during August but will confirm after the ops meeting tomorrow.

SI-6 Tier-1 Manager’s Report (GS)
---------------------------------
Castor:
-------
– There have been several instances of problems with the Atlas SRMs (8th, 10th and 14th July). Some of the SRM systems were found to be unresponsive. At least one instance of this correlated with a spike in the SRM request rate. However, the cause is not really understood.
– We still have some ongoing failures of the SAM tests against the CMS Castor instance. We have also had a period when CMSDisk was full, which caused a significant failure of the SRM test jobs.
– Following the network problems in the early hours of 25th July (see below) there was a problem with Castor file transfers for those transfers initiated by the CERN FTS3 service. It turned out that CERN had updated their FTS service at around the time the network problem at RAL was fixed. CERN reverted their change at the end of Wednesday afternoon (26th) and transfers again moved successfully.

Services:
---------
Access to the RAL FTS3 service via the SOAP interface has been blocked (by removing the fts3-prod-soap proxy from the load balancers). This will enable the upgrade of the FTS3 service to the latest version which no longer supports this interface.

Echo:
-----
– The number of placement groups in the Echo Ceph Atlas pool continues to be increased. This is in preparation for the increase in storage capacity when new hardware is added.
– A test gateway to Echo (ceph-test-gw691.gridpp.rl.ac.uk) has been made dual stack and test transfers have been shown to work over IPv6 to/from CERN.
– The Echo team is working with CMS and LHCb to enable/progress their testing of access to Echo.

Networking:
-----------
– There was a site networking problem in the early hours of Tuesday morning, 25th July. One of the site core stacks stopped working correctly. The Tier1 core network connects via two routers into two of the site core network stacks to give resilience via a failover. Overnight the connection flipped between the connections to the two core stacks several times. However, it appears that the one failing stack was in a bad shape and even when nominally up was not working correctly. This failing stack was stopped in the morning which restored network connectivity and later the Tier1 router pair were set to run only through the good second stack. At the time of writing this (Thursday 27th) the central networking team await input from the vendors before intervening further on the problematic switch/router stack.
– We are tracking the ongoing problem with the site firewall that affects data flows.

Hardware:
---------
– Firmware updates to the RAID cards in the OCF 14 disk servers were carried out during the week 17-21 July. This was in response to a problem where the RAID cards were flagging disks as faulty that the vendor subsequently found were OK.
– For the last purchase of capacity hardware:
  – CPU testing is basically done. Benchmarking should take place this coming week.
  – Disk server testing has been OK. However, these systems need reconfiguring to make full use of the multipathing before further testing and ahead of going into service.

Here are the availability figures for June 2017 for the RAL Tier1.
------------------------------------------------------------------
These figures were:
Alice: 100%
Atlas: 98%
CMS: 95%
LHCb: 100%
OPS: 100%
(I have again included the OPS availability figures although these are not in the WLCG reports.)

Comments:
Atlas – Problems on 22/23 June with the SRMs. One of the processes kept crashing/restarting. This looks to be caused by the handling of double-slashes (“//”) in the filenames in incoming requests. Fixup applied to the SRMs.
CMS – Grumbly all through the month with sporadic failures and timeouts in Castor responses. In addition, there were particular problems on 24/25 June when CMSDisk became full, causing the tests to fail.

SI-7 LCG Management Board Report of Issues (DB)
-----------------------------------------------
No meeting, nothing to report.

SI-8 External Contexts (PC)
---------------------------
Nothing to report.

REVIEW OF ACTIONS
=================
638.1: PC will update the text on the Network Forward Look document for the forthcoming 2 years. Ongoing.
638.2: AS will check when equipment is due to become obsolete and investigate the legal and manpower aspects of donation to the African Data Centre for Bioinformatics and Medical Research. (Update: AS is looking into how this may impact the Global Challenges Research Fund – GCRF – which would involve a cross-Council bid.) Ongoing.
640.1: PG to collate numbers for NEI Survey before the deadline of 31 July and confirm by email. Done.

ACTIONS AS OF 31.07.17
======================
638.1: PC will update the text on the Network Forward Look document for the forthcoming 2 years. Ongoing.
638.2: AS will check when equipment is due to become obsolete and investigate the legal and manpower aspects of donation to the African Data Centre for Bioinformatics and Medical Research. (Update: AS is looking into how this may impact the Global Challenges Research Fund – GCRF – which would involve a cross-Council bid.) Ongoing.
641.1: AS will email Tony Medland asking whether a decision can be advised before September on the availability of STFC funding before the procurement submission deadline in August.