GridPP PMB Meeting 645 (25.09.17)
=================================
Present: Dave Britton (Chair), Pete Clarke, Jeremy Coles, David Colling, Pete Gronbech, Steve Lloyd, Andrew McNab, Andrew Sansum, Gareth Smith, Louisa Campbell (Minutes).

Apologies: Tony Doyle, Roger Jones, Dave Kelsey, Tony Cass.

1. WLCG GridPP Pledge levels 2018
=================================
Due end September, PG summarised:
There were few changes on the Viva page, which PG has now checked. Global requirements for CPU and tape have reduced slightly since earlier in the year; for RAL this reduction helps with the decreased funds, and delivery can be targeted appropriately as discussed at the F2F. The pledge level has been set at 90% of the uplift between the original plan and the updated request, and the figures have been sent to AS for comment. Attempts are made to match the Tier2 pledge to the Tier1 pledge; PG has used the usual spreadsheet to base this on the same criterion (meeting 90% of the uplift) and confirms all sites should be able to meet this.
DB noted the information should be circulated to the CB and site admins. Last year the site most challenged in terms of disk was Liverpool, and PG will check the figures are acceptable with them. Although pledges are only at Tier2 level, unforeseen issues should not be experienced elsewhere. PG will check with other sites where margins are closer, and once Tier1 carries out a sanity check the figures can be loaded into Rebus. DB exchanged emails with Tony Medland, who agreed the 90% target was acceptable; DB made clear we are working hard to meet these requirements. PG noted the Atlas tape requirements globally did not decrease as much as may have been anticipated.
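
As an illustration of the 90% criterion, the arithmetic can be sketched as below. This assumes "90% of uplift" means the original planned figure plus 90% of the difference between the updated request and that plan. The function name and the example numbers are hypothetical and are not taken from the pledge spreadsheet.

```python
def pledge_at_uplift_fraction(original_plan, updated_request, fraction=0.9):
    """Return the original plan plus a fraction of the uplift to the updated request.

    Assumed interpretation of "90% of uplift"; not confirmed by the minutes.
    """
    uplift = updated_request - original_plan
    return original_plan + fraction * uplift


# Hypothetical figures in arbitrary units (not real pledge data).
example_pledge = pledge_at_uplift_fraction(100.0, 120.0)
print(f"Pledge at 90% of uplift: {example_pledge:.1f}")
```

Under this reading, a site meets the criterion if its deliverable capacity is at least the value returned for its own plan and request figures.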

2. Quarterly reporting
======================
PG noted reports are being received later each quarter – we are already approaching Q3 reporting times and a second reminder was circulated this morning. It was agreed this needs to be addressed, though mitigating factors were recognised, e.g. JC requires receipt of all Tier2 reports and the August holiday is also a contributory factor. It was agreed reporting needs to be given higher priority in schedules to meet quarterly deadlines. PG will restate the schedule for quarterly reports to all contributors to clarify timescales – this can be raised at the next PMB to reinforce timescales and content. Members will submit outstanding reports asap.

3. AOCB
=======
a) It was agreed GridPP39 was successful. The good range of activities being reported on was very encouraging, and their presentation to Anthony Davenport and Mark Wilkinson was extremely helpful. Holloway picked up on DB’s reminder that Tier2s will continue to shrink re their role in Atlas – this was previously understood; the only change was to the allocation of Tier2 resources and the continued use of metrics where it makes sense to do so, as discussed at GridPP38. No negative comments were received and no issues were raised.

4. Standing Items
=================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
First technical meeting post-summer is due to take place on Friday – DC will report next week.

SI-1 Dissemination Report (SL)
------------------------------
SL asked for feedback on whether payments are going through for Janus. DC is chasing this up and will confirm. The work is due to commence as soon as funds for the funded fraction begin to be received – it would be useful to discuss the parameters of the tasks/milestones required. PG should be involved in identifying how the work is monitored over the project, and the groups of people Janus should be working with should be pro-actively identified. It is possible some Dirac-related support may be required for NA62 on a relatively short timescale. The current plan is for him to work on some LZ, Solid data transfer and other projects where we have had direct contact. The NA62 work will involve Manchester and possibly Peter Love.

SI-2 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
RJ was not in attendance, no report submitted.

SI-3 CMS Weekly Review and Plans (DC)
-------------------------------------
Work on the CPU inefficiency is ongoing and active. The CMS at Home project led by Ivan is receiving real workloads and should be part of general production soon.

SI-4 LHCb Weekly Review and Plans (PC)
--------------------------------------
Nothing significant to report.

SI-5 Production Manager’s report (JC)
-------------------------------------
1. August A/R figures for Tier-2s are now final:

* ALICE: http://wlcg-sam.cern.ch/reports/2017/201708/wlcg/WLCG_All_Sites_ALICE_Aug2017.pdf

All okay.

* ATLAS: http://wlcg-sam.cern.ch/reports/2017/201708/wlcg/WLCG_All_Sites_ATLAS_Aug2017.pdf

All okay.

* CMS: http://wlcg-sam.cern.ch/reports/2017/201708/wlcg/WLCG_All_Sites_CMS_Aug2017.pdf

All okay.

* LHCb: http://wlcg-sam.cern.ch/reports/2017/201708/wlcg/WLCG_All_Sites_LHCB_Aug2017.pdf

Glasgow: 79%:79%.

Glasgow: The low availability for LHCb is associated with the site SRM (computational resources were 100% operational for that period). Intermittent failures were seen on the SRM for both the Argo and LHCb tests, but not CMS or Atlas. The failures coincide with the addition of some new storage.

2. CMS has been pushing on IPv6 availability at their sites. This is ahead of a WLCG operations coordination effort intended to direct all sites towards getting IPv6 readiness for 2018.

3. The future of the LFC is being raised as a potential issue for some of our supported VOs like T2K. We will continue discussions with them about other options (e.g. DFC).

4. There was a WLCG GDB at CERN on 13th September: https://indico.cern.ch/event/578990/. There was an update on the Community White Paper (https://indico.cern.ch/event/578990/contributions/2720743/attachments/1522280/2381911/CWP_Status_-_GDB_20170913.pdf) that may be of interest to many on the PMB.

5. Now running LHCb production MC in Docker and Singularity containers on VAC (since GridPP39 Andrew added CPU and memory cgroups wrappers around Singularity Containers).

SI-6 Tier-1 Manager’s Report (GS)
---------------------------------
Castor:
-------
– The high rate of SRM SAM failures for CMS stopped in mid-August although why this improvement came about is not understood.
– We have seen problems with the LHCb SRM SAM tests since the 8th September. However, LHCb are not seeing operational problems.
– The CMSDisk area in Castor became full over the weekend 9/10 September which caused SAM test failures. We will work with CMS to find the ‘dark data’ that is in Castor but not known to CMS and has contributed to this problem.
– Enabled access to Castor for the SOLID experiment.

Services:
---------
– The CVMFS Stratum-1 service (a 2-node High-Availability cluster) is now dual stack with IPv6.
– All remaining services have been migrated from our old Windows HyperV2008 infrastructure to hypervisors running HyperV2012. The HyperV2008 systems are being decommissioned.
– There was a problem with the LFC service. A new node was added to the alias but it was not configured to support dteam VO. This was corrected in response to a GGUS ticket.

Echo:
-----
– On Tuesday (15th August) a new version of XRootD was installed on the Echo gateways (v20170724-06470a6).
– All the 2015 capacity storage is now in Echo, giving it around 13.4 petabytes of raw space and approximately 10 PB of usable space. Problems have arisen while rebalancing the existing data across all the storage; rebalancing is needed before the additional capacity can be used by the VOs. Disk errors in the newly added hardware have exposed a bug in the Ceph erasure coding when back-filling. Furthermore, a problem in the Echo gateways, which created many threads when a Ceph placement group was unavailable (as in this case), required that the gateways be stopped while each problem was fixed. Although this is now resolved, around 22,000 Atlas files (of which 3285 were unique) were lost when one Ceph placement group was lost. Since this incident, a better understanding of Ceph problems within Echo has provided stability. However, careful management of the disks in servers recently added to Echo has been needed, which has slowed the rate at which the new hardware is being brought into full use.
– There has been a successful test transfer of CMS data into Echo using PhEDEx.
– Discussions are underway with ALICE about using Echo.
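
The gap between the raw and usable space quoted for Echo above is the kind of overhead erasure coding would be expected to produce. The following is a minimal sketch, assuming a k+m erasure-coded pool in which each object is stored as k data chunks plus m coding chunks. The minutes do not state Echo's actual profile, so the k and m values below are purely hypothetical, and real usable space is also affected by other overheads not modelled here.

```python
def usable_fraction(k, m):
    """Fraction of raw space holding user data in a k+m erasure-coded pool."""
    return k / (k + m)


def usable_capacity_pb(raw_pb, k, m):
    """Approximate usable capacity in PB, ignoring all non-EC overheads."""
    return raw_pb * usable_fraction(k, m)


raw_pb = 13.4            # raw capacity quoted in the minutes
quoted_usable_pb = 10.0  # approximate usable capacity quoted in the minutes

# Ratio implied directly by the two quoted figures.
print(f"Implied usable/raw ratio: {quoted_usable_pb / raw_pb:.2f}")

# Hypothetical profile for illustration only (not stated in the minutes).
k, m = 8, 3
print(f"Usable space for a {k}+{m} profile: {usable_capacity_pb(raw_pb, k, m):.1f} PB")
```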

Networking:
-----------
– On the morning of 16th August one of the three links that make up the OPN connection was moved to a new circuit that uses a different route, improving the resilience of the overall OPN link.
– It had been noted that for some time, although we were transferring data inbound over all three links that make up the OPN connection, we were only sending data over two. On Tuesday (22nd August) the OPN router was rebooted to pick up a parameter change that corrected this.
– There have been three cases where the network link to one of the batches of worker nodes (Dell ’16) has dropped out and needed resetting. Firmware was updated in one of the switches last week (19th Sep) to try and fix this.

Hardware:
---------
– 2016 capacity CPU in production.

The availability figures for August 2017, added here for completeness, were:
Alice: 100%
Atlas: 99%
CMS: 95%
LHCb: 100%
OPS: 99.6%
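
To put these percentages in context, the sketch below converts an availability figure into approximate hours of unavailability. It assumes availability is a simple fraction of the 744 hours in August; the actual WLCG calculation may treat periods of unknown status differently, so the results are indicative only. The function name is hypothetical.

```python
HOURS_IN_AUGUST = 31 * 24  # 744 hours


def unavailable_hours(availability_percent, total_hours=HOURS_IN_AUGUST):
    """Approximate hours unavailable, assuming availability is a plain time fraction."""
    return total_hours * (1.0 - availability_percent / 100.0)


# Availability figures for August 2017 as listed in the minutes.
august_availability = {
    "Alice": 100.0,
    "Atlas": 99.0,
    "CMS": 95.0,
    "LHCb": 100.0,
    "OPS": 99.6,
}

for name, percent in august_availability.items():
    print(f"{name}: about {unavailable_hours(percent):.1f} hours unavailable")
```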

Comments:
Atlas – There was a problem with Atlas Castor during the afternoon / early evening of Thursday 10th August. Atlas Castor was restarted and the problem went away during the evening. However, the cause is not understood. We have previously had some problems with the Atlas Castor SRMs but the symptoms of this failure appeared different to those.
CMS – There was a steady rate of randomly timed SRM test failures, caused by timeouts, up to the 14th August, with significantly better results after this. However, we do not know why the problems went away on that date. These failures had been going on for some time leading up to August too. In addition there were particular problems on 24/25 June when CMSDisk became full, causing the tests to fail. CMS stability had been poor for several months but improved in August at the same time as Andrew Lahiff’s stress testing of Castor; this was re-tested and the timing seems to be completely coincidental.

A replacement for GS will be in place by mid-October and another staff replacement has been confirmed.
AS confirmed the draft JD for the Tier1 manager is ready to be advertised. The Security post is being dealt with.

SI-7 LCG Management Board Report of Issues (DB)
-----------------------------------------------
There has not been an MB meeting.

SI-8 External Contexts (PC)
---------------------------
PC has received the referee reports for UKT0 bids and is dealing with responses to resulting questions along with others, including AS. The deadline is 3 October, but PC will attempt to respond by Friday due to other commitments. PC will send to DB to review and make comment.

REVIEW OF ACTIONS
=================
638.2: AS will check when equipment is due to become obsolete and investigate the legal and manpower implications of donation to the African Data Centre for Bioinformatics and Medical Research. (Update: AS is looking into how this may impact the Global Challenges Research Fund – GCRF – which would involve a cross-Council bid). Ongoing.
641.1: AS will email Tony Medland asking whether a decision can be advised before September on the availability of extra STFC funding before the procurement submission deadline in August. Done.
642.4: RJ will consider who to give a talk on LHC and CERN current status. Done.
643.1: DB will go over the Network Forward Look document to make corrections and clarify where necessary. Done.
644.1: DB will discuss Consolidated Grant funding and resources with Tony Medland. Ongoing.
644.2: PG and AS will document plans and costings for the remainder of GridPP5 taking account of the Oracle tape issues experienced. Ongoing.
644.3: AS will put together a starting plan for staff ramp-down. Ongoing.
644.4: AS will progress capture of funds for Dirac with Mark Wilkinson. Ongoing.
644.5: DB will respond to Will Venters confirming the PMB’s agreement to participate in the LSE project. Done.
644.6: (during GridPP39 meeting) DB will ascertain what Biomed do and how they use the Grid. (Update: Holloway, then Manchester and Imperial, are the biggest representatives – DC and AM will investigate this and ascertain whether our input is being explicitly credited). Ongoing.

Additional points – DC was looking at site availability and reliability: for the last quarter the RAL figures were significantly lower than those of other sites, and JC confirmed LHCb saw the same thing. This needs to be investigated and further discussed at a later PMB meeting.

Next Monday DB is flying to CERN, PG will chair.

9 October – DB also may not be available and PG may chair.

ACTIONS AS OF 25.09.17
======================
638.2: AS will check when equipment is due to become obsolete and investigate the legal and manpower implications of donation to the African Data Centre for Bioinformatics and Medical Research. (Update: AS is looking into how this may impact the Global Challenges Research Fund – GCRF – which would involve a cross-Council bid). Ongoing.
644.1: DB will discuss Consolidated Grant funding and resources with Tony Medland. Ongoing.
644.2: PG and AS will document plans and costings for the remainder of GridPP5 taking account of the Oracle tape issues experienced. Ongoing.
644.3: AS will put together a starting plan for staff ramp-down. Ongoing.
644.4: AS will progress capture of funds for Dirac with Mark Wilkinson. Ongoing.
644.6: (during GridPP39 meeting) DB will ascertain what Biomed do and how they use the Grid. (Update: Holloway, then Manchester and Imperial, are the biggest representatives – DC and AM will investigate this and ascertain whether our input is being explicitly credited). Ongoing.