GridPP PMB Meeting 823

Present: David Britton (Chair), Andrew Sansum, David Colling, Davide Costanzo, Alastair Dewhurst, Tony Doyle, Katy Ellis, David Kelsey, Peter Gronbech, Jonathan Hays, Roger Jones, Andrew McNab, Sam Skipsey (Minutes), Jill Sambrook (Minutes)

Apologies: Tony Cass, Steve Lloyd, Peter Clarke

ITEMS

1) Update on RAL Network State [AD]

-AD sent an email on the 21st of June with an update on the RAL network.
-The weekend’s network downtime (MacroSegmentation) went extremely well and the RAL network has proven to be pretty stable since.
-There were some minor issues, including a 4-5 hour period of DNS errors. This seemed to impact Tier 1 more than Tier 2.
-Some other minor issues were encountered, but overall the change was a big success and has unblocked a lot of outstanding work and issues.
-AD will look into the DNS issues further; they are not yet understood.
-DB congratulated AD and AS on the planning and success of the downtime. It is great news that the team got through it with only a few issues.
-The network is now a lot more secure and nothing major is planned for now.
-AS confirmed that no further physical engineering work or changes are planned for Macro Segmentation; that part is complete. The next stage, over the next 3 months, is to implement the macro segmentation itself, starting with gathering traffic flow information. Some of this ongoing work may cause blips along the way. Further work is also planned 18-24 months down the line.
-AS stressed the need to factorise the network so that future updates and changes do not cause such disruption.

2) Status of Rob’s work on LHCb tickets [AD]

-AD presented some slides to the group.
LHCb “Vector Reads”.
⦁ Problem has always been that LHCb jobs that use direct I/O to Echo have higher failure rate.
General lack of error message (socket timeout)
⦁ On Friday Rob emailed round a 33 slide presentation.
Rob can finally replicate the problem on his (non-Ceph) test setup at ECDF
XRootD does not always cope well with recovering from networking events.
⦁ Three possible causes:
XRootD gateways are (rapidly, repeatedly) crashing
Overloaded virtual networking interface
Packet loss
What can be done?
⦁ Rob has made a number of suggestions regarding tests/logging/monitoring
We will attempt to perform all of these
⦁ He has suggested we upgrade Docker to the latest mainline version
We confirmed on Friday that we are already using it.
⦁ He has suggested running the gateways on CentOS8 to avoid known XRootD bugs in CentOS7.
This is straightforward to set up
Would need a period of testing as we can’t be sure everything else would “just work”.

DB commented that the work Rob has done is fantastic. This needs to be communicated to him and recognised. Formal thanks to Rob for all of this work.
DB also suggested we should ask Rob to come to the Wednesday Liaison meeting and add this as a standing item near the start of each meeting. This regular contact would be good and it would be great to keep communication going.
DB also suggested the need to rule out low-level packet loss. AD noted that this is not a problem on the new network (but might be a reason why things generally look more reliable there than on the old network?).

AM commented on the great work done by Rob, and by Alastair and his team, on all of these issues, but highlighted how much effort this has required.
AD highlighted that Rob has helped make a lot of progress and that the LHCb Liaison post has been unfilled for some time. More effort is now needed in the RAL LHCb area.

3) GridPP48 Ambleside [SS]

32 people registered at the moment and numbers are continuing to rise the more we discuss the meeting.
Currently there is no registration deadline, but we will introduce a deadline of the 1st of August soon.
We have 1 confirmed interest in a family room and 1 for a double, if these are available.
Roger is going to speak with the venue again to discuss room options and prices.

The focus of the meeting will be forward-looking: the future and GridPP7. We need to start prodding people for talks. Agenda items should be sent to Sam, please.
Roger confirmed a dinner for the collaboration meeting has now been booked for 50 people on the Thursday.

STANDING ITEMS

SI-0 Tier-1 Manager’s weekly report & Technical Meetings [AD]
Some updates discussed above. Additional information:
⦁ AD recruited Technical Meeting hosts + conveners
DC happy to chair Rucio meeting.
SS/Matt D/Duncan E happy to chair perfSONAR meeting.
AD to look into New OS meeting.
⦁ Procurements have been a little challenging.
CPU procurement has been a real issue, due to component inventory fluctuations.
Quite a significant budget still remains (£1.5m) and so working hard to resolve this.
DB suggested that storage procurement might end up being a slight priority over CPU. We can perhaps adjust the balance of spend slightly. AD and DB to have a half hour discussion about this on Friday afternoon if required.

SI-1 ATLAS Weekly Review and Plans [DCos]
There was a short-term issue with database connectivity to CERN, meaning we couldn’t run jobs in the UK. Connectivity was restored pretty quickly.
File loss reported at RHUL – being managed by the established processes. SS used this as a chance to produce the OSC-requested notes on such processes.
There is some downtime planned at Oxford and upgrades at Manchester this week.
Business as usual.

SI-2 CMS Weekly Review and Plans [DC/KE]
KE was at CERN last week for computing face to face meetings.

New monitoring written for Disk deletions. Going to share with Tier 1 this week. JW suggested moving from gsiftp -> webdav, which seems to have helped.

RAL have a masterclass next Monday and KE will be giving a computing talk at this event.

Writing PPD annual status report.

A few file transfers with some missing files reported. Investigating.

SI-3 LHCb Weekly Review and Plans [AM]
Nothing much to report beyond things mentioned at the beginning of the meeting.
There was an LHCb update at the weekend (workload system was restarted), and everything is back to normal again.

SI-4 Operations Meeting Report [SS,PG,PC]
Moving to a fortnightly meeting for the summer.

SI-5 LCG Management Board Report of Issues [DB]
2 LHCb tickets were raised which were passed on to Alastair.
Follow up on CERN council meeting from June
Presentation from Roger – Lancaster bid for WLCG Workshop on 7-9 November. Possible Rucio workshop Thursday and Friday (10-11) following.

SI-6 External Contexts (eg NGI/EGI) [PC/JH]
NTR

REVIEW OF ACTIONS
800.5 AD – Arrange in person DIRAC/Rucio meeting at IC (Jan22). [on-going] – Plans to re-establish virtual meetings and then set up an in person meeting. DC to try and arrange a date.

818.5 – DB/SS Group to formalise VOs to be added to the approved list. Ongoing

822.1 – AD to provide update about sponsorship for GridPP48 Ambleside.
No progress yet; AD will update.

822.2 – AM to send DB the ticket number for ongoing issues for LHCb group with a 1 paragraph summary. DB to raise at Liaison meeting on Wednesday (will send an update prior) and speak with AD to ensure there is some action on this – Complete

823.1 – AD to invite Rob to the Wednesday Liaison meeting to discuss LHCb tickets, and to keep this as a standing item at the start of the agenda for a few weeks.