Tier1 Operations Report 2018-08-27

RAL Tier1 Operations Report for 27th August 2018

Review of Issues during the week 20th August to the 27th August 2018.

There were two major IPv6 outages this week, which impacted GOCDB and FTS (primarily affecting ATLAS). The first started on Thursday evening (~17:30 23/8/18) and was not resolved until ~22:00 24/8/18. The same fault recurred at ~15:20 25/8/18 and was finally resolved at ~22:00 that evening. DS identified a problem on router5 where the IPv6 routes were not being advertised properly and has added a static route to fix the issue.

Current operational status and issues

NTR

Resolved Castor Disk Server Issues

Machine	VO	DiskPool	dxtx	Comments
gdss771	LHCb	lhcbDst	d1t0	Back in production RO
gdss732	LHCb	lhcbDst	d1t0	Back in production RO

Ongoing Castor Disk Server Issues

Machine	VO	DiskPool	dxtx	Comments
gdss747	Atlas	atlasStripInput	d1t0	Currently in intervention.
gdss738	LHCb	lhcbDst	d1t0	Currently in intervention.

Limits on concurrent batch system jobs.

GROUP_CMS_LIMIT = 4000
GROUP_ATLAS_LIMIT = 8000

Notable Changes made since the last meeting.

None.

Entries in GOC DB starting since the last report.

ID	Service	Scheduled?	Outage/At Risk	Start	End	Duration	Reason
25905	FTS	UNSCHEDULED	OUTAGE	24/8/18 14:30	24/8/18 18:45	-	IPv6 outage impacting FTS services
25912	GOCDB	UNSCHEDULED	OUTAGE	25/8/18 18:50	25/8/18 19:46	-	Ongoing IPv6 outage impacting GOCDB
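
The Duration column above is left as "-". As a minimal sketch, the length of each downtime can be derived from the Start and End columns. The names `ENTRIES` and `downtime_minutes` are illustrative only and are not part of any GOC DB tooling; the timestamps are copied from the two entries above.

```python
from datetime import datetime

# Start/end times copied from the two GOC DB entries above (format: d/m/yy HH:MM).
ENTRIES = [
    ("25905", "FTS", "24/8/18 14:30", "24/8/18 18:45"),
    ("25912", "GOCDB", "25/8/18 18:50", "25/8/18 19:46"),
]

FMT = "%d/%m/%y %H:%M"


def downtime_minutes(start: str, end: str) -> int:
    """Return the length of a downtime in whole minutes."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


for entry_id, service, start, end in ENTRIES:
    minutes = downtime_minutes(start, end)
    # Print each entry as hours and minutes, e.g. "4h 15m".
    print(f"{entry_id} {service}: {minutes // 60}h {minutes % 60:02d}m")
```

For the entries listed, this gives 4h 15m for the FTS outage and 0h 56m for the GOCDB outage.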

Declared in the GOC DB

ID	Service	Scheduled?	Outage/At Risk	Start	End	Duration	Reason
-	-	-	-	-	-	-	-

No ongoing downtime
No downtime scheduled in the GOCDB for next 2 weeks

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

Castor:
- Update systems to use SL7 and configured by Quattor/Aquilon. (Tape servers done)
- Move to generic Castor headnodes.
Internal:
- DNS servers will be rolled out within the Tier1 network.

Open GGUS Tickets (Snapshot taken during morning of the meeting).

Request id	Affected vo	Status	Priority	Date of creation	Last update	Type of problem	Subject	Scope
136942	t2k.org	in progress	less urgent	29/08/2018	30/08/2018	File Transfer	Copying ONLINE_AND_NEARLINE files from RAL tape storage times out	EGI
136884	t2k.org	in progress	top priority	27/08/2018	29/08/2018	Data Management - generic	lcg-cr not working for t2k vo	EGI
136840	snoplus.snolab.ca	in progress	urgent	23/08/2018	29/08/2018	Other	Cannot upload files to LFN from Storage node	EGI
136757	mice	in progress	less urgent	17/08/2018	21/08/2018	Other	Missing lsc files for mice VO on lfc.gridpp.rl.ac.uk ?	EGI
136701	lhcb	in progress	very urgent	14/08/2018	24/08/2018	File Transfer	background of transfer errors	WLCG
136366	mice	in progress	less urgent	25/07/2018	20/08/2018	Local Batch System	Remove MICE Queue from RAL T1 Batch	EGI
136199	lhcb	in progress	very urgent	18/07/2018	07/08/2018	File Transfer	Lots of submitted transfers on RAL FTS	WLCG
136028	cms	in progress	top priority	10/07/2018	29/08/2018	CMS_AAA WAN Access	Issues reading files at T1_UK_RAL_Disk	WLCG
124876	ops	in progress	less urgent	07/11/2016	23/07/2018	Operations	[Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk	EGI

GGUS Tickets Closed Last week

Request id	Affected vo	Status	Priority	Date of creation	Last update	Type of problem	Subject	Scope
136798	lhcb	verified	very urgent	21/08/2018	21/08/2018	File Transfer	the Request Handler does not allow forcing of the request UUID	WLCG
136665	cms	solved	urgent	11/08/2018	21/08/2018	CMS_SAM tests	T1_UK_RAL is down > 12h	WLCG
136358	cms	solved	urgent	25/07/2018	21/08/2018	CMS_Facilities	T1_UK_RAL WN-xrootd-access failure	WLCG
134685	dteam	closed	less urgent	23/04/2018	22/08/2018	Middleware	please upgrade perfsonar host(s) at RAL-LCG2 to CentOS7	EGI

Availability Report

Target Availability for each site is 97.0%

Red <90%

Orange <97%

Day	Atlas	Atlas-Echo	CMS	LHCb	Alice	OPS
2018-08-20	100	100	100	100	100	100
2018-08-21	100	100	100	100	100	100
2018-08-22	100	100	100	100	100	100
2018-08-23	100	100	98	100	100	100
2018-08-24	100	100	99	100	100	100
2018-08-25	100	100	100	100	100	100
2018-08-26	100	100	100	100	100	100
2018-08-27	100	100	100	100	100	100
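
The colour thresholds stated above can be applied mechanically to the daily figures. The following is a rough sketch of that rule, assuming availability is a percentage; the function name `colour` and the label "ok" for values at or above target are assumptions made for illustration, not terms from the report.

```python
TARGET = 97.0     # target availability (%), as stated in the report
RED_BELOW = 90.0  # "Red <90%"


def colour(availability: float) -> str:
    """Classify a daily availability percentage using the report's thresholds."""
    if availability < RED_BELOW:
        return "red"
    if availability < TARGET:
        return "orange"
    return "ok"  # assumed label for values at or above target


# CMS column from the table above, 2018-08-20 to 2018-08-27.
cms = [100, 100, 100, 98, 99, 100, 100, 100]
print([colour(value) for value in cms])
```

By this rule, the two CMS dips to 98 and 99 remain at or above the 97.0% target.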

Hammercloud Test Report

Target Availability for each site is 97.0%

Red <90%

Orange <97%

Day	Atlas HC	CMS HC
2018-08-13	0	0
2018-08-14	0	0
2018-08-15	0	0
2018-08-16	0	0
2018-08-17	76	60
2018-08-18	100	100
2018-08-19	100	100
2018-08-20	100	100

Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud

Notes from Meeting.

The recent problems with Echo were discussed. A summary is also in the Operations report. Several points:
- The current situation is that we have “production” access running.
- Atlas and CMS have upper limits on batch jobs about equal to pledge.
- The additional memory for the Dell storage nodes has arrived. This will be added, at least initially, in a rolling upgrade. However, this is expected to take up to about 2 weeks, at which point we expect to be able to give full access to Echo. Discussions on the best way to do the memory upgrades are ongoing.
- Before the problem we had noted that the LHCb files came from Castor in a “bursty” way. It was suggested last week that we limit the FTS to smooth out this burstiness.
- When discussing GGUS tickets: LHCb are seeing some file transfer failures between worker nodes and Castor. This is to be escalated.
- The requirement for us to be able to easily contact all users of Echo in the event of a problem was noted.
