RAL Tier1 Operations Report for 25th February 2019
Review of Issues during the week 18th February 2019 to 25th February 2019.

- Castor disk servers were physically moved to make room for new procurements. This was done in a rolling manner during Tuesday 19th February. The data on a disk server was unavailable while it was being moved, but relatively little impact on LHCb was observed.
- We have had two Castor disk server crashes since the move: gdss776 and gdss783, both lhcbDst disk servers.
- All but one of the ARC CEs have been upgraded. We are observing significantly less load on the machines; we believe this load was the cause of most of the other issues observed.
- CPU efficiency is improving. There is an ongoing discussion with CMS regarding some of their jobs: their most problematic jobs involve large input files (> 10 GB) with metadata spread across the file, which CMS is working on. (A sketch of the efficiency calculation follows below.)
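For reference, the CPU efficiency figure is the ratio of CPU time consumed to wall-clock time scaled by the number of cores a job occupies; jobs that stall on I/O (such as the large-input-file CMS jobs above) drag it down. A minimal sketch in Python, assuming hypothetical per-job accounting fields (cpu_seconds, wall_seconds, cores) rather than any actual accounting schema:

 # Sketch: aggregate CPU efficiency over a set of batch jobs.
 # The job records are hypothetical; real accounting data has its own schema.
 jobs = [
     {"cpu_seconds": 84000, "wall_seconds": 90000, "cores": 1},   # efficient single-core job
     {"cpu_seconds": 120000, "wall_seconds": 50000, "cores": 8},  # multicore job stalled on I/O
 ]

 def cpu_efficiency(jobs):
     """Total CPU time divided by total core-scaled wall time."""
     cpu = sum(j["cpu_seconds"] for j in jobs)
     wall = sum(j["wall_seconds"] * j["cores"] for j in jobs)
     return cpu / wall if wall else 0.0

 print(f"CPU efficiency: {cpu_efficiency(jobs):.1%}")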
Current operational status and issues
Resolved Castor Disk Server Issues

Machine | VO | DiskPool | dxtx | Comments
gdss811 | LHCb | lhcbDst | d1t0 | -
Ongoing Castor Disk Server Issues

Machine | VO | DiskPool | dxtx | Comments
gdss776 | LHCb | lhcbDst | d1t0 | -
Limits on concurrent batch system jobs.

Notable Changes made since the last meeting.
Entries in GOC DB starting since the last report.

Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
ARC-CE | 26729 | Yes | Outage | 12/2/2019 16:30 | 19/2/2019 14:00 | 7 Days | ARC-CE draining and upgrade to ARC v5.4.3

Declared in the GOC DB (none currently):

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
- | - | - | - | - | - | -
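Downtime entries like these can also be retrieved programmatically from the GOC DB programmatic interface. A minimal sketch, assuming the public get_downtime method of the GOCDB PI and its topentity/ongoing_only parameters (check the current GOCDB PI documentation before relying on these):

 # Sketch: list ongoing RAL-LCG2 downtimes via the public GOCDB PI.
 # Endpoint, method and XML element names are assumptions from the public docs.
 import urllib.request
 import xml.etree.ElementTree as ET

 URL = ("https://goc.egi.eu/gocdbpi/public/"
        "?method=get_downtime&topentity=RAL-LCG2&ongoing_only=yes")

 with urllib.request.urlopen(URL) as resp:
     root = ET.parse(resp).getroot()

 for dt in root.findall("DOWNTIME"):
     print(dt.findtext("SEVERITY"), dt.findtext("START_DATE"),
           dt.findtext("DESCRIPTION"))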
Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:
- DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot taken during morning of the meeting).
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope
139723 | atlas | waiting for reply | less urgent | 15/02/2019 | 18/02/2019 | Data Management - generic | permissions on scratchdisk | EGI
139672 | other | in progress | less urgent | 13/02/2019 | 19/02/2019 | Middleware | No LIGO pilots running at RAL | EGI
139639 | cms | waiting for reply | very urgent | 12/02/2019 | 18/02/2019 | CMS_AAA WAN Access | file open error at RAL | WLCG
139476 | mice | in progress | less urgent | 01/02/2019 | 06/02/2019 | Other | LFC dump | EGI
139306 | dteam | in progress | less urgent | 24/01/2019 | 15/02/2019 | Monitoring | perfsonar hosts need updating | EGI
138665 | mice | on hold | urgent | 04/12/2018 | 30/01/2019 | Middleware | Problem accessing LFC at RAL | EGI
138500 | cms | in progress | urgent | 26/11/2018 | 12/02/2019 | CMS_Data Transfers | Transfers failing from T2_PL_Swierk to RAL | WLCG
138361 | t2k.org | in progress | less urgent | 19/11/2018 | 31/01/2019 | Other | RAL-LCG2: t2k.org LFC to DFC transition | EGI
138033 | atlas | in progress | urgent | 01/11/2018 | 31/01/2019 | Other | singularity jobs failing at RAL | EGI
137897 | enmr.eu | on hold | urgent | 23/10/2018 | 31/01/2019 | Workload Management | enmr.eu accounting at RAL | EGI
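Individual tickets above can be viewed in the GGUS web interface by ID. A small sketch, assuming the ticket_info URL pattern of ggus.eu (verify against the current GGUS interface):

 # Sketch: build GGUS web links for the open ticket IDs listed above.
 # The URL pattern is an assumption based on the GGUS web interface.
 open_tickets = [139723, 139672, 139639, 139476, 139306,
                 138665, 138500, 138361, 138033, 137897]

 for ticket_id in open_tickets:
     print(f"https://ggus.eu/?mode=ticket_info&ticket_id={ticket_id}")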
GGUS Tickets Closed Last week

Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope
139575 | cms | solved | urgent | 07/02/2019 | 13/02/2019 | CMS_AAA WAN Access | T1_UK_RAL SAM xrootd reads failing | WLCG
139477 | ops | verified | less urgent | 01/02/2019 | 13/02/2019 | Operations | [Rod Dashboard] Issue detected : org.nordugrid.ARC-CE-submit-ops@arc-ce04.gridpp.rl.ac.uk | EGI
139380 | cms | closed | urgent | 29/01/2019 | 14/02/2019 | CMS_Facilities | T1_UK_RAL failing SAM tests inside Singularity | WLCG
139375 | atlas | closed | urgent | 29/01/2019 | 18/02/2019 | Other | RAL-LCG2 transfers fail with "the server responded with an error 500" | WLCG
139328 | cms | closed | urgent | 25/01/2019 | 12/02/2019 | CMS_Facilities | T1_UK_RAL SRM tests failing | WLCG
139312 | cms | closed | urgent | 25/01/2019 | 12/02/2019 | CMS_Data Transfers | Corrupted files at RAL_Buffer? | WLCG
139245 | cms | closed | urgent | 21/01/2019 | 18/02/2019 | CMS_Data Transfers | Transfers failing from CNAF_Disk to RAL_Buffer | WLCG
138891 | ops | verified | less urgent | 17/12/2018 | 13/02/2019 | Operations | [Rod Dashboard] Issue detected : egi.eu.lowAvailability-/RAL-LCG2@RAL-LCG2_Availability | EGI
Availability Report

Day | Atlas | Atlas-Echo | CMS | LHCB | Alice | OPS | Comments
2019-02-12 | 100 | 100 | 100 | 100 | 96 | 92 |
2019-02-13 | 100 | 100 | 100 | 100 | 100 | 52 |
2019-02-14 | 100 | 100 | 99 | 100 | 93 | 100 |
2019-02-15 | 100 | 100 | 100 | 100 | 91 | 100 |
2019-02-16 | 100 | 100 | 97 | 100 | 87 | 100 |
2019-02-17 | 100 | 100 | 100 | 100 | 91 | 100 |
2019-02-18 | 100 | 100 | 100 | 100 | 84 | 100 |
2019-02-19 | 83 | 74 | 79 | 100 | 94 | 100 |

Target Availability for each site is 97.0%. Key: Red <90%; Orange <97%.
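The colour banding referred to in the key maps each daily availability figure to a band against the 97.0% target. A minimal sketch of that logic (thresholds taken from the key above; the function name is ours):

 # Sketch: map a daily availability percentage to the report's colour bands.
 # Thresholds follow the key: Red < 90%, Orange < 97%, otherwise on target.
 def availability_band(percent: float) -> str:
     if percent < 90.0:
         return "Red"
     if percent < 97.0:
         return "Orange"
     return "On target (>= 97.0%)"

 # Example using the Alice figures from the table above.
 for day, value in [("2019-02-16", 87), ("2019-02-18", 84), ("2019-02-19", 94)]:
     print(day, "Alice:", availability_band(value))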
HammerCloud Test Report

Day | Atlas HC | CMS HC | Comment
2019-01-23 | 100 | 98 |
2019-01-24 | 100 | 98 |
2019-01-25 | 100 | 98 |
2019-01-26 | 100 | 91 |
2019-01-27 | 100 | 97 |
2019-01-28 | 100 | 93 |
2019-01-29 | 100 | 98 |

Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud