Difference between revisions of "Tier1 Operations Report 2019-01-28"

Latest revision as of 12:56, 30 January 2019

RAL Tier1 Operations Report for 28th January 2019

Review of Issues during the week 21st January 2019 to the 28th January 2019.

CPU efficiencies are looking bad for ATLAS and CMS. This appears to be a global problem for CMS (i.e. all sites have very poor efficiency). The Liaisons have been tasked to investigate.
Some CMS GridFTP errors occurred due to an “Address already in use” problem. New hardware had been put into production without the fix that had been applied to the old machines. This was quickly resolved (intermittent errors for ~24 hours).
A disk server in Castor for LHCb ran into problems over the weekend and had to be removed from production while the disk array is being rebuilt. Some LHCb files are temporarily unavailable (although they are in Echo so if the LHCb fail-over mechanism is working, there should be no failed jobs!).
CMS submitted a GGUS ticket over the weekend due to intermittent SAM failures connecting to Castor. Under investigation. UPDATE 29/1/2019: This was resolved PM 28/1/2019.

Current operational status and issues

NTR

Resolved Castor Disk Server Issues

Machine	VO	DiskPool	dxtx	Comments
gdss739	LHCb	LHCb_FAILOVER,LHCb-Disk	d1t0	-

Ongoing Castor Disk Server Issues

Machine	VO	DiskPool	dxtx	Comments
-	-	-	-	-

Limits on concurrent batch system jobs.

ALICE - 1000

Notable Changes made since the last meeting.

NTR

Entries in GOC DB starting since the last report.

Service	ID	Scheduled?	Outage/At Risk	Start	End	Duration	Reason
-	-	-	-	-	-	-	-

Declared in the GOC DB

Service	Scheduled?	Outage/At Risk	Start	End	Duration	Reason
-	-	-	-	-	-	-

No ongoing downtime

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

DNS servers will be rolled out within the Tier1 network.

Open GGUS Tickets (Snapshot taken during morning of the meeting).

Request id	Affected vo	Status	Priority	Date of creation	Last update	Type of problem	Subject	Scope
139380	cms	in progress	urgent	29/01/2019	30/01/2019	CMS_Facilities	T1_UK_RAL failing SAM tests inside Singularity	WLCG
139375	atlas	in progress	urgent	29/01/2019	29/01/2019	Other	RAL-LCG2 transfers fail with "the server responded with an error 500"	WLCG
139306	dteam	in progress	less urgent	24/01/2019	29/01/2019	Monitoring	perfsonar hosts need updating	EGI
138891	ops	on hold	less urgent	17/12/2018	16/01/2019	Operations	[Rod Dashboard] Issue detected : egi.eu.lowAvailability-/RAL-LCG2@RAL-LCG2_Availability	EGI
138665	mice	in progress	urgent	04/12/2018	11/01/2019	Middleware	Problem accessing LFC at RAL	EGI
138500	cms	in progress	urgent	26/11/2018	28/01/2019	CMS_Data Transfers	Transfers failing from T2_PL_Swierk to RAL	WLCG
138361	t2k.org	in progress	less urgent	19/11/2018	28/01/2019	Other	RAL-LCG2: t2k.org LFC to DFC transition	EGI
138033	atlas	in progress	urgent	01/11/2018	25/01/2019	Other	singularity jobs failing at RAL	EGI
137897	enmr.eu	waiting for reply	urgent	23/10/2018	29/01/2019	Workload Management	enmr.eu accounting at RAL	EGI

GGUS Tickets Closed Last week

Request id	Affected vo	Status	Priority	Date of creation	Last update	Type of problem	Subject	Scope
139328	cms	solved	urgent	25/01/2019	29/01/2019	CMS_Facilities	T1_UK_RAL SRM tests failing	WLCG
139312	cms	solved	urgent	25/01/2019	29/01/2019	CMS_Data Transfers	Corrupted files at RAL_Buffer?	WLCG
139302	atlas	solved	urgent	24/01/2019	25/01/2019	File Transfer	RAL: transfer issues between BNL and UK due to a wrong DNS alias?	WLCG
139245	cms	solved	urgent	21/01/2019	21/01/2019	CMS_Data Transfers	Transfers failing from CNAF_Disk to RAL_Buffer	WLCG
139210	cms	solved	urgent	17/01/2019	25/01/2019	CMS_Data Transfers	Transfers failing from CSCS to UCL - issue with RAL FTS	WLCG
139209	cms	solved	urgent	17/01/2019	25/01/2019	CMS_AAA WAN Access	file open error at RAL	WLCG
139108	ops	verified	less urgent	09/01/2019	21/01/2019	Operations	[Rod Dashboard] Issue detected : org.nordugrid.ARC-CE-ARIS@arc-ce04.gridpp.rl.ac.uk	EGI

Availability Report

Day	Atlas	Atlas-Echo	CMS	LHCB	Alice	OPS	Comments
2019-01-22	100	100	93	100	100	0	Ref GGUS#138891
2019-01-23	100	100	92	100	100	-1	Ref GGUS#138891
2019-01-24	100	100	100	100	100	-1	Ref GGUS#138891
2019-01-25	100	100	75	100	100	0	Ref GGUS#138891
2019-01-26	100	100	42	100	100	0	Ref GGUS#138891
2019-01-27	100	100	57	100	100	0	Ref GGUS#138891
2019-01-28	100	100	72	100	100	0	Ref GGUS#138891
2019-01-29	100	100	97	100	100	-1	Ref GGUS#138891
2019-01-30	100	100	99	100	100	-1	Ref GGUS#138891

HammerCloud Test Report

Target Availability for each site is 97.0%

Red <90%

Orange <97%

Day	Atlas HC	CMS HC
2019-01-23	100	98
2019-01-24	100	98
2019-01-25	100	98
2019-01-26	100	91
2019-01-27	100	97
2019-01-28	100	93
2019-01-29	100	98

Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud
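
The colour bands above can be applied to the CMS HC column mechanically. The following is a minimal sketch, not the tooling used to produce this report: the function name `classify` and the label "ok" for scores at or above target are assumptions made for illustration, while the thresholds and scores are copied from this section.

```python
# Thresholds as stated in the report: red below 90%, orange below 97% (the target).
RED_BELOW = 90.0
ORANGE_BELOW = 97.0


def classify(score: float) -> str:
    """Return the colour band for a HammerCloud score given in percent."""
    if score < RED_BELOW:
        return "red"
    if score < ORANGE_BELOW:
        return "orange"
    # "ok" is a hypothetical label; the report only names red and orange.
    return "ok"


# CMS HC column copied from the table in this section.
cms_hc = {
    "2019-01-23": 98,
    "2019-01-24": 98,
    "2019-01-25": 98,
    "2019-01-26": 91,
    "2019-01-27": 97,
    "2019-01-28": 93,
    "2019-01-29": 98,
}

# Keep only the days that fall below the 97% target.
flagged = {day: classify(score) for day, score in cms_hc.items()
           if classify(score) != "ok"}
print(flagged)
```

Under these thresholds a score of exactly 97 meets the target, so only 2019-01-26 and 2019-01-28 are flagged, both as orange.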

Notes from Meeting.

