Latest revision as of 10:08, 3 July 2019

RAL Tier1 Operations Report for 3rd July 2019

Review of Issues during the week 26th June 2019 to the 3rd July 2019.
  • The Frontier service went down over the weekend. The Tier-1’s interpretation is that ATLAS were probably doing “something silly” with their US HPCs. The Production Manager will find out when the disk replacement for the Oracle database took place and whether it went smoothly.
  • CMS job failures: there have been several periods of CMS SRM failures in the last two weeks, and the investigation is ongoing. HammerCloud (HC) jobs have had a failure rate of around 25% recently, and the new monitoring does not allow us to separate HC work from normal production. The HC failures could be due to problems accessing data at other sites (i.e. not our fault), or could be a problem at RAL.

There is a 45% job failure rate at RAL for production work, significantly higher than at the other Tier-1s; this is not just log-collect jobs but includes many normal production jobs. The error code is related to file reads. Is Echo caching the cause?


Current operational status and issues
Notable Changes made since the last meeting.
  • NTR
Entries in the GOC DB starting since the last report.

{| border=1 align=center
|- bgcolor="#7c8aaf"
! Service !! ID !! Scheduled? !! Outage/At Risk !! Start !! End !! Duration !! Reason
|-
| - || - || - || - || - || - || - || -
|}
Declared in the GOC DB

{| border=1 align=center
|- bgcolor="#7c8aaf"
! Service !! ID !! Scheduled? !! Outage/At Risk !! Start !! End !! Duration !! Reason
|-
| - || - || - || - || - || - || - || -
|}
  • No ongoing downtime
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • DNS servers will be rolled out within the Tier1 network.
Open

GGUS Tickets (snapshot taken during the morning of the meeting).

{| border=1 align=center
|- bgcolor="#7c8aaf"
! Ticket-ID !! Type !! VO !! Site !! Priority !! Responsible Unit !! Status !! Last Update !! Subject !! Scope
|-
| 141990 || USER || cms || RAL-LCG2 || urgent || NGI_UK || in progress || 2019-07-01 09:37:00 || Intermittent HC failures at T1_UK_RAL || WLCG
|-
| 141968 || USER || cms || RAL-LCG2 || very urgent || NGI_UK || in progress || 2019-06-28 18:19:00 || SAM (CE) and Hammer Cloud Failures at T1_UK_RAL || WLCG
|-
| 140870 || USER || t2k.org || RAL-LCG2 || less urgent || NGI_UK || in progress || 2019-06-27 13:48:00 || Files vanished from RAL tape? || EGI
|-
| 140447 || USER || dteam || RAL-LCG2 || less urgent || NGI_UK || on hold || 2019-05-22 14:20:00 || packet loss outbound from RAL-LCG2 over IPv6 || EGI
|-
| 140220 || USER || mice || RAL-LCG2 || less urgent || NGI_UK || waiting for reply || 2019-06-26 14:49:00 || mice LFC to DFC transition || EGI
|-
| 139672 || USER || other || RAL-LCG2 || urgent || NGI_UK || waiting for reply || 2019-06-17 08:24:00 || No LIGO pilots running at RAL || EGI
|}


GGUS Tickets Closed Last Week

{| border=1 align=center
|- bgcolor="#7c8aaf"
! Ticket-ID !! Type !! VO !! Site !! Priority !! Responsible Unit !! Status !! Last Update !! Subject !! Scope
|-
| 141901 || USER || cms || RAL-LCG2 || urgent || NGI_UK || solved || 2019-06-26 09:20:00 || T1_UK_RAL SRM is timing out || WLCG
|-
| 141872 || TEAM || lhcb || RAL-LCG2 || top priority || NGI_UK || verified || 2019-06-28 07:11:00 || srm-lhcb.gridpp.rl.ac.uk seems in a bad state (time out) || WLCG
|-
| 141838 || USER || cms || RAL-LCG2 || urgent || NGI_UK || solved || 2019-07-02 14:24:00 || Transfers failing from CERN Tape to RAL Disk || WLCG
|-
| 141704 || USER || cms || RAL-LCG2 || less urgent || NGI_UK || closed || 2019-06-27 23:59:00 || PhEDEx transfer 1799773 || WLCG
|-
| 141608 || USER || snoplus.snolab.ca || RAL-LCG2 || less urgent || NGI_UK || solved || 2019-07-02 09:27:00 || Permissions on RAL SE || EGI
|-
| 141462 || TEAM || lhcb || RAL-LCG2 || top priority || NGI_UK || verified || 2019-06-30 07:36:00 || Error: Connection limit exceeded || WLCG
|}

Availability Report

{| border=1 align=center
|- bgcolor="#7c8aaf"
! Day !! Atlas !! CMS !! LHCB !! Alice !! Comments
|-
| 2019-06-26 || 100 || 100 || 100 || 100 ||
|-
| 2019-06-27 || 100 || 100 || 100 || 100 ||
|-
| 2019-06-28 || 100 || 100 || 100 || 100 ||
|-
| 2019-06-29 || 100 || 100 || 100 || 100 ||
|-
| 2019-06-30 || 100 || 100 || 100 || 100 ||
|-
| 2019-07-01 || 100 || 100 || 100 || 100 ||
|-
| 2019-07-02 || 98 || 100 || 100 || 100 ||
|}
Hammercloud Test Report
Target availability for each site is 97.0% (Red: below 90%; Orange: below 97%).

{| border=1 align=center
|- bgcolor="#7c8aaf"
! Day !! Atlas HC !! CMS HC !! Comment
|-
| 2019-06-26 || 100 || 98 ||
|-
| 2019-06-27 || 100 || 76 ||
|-
| 2019-06-28 || 100 || 77 ||
|-
| 2019-06-29 || 100 || 76 ||
|-
| 2019-06-30 || 100 || 75 ||
|-
| 2019-07-01 || 100 || 77 ||
|-
| 2019-07-02 || 100 || 77 ||
|}
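The availability targets above can be applied mechanically. As a minimal sketch (the function name and data layout are illustrative, not part of the report's tooling; the figures are the CMS HC column from this report):

```python
# Classify daily availability figures against the targets stated above:
# target 97.0%, Orange below 97%, Red below 90%.
def band(availability: float) -> str:
    """Return the traffic-light band for an availability percentage."""
    if availability < 90.0:
        return "red"
    if availability < 97.0:
        return "orange"
    return "green"

# CMS HammerCloud results for the reporting week (from the table above).
cms_hc = {
    "2019-06-26": 98, "2019-06-27": 76, "2019-06-28": 77,
    "2019-06-29": 76, "2019-06-30": 75, "2019-07-01": 77,
    "2019-07-02": 77,
}
for day, pct in sorted(cms_hc.items()):
    print(day, pct, band(pct))
```

On this week's figures only 2019-06-26 meets the CMS HC target; the remaining days fall in the red band, consistent with the HC failures discussed in the review section.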



Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud

Notes from Meeting.