Difference between revisions of "Tier1 Operations Report 2019-06-24"


Latest revision as of 09:18, 26 June 2019

RAL Tier1 Operations Report for 24th June 2019

Review of Issues during the week 10th June 2019 to the 17th June 2019.
  • Scheduled optical replacement work on the Janet Core in London suggested that there could be a prolonged outage at RAL.
      • Additionally, there was concern that IPv6 might break and not fail over correctly (based on previous experience).
      • In the event, the outage was momentary and no services were impacted. Both IPv6 and IPv4 failovers worked correctly.
  • CMS CPU efficiencies are currently describing a veritable sine curve over a weekly period.
      • Investigations suggest a 100% failure of “log collection” jobs at RAL.
      • However, despite extensive investigation by the Tier-1 Liaison, no one seems to know what this job type does (other than the obvious), or who is actually responsible for the monitoring/processing of this job type as CMS
Current operational status and issues
Resolved Castor Disk Server Issues
Machine | VO | DiskPool | dxtx | Comments
- | - | - | - | -
Ongoing Castor Disk Server Issues
Machine | VO | DiskPool | dxtx | Comments
- | - | - | - | -
Limits on concurrent batch system jobs.
Notable Changes made since the last meeting.
  • NTR
Entries in GOC DB starting since the last report.
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
- | - | - | - | - | - | - | -
Declared in the GOC DB
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
- | - | - | - | - | - | - | -
  • No ongoing downtime
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot taken during morning of the meeting).

Ticket-ID | Type | VO | Site | Priority | Responsible Unit | Status | Last Update | Subject | Scope
141872 | TEAM | lhcb | RAL-LCG2 | top priority | NGI_UK | in progress | 2019-06-26 08:29:00 | srm-lhcb.gridpp.rl.ac.uk seems in a bad state (time out) | WLCG
141838 | USER | cms | RAL-LCG2 | urgent | NGI_UK | in progress | 2019-06-24 11:13:00 | Transfers failing from CERN Tape to RAL Disk | WLCG
141608 | USER | snoplus.snolab.ca | RAL-LCG2 | less urgent | NGI_UK | in progress | 2019-06-06 08:55:00 | Permissions on RAL SE | EGI
140870 | USER | t2k.org | RAL-LCG2 | less urgent | NGI_UK | in progress | 2019-06-20 14:35:00 | Files vanished from RAL tape? | EGI
140447 | USER | dteam | RAL-LCG2 | less urgent | NGI_UK | on hold | 2019-05-22 14:20:00 | packet loss outbound from RAL-LCG2 over IPv6 | EGI
140220 | USER | mice | RAL-LCG2 | less urgent | NGI_UK | in progress | 2019-06-25 13:03:00 | mice LFC to DFC transition | EGI
139672 | USER | other | RAL-LCG2 | urgent | NGI_UK | waiting for reply | 2019-06-17 08:24:00 | No LIGO pilots running at RAL | EGI
GGUS Tickets Closed Last week
Ticket-ID | Type | VO | Site | Priority | Responsible Unit | Status | Last Update | Subject | Scope
141901 | USER | cms | RAL-LCG2 | urgent | NGI_UK | solved | 2019-06-25 18:49:00 | T1_UK_RAL SRM is timing out | WLCG
141771 | USER | cms | RAL-LCG2 | urgent | NGI_UK | solved | 2019-06-24 14:00:00 | file read error at T1_UK_RAL | WLCG
141638 | USER | cms | RAL-LCG2 | urgent | NGI_UK | closed | 2019-06-25 23:59:00 | SAM XROOTD read failure at T1_UK_RAL | WLCG
141549 | TEAM | atlas | RAL-LCG2 | less urgent | NGI_UK | closed | 2019-06-25 23:59:00 | ATLAS-RAL-Frontier and some of Lpad-RAL-LCG2 squid degraded | WLCG
141537 | TEAM | lhcb | RAL-LCG2 | very urgent | NGI_UK | verified | 2019-06-25 12:52:00 | Pilots Failed at RAL-LCG2 | WLCG
141462 | TEAM | lhcb | RAL-LCG2 | top priority | NGI_UK | solved | 2019-06-25 15:52:00 | Error: Connection limit exceeded | WLCG


Availability Report

Day | Atlas | CMS | LHCB | Alice | Comments
2019-06-19 | 100 | 100 | 100 | 100 |
2019-06-20 | 100 | 86 | 100 | 100 |
2019-06-21 | 100 | 96 | 100 | 100 |
2019-06-22 | 100 | 22 | 100 | 100 |
2019-06-23 | 100 | 80 | 100 | 100 |
2019-06-24 | 100 | 95 | 91 | 93 |
2019-06-25 | 100 | 62 | 100 | 100 |
Hammercloud Test Report
Target Availability for each site is 97.0% (Red: <90%; Orange: <97%)
Day | Atlas HC | CMS HC | Comment
2019-06-19 | 100 | 98 |
2019-06-20 | 100 | 85 |
2019-06-21 | 0 | 93 |
2019-06-22 | 100 | 98 |
2019-06-23 | 100 | 98 |
2019-06-24 | 100 | 97 |
2019-06-25 | 100 | 97 |



Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud
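
For readers who want to apply the Hammercloud legend (target 97.0%, Red <90%, Orange <97%) mechanically, the following is a minimal sketch, not part of the Tier1 tooling. The function name `classify` and the "green" label for on-target days are assumptions made for illustration; the daily values are copied from the Hammercloud table in this report.

```python
# Thresholds copied from the report legend: Red <90%, Orange <97%.
RED_BELOW = 90.0
ORANGE_BELOW = 97.0


def classify(score: float) -> str:
    """Map a daily percentage to a legend colour (hypothetical helper)."""
    if score < RED_BELOW:
        return "red"
    if score < ORANGE_BELOW:
        return "orange"
    # The legend names no colour for on-target days; "green" is an assumption.
    return "green"


# (Atlas HC, CMS HC) per day, copied from the Hammercloud table in this report.
hammercloud = {
    "2019-06-19": (100, 98),
    "2019-06-20": (100, 85),
    "2019-06-21": (0, 93),
    "2019-06-22": (100, 98),
    "2019-06-23": (100, 98),
    "2019-06-24": (100, 97),
    "2019-06-25": (100, 97),
}

# Colour for each day and test, keyed by day.
flags = {
    day: {"atlas_hc": classify(atlas), "cms_hc": classify(cms)}
    for day, (atlas, cms) in hammercloud.items()
}

# Days on which at least one test fell below the 97.0% target.
below_target = sorted(
    day for day, f in flags.items() if "green" not in (f["atlas_hc"], f["cms_hc"])
    or f["atlas_hc"] != "green" or f["cms_hc"] != "green"
)

if __name__ == "__main__":
    for day in sorted(flags):
        print(day, flags[day]["atlas_hc"], flags[day]["cms_hc"])
```

Under these thresholds a value of exactly 97 counts as on target, because the legend uses a strict "less than" comparison.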

Notes from Meeting.