Tier1 Operations Report 2019-06-24

From GridPP Wiki
Latest revision as of 09:18, 26 June 2019

RAL Tier1 Operations Report for 24th June 2019

Review of Issues during the week 10th June 2019 to the 17th June 2019.
  • Scheduled optical replacement work on the Janet Core in London suggested that there could be a prolonged outage at RAL.
 ** Additionally, there was concern that IPv6 might break and not fail over correctly (based on previous experience).
 ** In the event, the outage was momentary and no services were impacted. Both IPv6 and IPv4 failovers worked correctly.
  • CMS CPU efficiencies are currently describing a veritable sine curve over a weekly period.
 ** Investigations seem to suggest a 100% failure of “log collection” jobs at RAL.
 ** However, despite extensive investigation on the part of the Tier-1 Liaison, no one seems to know what this job type does (other than the obvious), or who is actually responsible for the monitoring/processing of this job type as CMS
Current operational status and issues
Resolved Castor Disk Server Issues
Machine VO DiskPool dxtx Comments
- - - -
Ongoing Castor Disk Server Issues
Machine VO DiskPool dxtx Comments
- - - - -
Limits on concurrent batch system jobs.
Notable Changes made since the last meeting.
  • NTR
Entries in GOC DB starting since the last report.
Service ID Scheduled? Outage/At Risk Start End Duration Reason
- - - - - - - -
Declared in the GOC DB
Service ID Scheduled? Outage/At Risk Start End Duration Reason
- - - - - - - -
  • No ongoing downtime
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot taken during morning of the meeting).


{| border=1 align=center
! Ticket-ID !! Type !! VO !! Site !! Priority !! Responsible Unit !! Status !! Last Update !! Subject !! Scope
|-
| 141872 || TEAM || lhcb || RAL-LCG2 || top priority || NGI_UK || in progress || 2019-06-26 08:29:00 || srm-lhcb.gridpp.rl.ac.uk seems in a bad state (time out) || WLCG
|-
| 141838 || USER || cms || RAL-LCG2 || urgent || NGI_UK || in progress || 2019-06-24 11:13:00 || Transfers failing from CERN Tape to RAL Disk || WLCG
|-
| 141608 || USER || snoplus.snolab.ca || RAL-LCG2 || less urgent || NGI_UK || in progress || 2019-06-06 08:55:00 || Permissions on RAL SE || EGI
|-
| 140870 || USER || t2k.org || RAL-LCG2 || less urgent || NGI_UK || in progress || 2019-06-20 14:35:00 || Files vanished from RAL tape? || EGI
|-
| 140447 || USER || dteam || RAL-LCG2 || less urgent || NGI_UK || on hold || 2019-05-22 14:20:00 || packet loss outbound from RAL-LCG2 over IPv6 || EGI
|-
| 140220 || USER || mice || RAL-LCG2 || less urgent || NGI_UK || in progress || 2019-06-25 13:03:00 || mice LFC to DFC transition || EGI
|-
| 139672 || USER || other || RAL-LCG2 || urgent || NGI_UK || waiting for reply || 2019-06-17 08:24:00 || No LIGO pilots running at RAL || EGI
|}
GGUS Tickets Closed Last week
{| border=1 align=center
! Ticket-ID !! Type !! VO !! Site !! Priority !! Responsible Unit !! Status !! Last Update !! Subject !! Scope
|-
| 141901 || USER || cms || RAL-LCG2 || urgent || NGI_UK || solved || 2019-06-25 18:49:00 || T1_UK_RAL SRM is timing out || WLCG
|-
| 141771 || USER || cms || RAL-LCG2 || urgent || NGI_UK || solved || 2019-06-24 14:00:00 || file read error at T1_UK_RAL || WLCG
|-
| 141638 || USER || cms || RAL-LCG2 || urgent || NGI_UK || closed || 2019-06-25 23:59:00 || SAM XROOTD read failure at T1_UK_RAL || WLCG
|-
| 141549 || TEAM || atlas || RAL-LCG2 || less urgent || NGI_UK || closed || 2019-06-25 23:59:00 || ATLAS-RAL-Frontier and some of Lpad-RAL-LCG2 squid degraded || WLCG
|-
| 141537 || TEAM || lhcb || RAL-LCG2 || very urgent || NGI_UK || verified || 2019-06-25 12:52:00 || Pilots Failed at RAL-LCG2 || WLCG
|-
| 141462 || TEAM || lhcb || RAL-LCG2 || top priority || NGI_UK || solved || 2019-06-25 15:52:00 || Error: Connection limit exceeded || WLCG
|}


Availability Report

Day Atlas CMS LHCB Alice Comments
2019-06-19 100 100 100 100
2019-06-20 100 86 100 100
2019-06-21 100 96 100 100
2019-06-22 100 22 100 100
2019-06-23 100 80 100 100
2019-06-24 100 95 91 93
2019-06-25 100 62 100 100
Hammercloud Test Report
Target Availability for each site is 97.0%. Red: <90%; Orange: <97%.
Day Atlas HC CMS HC Comment
2019-06-19 100 98
2019-06-20 100 85
2019-06-21 0 93
2019-06-22 100 98
2019-06-23 100 98
2019-06-24 100 97
2019-06-25 100 97



Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud
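
The colour bands quoted with the Hammercloud table (Red <90%, Orange <97%) can be applied mechanically to the week's figures. The following is only a minimal sketch for illustration: the function name `classify`, the label "ok" for values at or above target, and the dictionary layout are assumptions made here, not part of any HammerCloud or GridPP tooling. The numbers are copied from the table above.

```python
# HammerCloud results for the reporting week, copied from the table above.
HAMMERCLOUD = {
    "2019-06-19": {"Atlas HC": 100, "CMS HC": 98},
    "2019-06-20": {"Atlas HC": 100, "CMS HC": 85},
    "2019-06-21": {"Atlas HC": 0, "CMS HC": 93},
    "2019-06-22": {"Atlas HC": 100, "CMS HC": 98},
    "2019-06-23": {"Atlas HC": 100, "CMS HC": 98},
    "2019-06-24": {"Atlas HC": 100, "CMS HC": 97},
    "2019-06-25": {"Atlas HC": 100, "CMS HC": 97},
}


def classify(value, red_below=90.0, orange_below=97.0):
    """Map a percentage to the report's colour bands.

    "red" and "orange" follow the legend in the report; "ok" is a
    hypothetical label chosen here for values that meet the target.
    """
    if value < red_below:
        return "red"
    if value < orange_below:
        return "orange"
    return "ok"


# Collect every (day, test, value, band) entry that falls below target.
flagged = [
    (day, test, value, classify(value))
    for day, results in HAMMERCLOUD.items()
    for test, value in results.items()
    if classify(value) != "ok"
]

for day, test, value, band in flagged:
    print(f"{day} {test}: {value}% ({band})")
```

Under these assumptions, a value of exactly 97 counts as meeting the target, since the legend defines orange as strictly below 97%.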

Notes from Meeting.