Difference between revisions of "Tier1 Operations Report 2019-11-13"

From GridPP Wiki
(Created page with "==RAL Tier1 Operations Report for 13th November 2019== __NOTOC__ ====== ====== <!-- ************************************************************* -----> <!-- ***********Star...")
 
 
(7 intermediate revisions by one user not shown)
Line 10: Line 10:
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Review of Issues during the week 6th November 2019 to the 12th November 2019.
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Review of Issues during the week 6th November 2019 to the 12th November 2019.
 
|}
 
|}
* CMS FTS transfer failures to srm-cms.gridpp.rl.ac.uk, attributed to "//", are actually an issue with Rucio testing. KE is investigating.
+
* Network issue with a single WN causing LHCb transfer failures from WNs to offsite SE.
 +
* Echo monitors: the new Ceph version improved the response time.
 +
* ECHO GW gsiFTP concurrent transfer limit increased.
  
 
<!-- ***********End Review of Issues during last week*********** ----->
 
<!-- ***********End Review of Issues during last week*********** ----->
Line 115: Line 117:
 
'''Listing by category:'''
 
'''Listing by category:'''
  
* DNS servers will be rolled out within the Tier1 network.
 
 
<!-- ***************End Advanced warning for other interventions*************** ----->
 
<!-- ***************End Advanced warning for other interventions*************** ----->
 
<!-- ************************************************************************** ----->
 
<!-- ************************************************************************** ----->
Line 140: Line 141:
 
! Scope
 
! Scope
 
|-
 
|-
| 143916
+
| 144024
 
| USER
 
| USER
 
| cms
 
| cms
 
| RAL-LCG2
 
| RAL-LCG2
| urgent
+
| very urgent
 
| NGI_UK
 
| NGI_UK
 
| in progress
 
| in progress
| 2019-11-05 07:39:00
+
| 2019-11-13 10:31:00
| Transfers failing to T1_UK_RAL_Disk
+
| File Read Issues where files are located at RAL
 
| WLCG
 
| WLCG
 
|-
 
|-
| 143767
+
| 144015
 
| USER
 
| USER
| cms
+
| other
 
| RAL-LCG2
 
| RAL-LCG2
| urgent
+
| less urgent
 
| NGI_UK
 
| NGI_UK
 
| in progress
 
| in progress
| 2019-11-01 14:48:00
+
| 2019-11-12 13:52:00
| FIle read issues for Workflows where data is located at T1_UK_RAL
+
| Stalled LSST jobs at RAL
| WLCG
+
| EGI
 
|-
 
|-
 
| 143762
 
| 143762
Line 217: Line 218:
 
| WLCG
 
| WLCG
 
|}
 
|}
 
 
  
 
<!-- **********************End Availability Report************************** ----->
 
<!-- **********************End Availability Report************************** ----->
Line 232: Line 231:
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | GGUS Tickets Closed Last week
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | GGUS Tickets Closed Last week
 
|}
 
|}
 
 
{| border=1 align=center
 
{| border=1 align=center
 
|- bgcolor="#7c8aaf"
 
|- bgcolor="#7c8aaf"
Line 246: Line 244:
 
! Scope
 
! Scope
 
|-
 
|-
| 143917
+
| 143967
 
| USER
 
| USER
 
| cms
 
| cms
Line 253: Line 251:
 
| NGI_UK
 
| NGI_UK
 
| solved
 
| solved
| 2019-11-04 17:08:00
+
| 2019-11-09 00:17:00
| Transfers failing to T1_UK_RAL_Disk
+
| T1_UK_RAL is failing SAM - SRM, XRD
 
| WLCG
 
| WLCG
 
|-
 
|-
| 143876
+
| 143965
| USER
+
| TEAM
| cms
+
| atlas
 
| RAL-LCG2
 
| RAL-LCG2
 
| urgent
 
| urgent
 
| NGI_UK
 
| NGI_UK
 
| solved
 
| solved
| 2019-11-01 14:46:00
+
| 2019-11-08 11:42:00
| T1_UK_RAL HammerCloud cannot reach files via xrootd
+
| RAL-LCG2: TRANSFER [70] TRANSFER globus_ftp_client: the server responded with an error 421
 
| WLCG
 
| WLCG
 
|-
 
|-
| 143874
+
| 143916
 
| USER
 
| USER
| ops
+
| cms
 
| RAL-LCG2
 
| RAL-LCG2
| less urgent
+
| urgent
 
| NGI_UK
 
| NGI_UK
| verified
+
| solved
| 2019-11-01 12:31:00
+
| 2019-11-11 08:39:00
| [Rod Dashboard] Issue detected : org.nagios.BDII-Check@lcgbdii.gridpp.rl.ac.uk
+
| Transfers failing to T1_UK_RAL_Disk
| EGI
+
| WLCG
 
|-
 
|-
 
| 143869
 
| 143869
Line 290: Line 288:
 
| WLCG
 
| WLCG
 
|-
 
|-
| 143838
+
| 143774
| TEAM
+
| USER
| atlas
+
| cms
 
| RAL-LCG2
 
| RAL-LCG2
| less urgent
+
| urgent
 
| NGI_UK
 
| NGI_UK
| solved
+
| closed
| 2019-11-01 11:17:00
+
| 2019-11-08 23:59:00
| RAL-LCG2: TRANSFER an end-of-file was reached globus_xio: An end of file occurred
+
| cernvmfs.gridpp.rl.ac.uk inaccessible over IPv6
| WLCG
+
| EGI
 
|-
 
|-
| 143834
+
| 143767
 
| USER
 
| USER
 
| cms
 
| cms
Line 308: Line 306:
 
| NGI_UK
 
| NGI_UK
 
| solved
 
| solved
| 2019-10-30 11:48:00
+
| 2019-11-11 08:24:00
| transfers failing to T1_UK_RAL_Disk
+
| FIle read issues for Workflows where data is located at T1_UK_RAL
 
| WLCG
 
| WLCG
 
|-
 
|-
| 143831
+
| 143765
| TEAM
+
| USER
| lhcb
+
| cms
 
| RAL-LCG2
 
| RAL-LCG2
| very urgent
+
| urgent
 
| NGI_UK
 
| NGI_UK
| verified
+
| closed
| 2019-10-30 12:28:00
+
| 2019-11-07 23:59:00
| low efficiency at gsiftp://gridftp.echo.stfc.ac.uk
+
| RAL redirector unsubscribed from  federation
 
| WLCG
 
| WLCG
 
|}
 
|}
 
 
 
 
 
 
<!-- **********************End Availability Report************************** ----->
 
<!-- **********************End Availability Report************************** ----->
 
<!-- *********************************************************************** ----->
 
<!-- *********************************************************************** ----->
Line 348: Line 341:
 
! LHCB
 
! LHCB
 
! Alice
 
! Alice
! Comments
 
 
|-
 
|-
| 2019-10-30
+
| 2019-11-06
 
| 100
 
| 100
 +
| 97
 
| 100
 
| 100
 
| 100
 
| 100
| 100
 
|
 
 
|-
 
|-
| 2019-10-31
+
| 2019-11-07
 
| 100
 
| 100
 +
| 87
 
| 100
 
| 100
 
| 100
 
| 100
| 100
 
|
 
 
|-
 
|-
| 2019-11-01
+
| 2019-11-08
 
| 100
 
| 100
 
| 100
 
| 100
 
| 100
 
| 100
 
| 100
 
| 100
|
 
 
|-
 
|-
| 2019-11-02
+
| 2019-11-09
 
| 100
 
| 100
 
| 100
 
| 100
 
| 100
 
| 100
 
| 100
 
| 100
|
 
 
|-
 
|-
| 2019-11-03
+
| 2019-11-10
 
| 100
 
| 100
 
| 100
 
| 100
 
| 100
 
| 100
 
| 100
 
| 100
|
 
 
|-
 
|-
| 2019-11-04
+
| 2019-11-11
 +
| 100
 
| 100
 
| 100
| 97
 
 
| 100
 
| 100
 
| 100
 
| 100
|
 
 
|-
 
|-
| 2019-11-05
+
| 2019-11-12
 +
| 100
 
| 100
 
| 100
| 97
 
 
| 100
 
| 100
 
| 100
 
| 100
|
 
 
|}
 
|}
  
Line 416: Line 401:
 
! Day !! Atlas HC !! CMS HC !! Comment
 
! Day !! Atlas HC !! CMS HC !! Comment
 
|-
 
|-
| 2019-10-30 || 100 || 64||  
+
| 2019-11-06 || 100 || 98||  
 
|-
 
|-
| 2019-10-31 || 96 || 97 ||  
+
| 2019-11-07 || 100 || n/a||  
 
|-
 
|-
| 2019-10-01 || 100 || 96||  
+
| 2019-11-08 || 100 || n/a||  
 
|-
 
|-
| 2019-10-02 || 100 || 97 ||  
+
| 2019-11-09 || 0 || 93 ||  
 
|-
 
|-
| 2019-10-03 || 96 || 98||  
+
| 2019-11-10 || 0 || n/a||  
 
|-
 
|-
| 2019-10-04|| 100|| 99||  
+
| 2019-11-11|| 0|| 88||  
 
|-
 
|-
| 2019-10-05 || 100 || 99 ||  
+
| 2019-11-12 || 100 || 88 ||  
 
|-
 
|-
 
|}  
 
|}  

Latest revision as of 13:11, 13 November 2019

RAL Tier1 Operations Report for 13th November 2019

Review of Issues during the week 6th November 2019 to the 12th November 2019.
  • Network issue with a single WN causing LHCb transfer failures from WNs to offsite SE.
  • Echo monitors: the new Ceph version improved the response time.
  • ECHO GW gsiFTP concurrent transfer limit increased.


Current operational status and issues
Notable Changes made since the last meeting.
  • NTR
Entries in GOC DB starting since the last report.
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Declared in the GOC DB
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
- | - | - | - | - | - | - | -
  • No ongoing downtime
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.


Listing by category:


Open GGUS Tickets
Ticket-ID | Type | VO | Site | Priority | Responsible Unit | Status | Last Update | Subject | Scope
144024 | USER | cms | RAL-LCG2 | very urgent | NGI_UK | in progress | 2019-11-13 10:31:00 | File Read Issues where files are located at RAL | WLCG
144015 | USER | other | RAL-LCG2 | less urgent | NGI_UK | in progress | 2019-11-12 13:52:00 | Stalled LSST jobs at RAL | EGI
143762 | TEAM | lhcb | RAL-LCG2 | urgent | NGI_UK | in progress | 2019-10-23 14:12:00 | Stop using sl6 queues at RAL | WLCG
143669 | USER | snoplus.snolab.ca | RAL-LCG2 | urgent | NGI_UK | in progress | 2019-10-18 14:25:00 | SNO+ LFC to DFC migration | EGI
143645 | TEAM | lhcb | RAL-LCG2 | top priority | NGI_UK | on hold | 2019-10-30 14:42:00 | Jobs Failed to access files at RAL-LCG2 | WLCG
143323 | TEAM | lhcb | RAL-LCG2 | top priority | NGI_UK | on hold | 2019-10-30 14:43:00 | File deletion at RAL ECHO | WLCG
142350 | TEAM | lhcb | RAL-LCG2 | top priority | NGI_UK | on hold | 2019-10-30 14:44:00 | Proble accessing some LHCb files at RAL | WLCG


GGUS Tickets Closed Last week
Ticket-ID | Type | VO | Site | Priority | Responsible Unit | Status | Last Update | Subject | Scope
143967 | USER | cms | RAL-LCG2 | urgent | NGI_UK | solved | 2019-11-09 00:17:00 | T1_UK_RAL is failing SAM - SRM, XRD | WLCG
143965 | TEAM | atlas | RAL-LCG2 | urgent | NGI_UK | solved | 2019-11-08 11:42:00 | RAL-LCG2: TRANSFER [70] TRANSFER globus_ftp_client: the server responded with an error 421 | WLCG
143916 | USER | cms | RAL-LCG2 | urgent | NGI_UK | solved | 2019-11-11 08:39:00 | Transfers failing to T1_UK_RAL_Disk | WLCG
143869 | TEAM | lhcb | RAL-LCG2 | very urgent | NGI_UK | verified | 2019-11-06 11:31:00 | (again) file transfers low efficiency | WLCG
143774 | USER | cms | RAL-LCG2 | urgent | NGI_UK | closed | 2019-11-08 23:59:00 | cernvmfs.gridpp.rl.ac.uk inaccessible over IPv6 | EGI
143767 | USER | cms | RAL-LCG2 | urgent | NGI_UK | solved | 2019-11-11 08:24:00 | FIle read issues for Workflows where data is located at T1_UK_RAL | WLCG
143765 | USER | cms | RAL-LCG2 | urgent | NGI_UK | closed | 2019-11-07 23:59:00 | RAL redirector unsubscribed from federation | WLCG

Availability Report

Day | Atlas | CMS | LHCB | Alice
2019-11-06 | 100 | 97 | 100 | 100
2019-11-07 | 100 | 87 | 100 | 100
2019-11-08 | 100 | 100 | 100 | 100
2019-11-09 | 100 | 100 | 100 | 100
2019-11-10 | 100 | 100 | 100 | 100
2019-11-11 | 100 | 100 | 100 | 100
2019-11-12 | 100 | 100 | 100 | 100
Hammercloud Test Report
Target Availability for each site is 97.0%
Day | Atlas HC | CMS HC | Comment
2019-11-06 | 100 | 98 |
2019-11-07 | 100 | n/a |
2019-11-08 | 100 | n/a |
2019-11-09 | 0 | 93 |
2019-11-10 | 0 | n/a |
2019-11-11 | 0 | 88 |
2019-11-12 | 100 | 88 |

Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud
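
As a rough illustration of how the HammerCloud figures above compare with the stated 97.0% target, the following Python sketch lists the days on which each column falls below it. The values are copied from the table; the names `hc_results` and `days_below_target` are hypothetical, and treating "n/a" as "no result, skip" and applying the 97.0% target to the HammerCloud columns are assumptions, not something the report states.

```python
TARGET = 97.0  # target availability quoted in the report (percent)

# (day, Atlas HC, CMS HC) as listed in the Hammercloud table.
# None stands for an "n/a" entry (assumption: no test result that day).
hc_results = [
    ("2019-11-06", 100, 98),
    ("2019-11-07", 100, None),
    ("2019-11-08", 100, None),
    ("2019-11-09", 0, 93),
    ("2019-11-10", 0, None),
    ("2019-11-11", 0, 88),
    ("2019-11-12", 100, 88),
]


def days_below_target(rows, column, target=TARGET):
    """Return the days whose value in `column` is below `target`, skipping n/a."""
    return [
        row[0]
        for row in rows
        if row[column] is not None and row[column] < target
    ]


atlas_below = days_below_target(hc_results, 1)
cms_below = days_below_target(hc_results, 2)

print("Atlas HC below target:", atlas_below)
print("CMS HC below target:", cms_below)
```

With the figures above this flags 2019-11-09 to 2019-11-11 for Atlas HC, and 2019-11-09, 2019-11-11 and 2019-11-12 for CMS HC.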

Notes from Meeting.