Difference between revisions of "Tier1 Operations Report 2019-07-31"

From GridPP Wiki
Line 8: Line 8:
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
|-
 
|-
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Review of Issues during the week 17th July 2019 to the 24th July 2019.
+
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Review of Issues during the week 25th July 2019 to the 31st July 2019.
 
|}
 
|}
* A
+
* Investigation of LHCb file access problem is ongoing.
 +
* OPN call-out on Tuesday morning.  Partial failure did not affect service availability.  Problem assumed to be transient and not investigated further.
 +
** Possible site connectivity change by DI to happen this week.
 +
* Echo xrootd gateway issues
 +
** Led to a drain and reboot of the batch farm
 +
 
 
<!-- ***********End Review of Issues during last week*********** ----->
 
<!-- ***********End Review of Issues during last week*********** ----->
 
<!-- *********************************************************** ----->
 
<!-- *********************************************************** ----->
Line 21: Line 26:
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Current operational status and issues
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Current operational status and issues
 
|}
 
|}
*
+
* CMS AAA issues.
 +
* Farm recovery from Docker issue relating to xrootd.
 
<!-- ***********End Current operational status and issues*********** ----->
 
<!-- ***********End Current operational status and issues*********** ----->
 
<!-- *************************************************************** ----->
 
<!-- *************************************************************** ----->
Line 33: Line 39:
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Notable Changes made since the last meeting.
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Notable Changes made since the last meeting.
 
|}
 
|}
* NTR
+
* Test FTS instance upgraded
 
<!-- *************End Notable Changes made this last week************** ----->
 
<!-- *************End Notable Changes made this last week************** ----->
 
<!-- ****************************************************************** ----->
 
<!-- ****************************************************************** ----->
Line 109: Line 115:
 
|}
 
|}
 
<!-- ******* still to be formally scheduled and/or announced ******* ----->
 
<!-- ******* still to be formally scheduled and/or announced ******* ----->
 +
 +
 
'''Listing by category:'''
 
'''Listing by category:'''
 +
* FTS Production instance to be upgraded (requires service downtime). Proposed for 6 August 2019 if acceptable to the VOs.
 
* DNS servers will be rolled out within the Tier1 network.
 
* DNS servers will be rolled out within the Tier1 network.
 
<!-- ***************End Advanced warning for other interventions*************** ----->
 
<!-- ***************End Advanced warning for other interventions*************** ----->
Line 205: Line 214:
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | GGUS Tickets Closed Last week
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | GGUS Tickets Closed Last week
 
|}
 
|}
 +
 +
 
{| border=1 align=center
 
{| border=1 align=center
 
|- bgcolor="#7c8aaf"
 
|- bgcolor="#7c8aaf"
Line 218: Line 229:
 
! Scope
 
! Scope
 
|-
 
|-
| 141990
+
| 142264
 
| USER
 
| USER
 
| cms
 
| cms
Line 225: Line 236:
 
| NGI_UK
 
| NGI_UK
 
| closed
 
| closed
| 2019-07-23 23:59:00
+
| 2019-07-30 23:59:00
| Intermittent HC failures at T1_UK_RAL
+
| Sam Test in warning at T1_UK_RAL
 
| WLCG
 
| WLCG
 
|-
 
|-
| 141968
+
| 142251
 
| USER
 
| USER
| cms
+
| snoplus.snolab.ca
 
| RAL-LCG2
 
| RAL-LCG2
| very urgent
+
| urgent
 
| NGI_UK
 
| NGI_UK
 
| closed
 
| closed
| 2019-07-18 23:59:00
+
| 2019-07-29 23:59:00
| SAM (CE) and Hammer Cloud Failures at T1_UK_RAL
+
| Transfers to RAL (srm-snoplus.gridpp.rl.ac.uk) are failing
| WLCG
+
| EGI
 
|-
 
|-
| 139672
+
| 142155
 
| USER
 
| USER
| other
+
| cms
 
| RAL-LCG2
 
| RAL-LCG2
 
| urgent
 
| urgent
 
| NGI_UK
 
| NGI_UK
 
| closed
 
| closed
| 2019-07-23 23:59:00
+
| 2019-07-25 23:59:00
| No LIGO pilots running at RAL
+
| Transfers are failing from UK to KIPT
| EGI
+
| WLCG
 
|}
 
|}
 
<!-- **********************End Availability Report************************** ----->
 
<!-- **********************End Availability Report************************** ----->
Line 264: Line 275:
 
Availability Report
 
Availability Report
 
|}
 
|}
 +
 +
 
{| border=1 align=center
 
{| border=1 align=center
 
|- bgcolor="#7c8aaf"
 
|- bgcolor="#7c8aaf"
Line 273: Line 286:
 
! Comments
 
! Comments
 
|-
 
|-
| 2019-07-17
+
| 2019-07-24
 
| 100
 
| 100
 
| 100
 
| 100
Line 280: Line 293:
 
|  
 
|  
 
|-
 
|-
| 2019-07-18
+
| 2019-07-25
 
| 100
 
| 100
 
| 100
 
| 100
Line 287: Line 300:
 
|  
 
|  
 
|-
 
|-
| 2019-07-19
+
| 2019-07-26
 
| 100
 
| 100
 
| 100
 
| 100
Line 294: Line 307:
 
|  
 
|  
 
|-
 
|-
| 2019-07-20
+
| 2019-07-27
 
| 100
 
| 100
 
| 100
 
| 100
Line 301: Line 314:
 
|  
 
|  
 
|-
 
|-
| 2019-07-21
+
| 2019-07-28
 
| 100
 
| 100
 
| 100
 
| 100
Line 308: Line 321:
 
|  
 
|  
 
|-
 
|-
| 2019-07-22
+
| 2019-07-29
 
| 100
 
| 100
 
| 100
 
| 100
Line 315: Line 328:
 
|  
 
|  
 
|-
 
|-
| 2019-07-23
+
| 2019-07-30
| 100
+
| 100
+
| 100
+
| 100
+
|
+
|-
+
| 2019-07-24
+
| 100
+
 
| 100
 
| 100
 +
| 91
 
| 100
 
| 100
 
| 100
 
| 100
Line 346: Line 352:
 
! Day !! Atlas HC !! CMS HC !! Comment
 
! Day !! Atlas HC !! CMS HC !! Comment
 
|-
 
|-
| 2019-06-17 || 100 || 93 ||  
+
| 2019-06-24 || 100 || 100 ||  
 
|-
 
|-
| 2019-06-18 || 100 || N/A ||  
+
| 2019-06-25 || 62 || 100 ||  
 
|-
 
|-
| 2019-06-19 || 100 || 95 ||  
+
| 2019-06-26 || 100 || n/a ||  
 
|-
 
|-
| 2019-06-20 || 100 || 90 ||  
+
| 2019-06-27 || 100 || 100 ||  
 
|-
 
|-
| 2019-06-21 || 100 || 92 ||  
+
| 2019-06-28 || 100 || 100||  
 
|-
 
|-
| 2019-07-22|| 100 || N/A||  
+
| 2019-07-29|| 100 || 96||  
 
|-
 
|-
| 2019-07-23 || 100 || N/A ||  
+
| 2019-07-30 || 96 || 96 ||  
 
|-
 
|-
 
|}  
 
|}  

Latest revision as of 11:39, 31 July 2019

RAL Tier1 Operations Report for 31st July 2019

Review of Issues during the week 25th July 2019 to the 31st July 2019.
  • Investigation of LHCb file access problem is ongoing.
  • OPN call-out on Tuesday morning. Partial failure did not affect service availability. Problem assumed to be transient and not investigated further.
    • Possible site connectivity change by DI to happen this week.
  • Echo xrootd gateway issues
    • Led to a drain and reboot of the batch farm


Current operational status and issues
  • CMS AAA issues.
  • Farm recovery from Docker issue relating to xrootd.
Notable Changes made since the last meeting.
  • Test FTS instance upgraded
Entries in GOC DB starting since the last report.
{| border=1 align=center
|- bgcolor="#7c8aaf"
! Service !! ID !! Scheduled? !! Outage/At Risk !! Start !! End !! Duration !! Reason
|-
| - || - || - || - || - || - || - || -
|}
Declared in the GOC DB
{| border=1 align=center
|- bgcolor="#7c8aaf"
! Service !! ID !! Scheduled? !! Outage/At Risk !! Start !! End !! Duration !! Reason
|-
| - || - || - || - || - || - || - || -
|}
  • No ongoing downtime
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.


Listing by category:

  • FTS Production instance to be upgraded (requires service downtime). Proposed for 6 August 2019 if acceptable to the VOs.
  • DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets
{| border=1 align=center
|- bgcolor="#7c8aaf"
! Ticket-ID !! Type !! VO !! Site !! Priority !! Responsible Unit !! Status !! Last Update !! Subject !! Scope
|-
| 142350 || TEAM || lhcb || RAL-LCG2 || top priority || NGI_UK || in progress || 2019-07-29 16:14:00 || Problem accessing some LHCb files at RAL || WLCG
|-
| 142337 || TEAM || lhcb || RAL-LCG2 || very urgent || NGI_UK || in progress || 2019-07-19 12:14:00 || Pilots Failed at RAL-LCG2 || WLCG
|-
| 142203 || TEAM || atlas || RAL-LCG2 || urgent || NGI_UK || on hold || 2019-07-24 18:33:00 || RAL-LCG2_MCORE jobs failing || WLCG
|-
| 140447 || USER || dteam || RAL-LCG2 || less urgent || NGI_UK || on hold || 2019-07-10 13:41:00 || packet loss outbound from RAL-LCG2 over IPv6 || EGI
|-
| 140220 || USER || mice || RAL-LCG2 || less urgent || NGI_UK || waiting for reply || 2019-07-29 14:08:00 || mice LFC to DFC transition || EGI
|}



GGUS Tickets Closed Last week


{| border=1 align=center
|- bgcolor="#7c8aaf"
! Ticket-ID !! Type !! VO !! Site !! Priority !! Responsible Unit !! Status !! Last Update !! Subject !! Scope
|-
| 142264 || USER || cms || RAL-LCG2 || urgent || NGI_UK || closed || 2019-07-30 23:59:00 || Sam Test in warning at T1_UK_RAL || WLCG
|-
| 142251 || USER || snoplus.snolab.ca || RAL-LCG2 || urgent || NGI_UK || closed || 2019-07-29 23:59:00 || Transfers to RAL (srm-snoplus.gridpp.rl.ac.uk) are failing || EGI
|-
| 142155 || USER || cms || RAL-LCG2 || urgent || NGI_UK || closed || 2019-07-25 23:59:00 || Transfers are failing from UK to KIPT || WLCG
|}

Availability Report


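{| border=1 align=center
|- bgcolor="#7c8aaf"
! Day !! Atlas !! CMS !! LHCB !! Alice !! Comments
|-
| 2019-07-24 || 100 || 100 || 100 || 100 || 
|-
| 2019-07-25 || 100 || 100 || 100 || 100 || 
|-
| 2019-07-26 || 100 || 100 || 100 || 100 || 
|-
| 2019-07-27 || 100 || 100 || 100 || 100 || 
|-
| 2019-07-28 || 100 || 100 || 100 || 100 || 
|-
| 2019-07-29 || 100 || 100 || 100 || 100 || 
|-
| 2019-07-30 || 100 || 91 || 100 || 100 || 
|}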
Hammercloud Test Report
Target Availability for each site is 97.0%
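{| border=1 align=center
|- bgcolor="#7c8aaf"
! Day !! Atlas HC !! CMS HC !! Comment
|-
| 2019-06-24 || 100 || 100 || 
|-
| 2019-06-25 || 62 || 100 || 
|-
| 2019-06-26 || 100 || n/a || 
|-
| 2019-06-27 || 100 || 100 || 
|-
| 2019-06-28 || 100 || 100 || 
|-
| 2019-07-29 || 100 || 96 || 
|-
| 2019-07-30 || 96 || 96 || 
|}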

Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud
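
The Hammercloud figures can be checked against the stated 97.0% target with a short script. The following is only a minimal sketch: it assumes the target is applied to each daily value individually (the report does not say how the target is evaluated), the function name `below_target` is hypothetical, and the rows are copied as listed in the Hammercloud table, with `None` standing in for "n/a".

```python
TARGET = 97.0  # target availability quoted in the report (percent)

# Rows copied from the Hammercloud table: (day, Atlas HC, CMS HC).
# None stands for an "n/a" entry.
rows = [
    ("2019-06-24", 100, 100),
    ("2019-06-25", 62, 100),
    ("2019-06-26", 100, None),
    ("2019-06-27", 100, 100),
    ("2019-06-28", 100, 100),
    ("2019-07-29", 100, 96),
    ("2019-07-30", 96, 96),
]


def below_target(rows, target=TARGET):
    """Return (day, test, value) for every reading under the target.

    Assumption: the target is compared against each daily value;
    "n/a" readings (None) are skipped rather than counted as failures.
    """
    flagged = []
    for day, atlas, cms in rows:
        for name, value in (("Atlas HC", atlas), ("CMS HC", cms)):
            if value is not None and value < target:
                flagged.append((day, name, value))
    return flagged


for day, name, value in below_target(rows):
    print(f"{day} {name}: {value}% is below the {TARGET}% target")
```

Skipping "n/a" entries is a design choice made here for illustration; a stricter check could treat missing readings as failures instead.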

Notes from Meeting.