Difference between revisions of "Tier1 Operations Report 2018-05-28"

From GridPP Wiki
Line 9: Line 9:
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
|-
 
|-
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Review of Issues during the week 22nd May to the 28th May 2018.
+
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Review of Issues during the week 14th May to the 21st May 2018.
 
|}
 
|}
* Production has had a reasonably quiet week. We have had a noticeable spike in drive failures for LHCb; however, we have not needed to declare data loss.
+
* No incidents (major or minor) have been flagged during this reporting period.
 
<!-- ***********End Review of Issues during last week*********** ----->
 
<!-- ***********End Review of Issues during last week*********** ----->
 
<!-- *********************************************************** ----->
 
<!-- *********************************************************** ----->
Line 33: Line 33:
 
| style="background-color: #f8d6a9; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Resolved Castor Disk Server Issues
 
| style="background-color: #f8d6a9; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Resolved Castor Disk Server Issues
 
|}
 
|}
* gdss700 (lhcbDst- D1T0) - Back in production RW
+
* gdss732 (lhcbDst- D1T0) - Back in production after the replacement drive finished rebuilding.
* gdss711 (lhcbDst- D1T0) - Back in production RO
+
 
<!-- ***************************************************** ----->
 
<!-- ***************************************************** ----->
  
Line 44: Line 43:
 
| style="background-color: #f8d6a9; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Ongoing Castor Disk Server Issues
 
| style="background-color: #f8d6a9; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Ongoing Castor Disk Server Issues
 
|}
 
|}
* gdss732 (lhcbDst- D1T0) - Currently in intervention
+
* None
 
<!-- ***************End Ongoing Disk Server Issues**************** ----->
 
<!-- ***************End Ongoing Disk Server Issues**************** ----->
 
<!-- ************************************************************* ----->
 
<!-- ************************************************************* ----->
Line 124: Line 123:
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Open  
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Open  
 
GGUS Tickets (Snapshot during morning of the report).  The latest ticket snapshot can be found here[https://ggus.eu/index.php?mode=ticket_search&show_columns_check%5B%5D=TICKET_TYPE&show_columns_check%5B%5D=AFFECTED_VO&show_columns_check%5B%5D=AFFECTED_SITE&show_columns_check%5B%5D=PRIORITY&show_columns_check%5B%5D=RESPONSIBLE_UNIT&show_columns_check%5B%5D=STATUS&show_columns_check%5B%5D=DATE_OF_CHANGE&show_columns_check%5B%5D=SHORT_DESCRIPTION&show_columns_check%5B%5D=SCOPE&ticket_id=&supportunit=&su_hierarchy=0&former_su=&vo=&user=&keyword=&involvedsupporter=&assignedto=&affectedsite=RAL-LCG2&specattrib=none&status=open&priority=&typeofproblem=all&ticket_category=all&mouarea=&date_type=creation+date&tf_radio=1&timeframe=any&from_date=14+May+2018&to_date=22+May+2018&untouched_date=&scope=&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO%21].
 
GGUS Tickets (Snapshot during morning of the report).  The latest ticket snapshot can be found here[https://ggus.eu/index.php?mode=ticket_search&show_columns_check%5B%5D=TICKET_TYPE&show_columns_check%5B%5D=AFFECTED_VO&show_columns_check%5B%5D=AFFECTED_SITE&show_columns_check%5B%5D=PRIORITY&show_columns_check%5B%5D=RESPONSIBLE_UNIT&show_columns_check%5B%5D=STATUS&show_columns_check%5B%5D=DATE_OF_CHANGE&show_columns_check%5B%5D=SHORT_DESCRIPTION&show_columns_check%5B%5D=SCOPE&ticket_id=&supportunit=&su_hierarchy=0&former_su=&vo=&user=&keyword=&involvedsupporter=&assignedto=&affectedsite=RAL-LCG2&specattrib=none&status=open&priority=&typeofproblem=all&ticket_category=all&mouarea=&date_type=creation+date&tf_radio=1&timeframe=any&from_date=14+May+2018&to_date=22+May+2018&untouched_date=&scope=&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO%21].
 +
 
|}
 
|}
 
{| border=1 align=center
 
{| border=1 align=center
Line 136: Line 136:
 
! Subject
 
! Subject
 
! Scope
 
! Scope
|-
 
| style="background-color: green;" | 135342
 
| ops
 
| in progress
 
| less urgent
 
| 27/05/2018
 
| 28/05/2018
 
| Operations
 
| [Rod Dashboard] Issue detected : egi.eu.lowAvailability-/RAL-LCG2@RAL-LCG2_Availability
 
| EGI
 
|-
 
| style="background-color: green;" | 135308
 
| mice
 
| in progress
 
| top priority
 
| 24/05/2018
 
| 29/05/2018
 
| Information System
 
| Can't send data to RAL Castor
 
| EGI
 
|-
 
| style="background-color: green;" | 135293
 
| ops
 
| in progress
 
| less urgent
 
| 23/05/2018
 
| 27/05/2018
 
| Operations
 
| [Rod Dashboard] Issues detected at RAL-LCG2
 
| EGI
 
 
|-
 
|-
 
| style="background-color: green;" | 135133
 
| style="background-color: green;" | 135133
Line 172: Line 142:
 
| urgent
 
| urgent
 
| 15/05/2018
 
| 15/05/2018
| 29/05/2018
+
| 17/05/2018
 
| CMS_Data Transfers
 
| CMS_Data Transfers
 
| Likely corrupted File at T1_UK_RAL
 
| Likely corrupted File at T1_UK_RAL
Line 182: Line 152:
 
| urgent
 
| urgent
 
| 23/04/2018
 
| 23/04/2018
| 25/05/2018
+
| 18/05/2018
 
| CMS_Data Transfers
 
| CMS_Data Transfers
 
| Transfer failing from RAL_Disk
 
| Transfer failing from RAL_Disk
Line 202: Line 172:
 
| top priority
 
| top priority
 
| 09/04/2018
 
| 09/04/2018
| 25/05/2018
+
| 18/05/2018
 
| CMS_AAA WAN Access
 
| CMS_AAA WAN Access
 
| Xrootd redirector not seeing some files in ECHO
 
| Xrootd redirector not seeing some files in ECHO
Line 212: Line 182:
 
| less urgent
 
| less urgent
 
| 12/03/2018
 
| 12/03/2018
| 22/05/2018
+
| 19/04/2018
 
| File Transfer
 
| File Transfer
 
| RAL-LCG2-ECHO:  No such file or directory
 
| RAL-LCG2-ECHO:  No such file or directory
Line 271: Line 241:
 
! Scope
 
! Scope
 
|-
 
|-
| 135001
+
| 135164
 +
| none
 +
| verified
 +
| top priority
 +
| 16/05/2018
 +
| 16/05/2018
 +
| Other
 +
| This TEST ALARM has been raised for testing GGUS alarm work flow after a new GGUS release.
 +
| WLCG
 +
|-
 +
| 134853
 
| cms
 
| cms
 
| closed
 
| closed
 
| urgent
 
| urgent
| 09/05/2018
+
| 02/05/2018
| 24/05/2018
+
| 16/05/2018
| CMS_Data Transfers
+
| CMS_Facilities
| Fts-client needs to be updated
+
| T1_UK_RAL HammerCloud failures
 +
| WLCG
 +
|-
 +
| 134839
 +
| snoplus.snolab.ca
 +
| closed
 +
| urgent
 +
| 30/04/2018
 +
| 16/05/2018
 +
| File Transfer
 +
| Data Transfer Failure
 +
| EGI
 +
|-
 +
| 134737
 +
| cms
 +
| closed
 +
| urgent
 +
| 25/04/2018
 +
| 14/05/2018
 +
| CMS_SAM tests
 +
| SAM CE test failing at T1_UK_RAL
 +
| WLCG
 +
|-
 +
| 134494
 +
| atlas
 +
| closed
 +
| urgent
 +
| 11/04/2018
 +
| 17/05/2018
 +
| Storage Systems
 +
| json space reporting not updated
 
| WLCG
 
| WLCG
 
|}
 
|}
Line 309: Line 319:
 
! Comments
 
! Comments
 
|-
 
|-
| 2018-05-22
+
| 2018-05-14
 +
| 100
 
| 100
 
| 100
 
| 100
 
| 100
Line 315: Line 326:
 
| 100
 
| 100
 
| 100
 
| 100
| style="background-color: red;" | 60
 
 
|  
 
|  
 
|-
 
|-
| 2018-05-23
+
| 2018-05-15
 
| 100
 
| 100
 
| 100
 
| 100
 +
| 99
 
| 100
 
| 100
 +
| style="background-color: orange;" | 96
 
| 100
 
| 100
| 100
 
| style="background-color: red;" | 0
 
 
|  
 
|  
 
|-
 
|-
| 2018-05-24
+
| 2018-05-16
 +
| 100
 
| 100
 
| 100
 
| 100
 
| 100
Line 333: Line 344:
 
| 100
 
| 100
 
| 100
 
| 100
| style="background-color: red;" | 0
 
 
|  
 
|  
 
|-
 
|-
| 2018-05-25
+
| 2018-05-17
 
| 100
 
| 100
 
| 100
 
| 100
 +
| 98
 
| 100
 
| 100
 
| 100
 
| 100
 
| 100
 
| 100
| style="background-color: red;" | 0
 
 
|  
 
|  
 
|-
 
|-
| 2018-05-26
+
| 2018-05-18
 
| 100
 
| 100
 
| 100
 
| 100
 +
| 98
 
| 100
 
| 100
 
| 100
 
| 100
 
| 100
 
| 100
| style="background-color: red;" | 0
 
 
|  
 
|  
 
|-
 
|-
| 2018-05-27
+
| 2018-05-19
 
| 100
 
| 100
 
| 100
 
| 100
| 99
+
| style="background-color: orange;" | 96
 +
| 100
 
| 100
 
| 100
 
| 100
 
| 100
| style="background-color: red;" | 0
 
 
|  
 
|  
 
|-
 
|-
| 2018-05-28
+
| 2018-05-20
 +
| 98
 
| 100
 
| 100
 
| 100
 
| 100
Line 369: Line 380:
 
| 100
 
| 100
 
| 100
 
| 100
| style="background-color: red;" | 0
 
 
|  
 
|  
 
|}
 
|}
Line 390: Line 400:
 
! Day !! Atlas HC !! CMS HC !! Comment
 
! Day !! Atlas HC !! CMS HC !! Comment
 
|-
 
|-
| 2018/05/15 || style="background-color: red;" | 80|| style="background-color: red;" | 82 ||   
+
| 2018/05/22 || 98|| 100 ||   
 
|-
 
|-
| 2018/05/16 || 98 || 92 ||  
+
| 2018/05/23 || 98 || 98 ||  
 
|-
 
|-
| 2018/05/17 || style="background-color: orange;" | 94 || 98 ||  
+
| 2018/05/24 || 97 || 99 ||  
 
|-
 
|-
| 2018/05/18 || 97 || 99 ||  
+
| 2018/05/25 || style="background-color: orange;" | 96 || 99 ||  
 
|-
 
|-
| 2018/05/19 || style="background-color: orange;" | 94 || 99 ||  
+
| 2018/05/26 || 98 || style="background-color: red;" | 56 ||  
 
|-
 
|-
| 2018/05/20 || 98 || - ||  
+
| 2018/05/27 || 100 || style="background-color: red;" | 60 ||  
 
|-
 
|-
| 2018/05/21 || style="background-color: red;" | 80 || - ||  
+
| 2018/05/28 || style="background-color: orange;" | 93 || 100 ||  
 
|-
 
|-
 
|}  
 
|}  

Revision as of 12:28, 30 May 2018

RAL Tier1 Operations Report for 28th May 2018

Review of Issues during the week 14th May to the 21st May 2018.
  • No incidents (major or minor) have been flagged during this reporting period.
Current operational status and issues
  • None
Resolved Castor Disk Server Issues
  • gdss732 (lhcbDst- D1T0) - Back in production after the replacement drive finished rebuilding.
Ongoing Castor Disk Server Issues
  • None
Limits on concurrent batch system jobs.
  • CMS Multicore 550
Notable Changes made since the last meeting.
  • None.
Entries in GOC DB starting since the last report.
  • None
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Castor:
    • Update systems to use SL7, configured by Quattor/Aquilon. (Tape servers done.)
    • Move to generic Castor headnodes.
  • Networking
    • Extend the number of services on the production network with IPv6 dual stack. (Done for Perfsonar, FTS3, all squids and the CVMFS Stratum-1 servers).
  • Internal
    • DNS servers will be rolled out within the Tier1 network.
  • Infrastructure
    • Testing of power distribution boards in the R89 machine room is being scheduled for late July or early August. The effect of this on our services is being discussed.
Open GGUS Tickets (Snapshot during morning of the report). The latest ticket snapshot can be found here[1].

{|
! Request id !! Affected vo !! Status !! Priority !! Date of creation !! Last update !! Type of problem !! Subject !! Scope
|-
| 135133 || cms || in progress || urgent || 15/05/2018 || 17/05/2018 || CMS_Data Transfers || Likely corrupted File at T1_UK_RAL || WLCG
|-
| 134703 || cms || in progress || urgent || 23/04/2018 || 18/05/2018 || CMS_Data Transfers || Transfer failing from RAL_Disk || WLCG
|-
| 134685 || dteam || in progress || less urgent || 23/04/2018 || 02/05/2018 || Middleware || please upgrade perfsonar host(s) at RAL-LCG2 to CentOS7 || EGI
|-
| 134468 || cms || waiting for reply || top priority || 09/04/2018 || 18/05/2018 || CMS_AAA WAN Access || Xrootd redirector not seeing some files in ECHO || WLCG
|-
| 133992 || atlas || in progress || less urgent || 12/03/2018 || 19/04/2018 || File Transfer || RAL-LCG2-ECHO: No such file or directory || EGI
|-
| 127597 || cms || on hold || urgent || 07/04/2017 || 30/04/2018 || File Transfer || Check networking and xrootd RAL-CERN performance || EGI
|-
| 124876 || ops || on hold || less urgent || 07/11/2016 || 13/11/2017 || Operations || [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk || EGI
|-
| 117683 || none || on hold || less urgent || 18/11/2015 || 09/05/2018 || Information System || CASTOR at RAL not publishing GLUE 2 || EGI
|}
GGUS Tickets Closed Last week
{|
! Request id !! Affected vo !! Status !! Priority !! Date of creation !! Last update !! Type of problem !! Subject !! Scope
|-
| 135164 || none || verified || top priority || 16/05/2018 || 16/05/2018 || Other || This TEST ALARM has been raised for testing GGUS alarm work flow after a new GGUS release. || WLCG
|-
| 134853 || cms || closed || urgent || 02/05/2018 || 16/05/2018 || CMS_Facilities || T1_UK_RAL HammerCloud failures || WLCG
|-
| 134839 || snoplus.snolab.ca || closed || urgent || 30/04/2018 || 16/05/2018 || File Transfer || Data Transfer Failure || EGI
|-
| 134737 || cms || closed || urgent || 25/04/2018 || 14/05/2018 || CMS_SAM tests || SAM CE test failing at T1_UK_RAL || WLCG
|-
| 134494 || atlas || closed || urgent || 11/04/2018 || 17/05/2018 || Storage Systems || json space reporting not updated || WLCG
|}
Availability Report
Target availability for each site is 97.0%. Red: <90%. Orange: <97%.
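{|
! Day !! Atlas !! Atlas-Echo !! CMS !! LHCB !! Alice !! OPS !! Comments
|-
| 2018-05-14 || 100 || 100 || 100 || 100 || 100 || 100 ||
|-
| 2018-05-15 || 100 || 100 || 99 || 100 || 96 || 100 ||
|-
| 2018-05-16 || 100 || 100 || 100 || 100 || 100 || 100 ||
|-
| 2018-05-17 || 100 || 100 || 98 || 100 || 100 || 100 ||
|-
| 2018-05-18 || 100 || 100 || 98 || 100 || 100 || 100 ||
|-
| 2018-05-19 || 100 || 100 || 96 || 100 || 100 || 100 ||
|-
| 2018-05-20 || 98 || 100 || 100 || 100 || 100 || 100 ||
|}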
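The colour bands quoted for the availability table (red below 90%, orange below 97%) can be written as a small function. This is only an illustrative sketch: the function name `classify` and the label `"ok"` for values at or above target are assumptions, not something taken from the Tier1 tooling.

```python
# Thresholds as stated in the report: red below 90%, orange below 97%.
RED_BELOW = 90.0
ORANGE_BELOW = 97.0


def classify(availability: float) -> str:
    """Return the colour band for a daily availability percentage.

    "ok" is a hypothetical label for values that meet the 97% target.
    """
    if availability < RED_BELOW:
        return "red"
    if availability < ORANGE_BELOW:
        return "orange"
    return "ok"


# CMS and Alice figures for 2018-05-15, copied from the availability table.
for vo, value in {"CMS": 99, "Alice": 96}.items():
    print(vo, value, classify(value))
```

A value of exactly 97 meets the target and is therefore not flagged, because both thresholds are strict "less than" comparisons.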
Hammercloud Test Report
Target availability for each site is 97.0%. Red: <90%. Orange: <97%.

Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud

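{|
! Day !! Atlas HC !! CMS HC !! Comment
|-
| 2018/05/22 || 98 || 100 ||
|-
| 2018/05/23 || 98 || 98 ||
|-
| 2018/05/24 || 97 || 99 ||
|-
| 2018/05/25 || 96 || 99 ||
|-
| 2018/05/26 || 98 || 56 ||
|-
| 2018/05/27 || 100 || 60 ||
|-
| 2018/05/28 || 93 || 100 ||
|}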
Notes from Meeting.
  • None yet