Difference between revisions of "Tier1 Operations Report 2018-10-29"

From GridPP Wiki
 
(10 intermediate revisions by one user not shown)
Line 13: Line 13:
 
* Availability of the CMS-AAA service is improving.  Machines have had their memory doubled, which has stopped the machines going in to swap.  The other problem is a bug which is fixed in a later version of XRootD than we are running.  We will need to recompile XRootD, create an RPM and deploy as appropriate.
 
* Availability of the CMS-AAA service is improving.  Machines have had their memory doubled, which has stopped the machines going in to swap.  The other problem is a bug which is fixed in a later version of XRootD than we are running.  We will need to recompile XRootD, create an RPM and deploy as appropriate.
 
* Patching of machines against CVE-2018-14634 is in progress.  The WN which were the most at risk of being exploited by this vulnerability were done on the 22nd - 23rd October.  We still need to reboot around 500 other machines (total 1500).
 
* Patching of machines against CVE-2018-14634 is in progress.  The WN which were the most at risk of being exploited by this vulnerability were done on the 22nd - 23rd October.  We still need to reboot around 500 other machines (total 1500).
* A Ceph storage node developed a hardware fault and was removed from production on Saturday 27th morning.  No degradation of service / unavailability of data was reported (as expected).
 
 
* OPs tests are finally passing for Echo solving a ticket that was open for nearly 2 years!
 
* OPs tests are finally passing for Echo solving a ticket that was open for nearly 2 years!
 
<!-- ***********End Review of Issues during last week*********** ----->
 
<!-- ***********End Review of Issues during last week*********** ----->
Line 25: Line 24:
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Current operational status and issues
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Current operational status and issues
 
|}
 
|}
* Advance notification of downtime for 25% of our acr-ce's.  We will be draining and migrating to new virtualized platform.
+
* NTR
 
<!-- ***********End Current operational status and issues*********** ----->
 
<!-- ***********End Current operational status and issues*********** ----->
 
<!-- *************************************************************** ----->
 
<!-- *************************************************************** ----->
Line 150: Line 149:
 
! Reason
 
! Reason
 
|-
 
|-
| arc-ce02
+
| srm-atlas
| 26208
+
| 26250
 
| Yes
 
| Yes
 
| Outage
 
| Outage
| 25/10/18 10:00
+
| 31/10/2018 09:00
| 29/10/18 15:00
+
| 31/10/2018 12:00
 
| -
 
| -
| Migration to new virtualized platform.  
+
| Migration of ATLAS to wlcgTape.  
 
|}
 
|}
 
* <s>No ongoing downtime</s>
 
* <s>No ongoing downtime</s>
Line 203: Line 202:
 
! Subject
 
! Subject
 
! Scope
 
! Scope
 +
|-
 +
| style="background-color: green;" | 138002
 +
| cms
 +
| in progress
 +
| top priority
 +
| 30/10/2018
 +
| 30/10/2018
 +
| CMS_Data Transfers
 +
| Issues with RAL FTS
 +
| WLCG
 +
|-
 +
| style="background-color: green;" | 137994
 +
| cms
 +
| in progress
 +
| urgent
 +
| 30/10/2018
 +
| 30/10/2018
 +
| CMS_Data Transfers
 +
| Transfers failing between RAL and T1_FR_CCIN2P3_Disk
 +
| WLCG
 +
|-
 +
| style="background-color: green;" | 137942
 +
| cms
 +
| in progress
 +
| urgent
 +
| 25/10/2018
 +
| 31/10/2018
 +
| CMS_Data Transfers
 +
| Failing transfers via IPv6 between T1_UK_RAL and T1_DE_KIT
 +
| WLCG
 
|-
 
|-
 
| style="background-color: green;" | 137897
 
| style="background-color: green;" | 137897
Line 214: Line 243:
 
| EGI
 
| EGI
 
|-
 
|-
| style="background-color: green;" | 137822
+
| style="background-color: yellow;" | 137822
 
| lhcb
 
| lhcb
 
| in progress
 
| in progress
Line 229: Line 258:
 
| less urgent
 
| less urgent
 
| 15/10/2018
 
| 15/10/2018
| 19/10/2018
+
| 29/10/2018
 
| VO Specific Software
 
| VO Specific Software
 
| Replicate OSG CVMFS repositories to EGI stratum 1s
 
| Replicate OSG CVMFS repositories to EGI stratum 1s
 
| EGI
 
| EGI
 
|-
 
|-
| style="background-color: green;" | 137650
+
| style="background-color: yellow;" | 137650
 
| cms
 
| cms
| waiting for reply
+
| in progress
| urgent
+
| very urgent
 
| 09/10/2018
 
| 09/10/2018
| 24/10/2018
+
| 31/10/2018
 
| CMS_AAA WAN Access
 
| CMS_AAA WAN Access
 
| Low HC xrootd success rates at T1_UK_RAL
 
| Low HC xrootd success rates at T1_UK_RAL
Line 298: Line 327:
 
! Scope
 
! Scope
 
|-
 
|-
| 137792
+
| 137881
| cms
+
| lhcb
| solved
+
| verified
 
| urgent
 
| urgent
| 17/10/2018
+
| 22/10/2018
| 17/10/2018
+
| 24/10/2018
| CMS_SAM tests
+
| Other
| Site readiness error and SAM3 SRM critical for T1_UK_RAL
+
| Low level of running jobs at RAL-LCG2
| WLCG
+
|-
+
| 137791
+
| atlas
+
| solved
+
| urgent
+
| 16/10/2018
+
| 17/10/2018
+
| File Transfer
+
| RAL-LCG2-ECHO: TRANSFER globus_ftp_control: The certificate has been revoked
+
| WLCG
+
|-
+
| 137788
+
| cms
+
| solved
+
| urgent
+
| 16/10/2018
+
| 17/10/2018
+
| CMS_Facilities
+
| T1_UK_RAL SRM tests failing
+
 
| WLCG
 
| WLCG
 
|-
 
|-
 
| 137723
 
| 137723
 
| cms
 
| cms
| solved
+
| closed
 
| urgent
 
| urgent
 
| 14/10/2018
 
| 14/10/2018
| 15/10/2018
+
| 29/10/2018
 
| CMS_Data Transfers
 
| CMS_Data Transfers
 
| PhEDEx component Agent Watchdog donw at T1_UK-RAL
 
| PhEDEx component Agent Watchdog donw at T1_UK-RAL
 
| WLCG
 
| WLCG
 
|-
 
|-
| 137699
+
| style="background-color: green;" | 137634
| dteam
+
| verified
+
| top priority
+
| 11/10/2018
+
| 15/10/2018
+
| Monitoring
+
| Test of RAL-LCG2 Alarm Ticket Handling
+
| WLCG
+
|-
+
| 137619
+
 
| cms
 
| cms
 
| closed
 
| closed
 
| urgent
 
| urgent
| 07/10/2018
+
| 08/10/2018
| 22/10/2018
+
| 24/10/2018
| CMS_AAA WAN Access
+
| CMS_Data Transfers
| T1_UK_RAL xrootd read failures
+
| Transfers failing from T1_UK_RAL_Disk to METU
 
| WLCG
 
| WLCG
 
|-
 
|-
| 137565
+
| style="background-color: green;" | 137391
 
| atlas
 
| atlas
| closed
 
| less urgent
 
| 03/10/2018
 
| 19/10/2018
 
| Other
 
| failing handshake for transfers from CA-VICTORIA-WESTGRID-T2_DATADISK to UK RAL-LCG2-ECHO
 
| WLCG
 
|-
 
| 137498
 
| cms
 
 
| closed
 
| closed
 
| urgent
 
| urgent
| 01/10/2018
+
| 25/09/2018
| 22/10/2018
+
| 24/10/2018
| CMS_AAA WAN Access
+
| Network problem
| Xrootd FileOpenErrors in production jobs
+
| UK RAL-LCG2 transfer errors with Communication error on send
 
| WLCG
 
| WLCG
 
|-
 
|-
| 137398
+
| style="background-color: green;" | 137195
| cms
+
| ops
| closed
+
| verified
| urgent
+
| less urgent
| 26/09/2018
+
| 14/09/2018
| 17/10/2018
+
| 28/10/2018
| CMS_Data Transfers
+
| Operations
| Transfers failing from SPRACE to RAL - No data available
+
| [Rod Dashboard] Issues detected at RAL-LCG2
| WLCG
+
| EGI
 
|-
 
|-
| 136840
+
| style="background-color: red;" | 124876
| snoplus.snolab.ca
+
| ops
| closed
+
| solved
| very urgent
+
| less urgent
| 23/08/2018
+
| 07/11/2016
| 17/10/2018
+
| 24/10/2018
| Other
+
| Operations
| Cannot upload files to LFN from Storage node
+
| [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk
 
| EGI
 
| EGI
 
|}
 
|}
Line 427: Line 416:
 
! Comments
 
! Comments
 
|-
 
|-
| 2018-10-15
+
| 2018-10-22
| 100
+
 
| 100
 
| 100
 
| 100
 
| 100
 +
| 99
 
| 100
 
| 100
 
| 100
 
| 100
Line 436: Line 425:
 
|  
 
|  
 
|-
 
|-
| 2018-10-16
+
| 2018-10-23
 
| 100
 
| 100
 
| 100
 
| 100
| style="background-color: red;" | 73
+
| 99
 
| 100
 
| 100
 
| 100
 
| 100
Line 445: Line 434:
 
|  
 
|  
 
|-
 
|-
| 2018-10-17
+
| 2018-10-24
 
| 100
 
| 100
 
| 100
 
| 100
| style="background-color: red;" | 65
+
| 98
 
| 100
 
| 100
 
| 100
 
| 100
Line 454: Line 443:
 
|  
 
|  
 
|-
 
|-
| 2018-10-18
+
| 2018-10-25
 
| 100
 
| 100
 
| 100
 
| 100
| 99
+
| style="background-color: orange;" | 95
 
| 100
 
| 100
 
| 100
 
| 100
Line 463: Line 452:
 
|  
 
|  
 
|-
 
|-
| 2018-10-19
+
| 2018-10-26
| 100
+
 
| 100
 
| 100
 
| 100
 
| 100
 
| 100
 
| 100
 
| 100
 
| 100
 +
| style="background-color: orange;" | 93
 
| 100
 
| 100
 
|  
 
|  
 
|-
 
|-
| 2018-10-20
+
| 2018-10-27
 
| 100
 
| 100
 
| 100
 
| 100
Line 481: Line 470:
 
|  
 
|  
 
|-
 
|-
| 2018-10-21
+
| 2018-10-28
 
| 100
 
| 100
 
| 100
 
| 100
Line 490: Line 479:
 
|  
 
|  
 
|-
 
|-
| 2018-10-22
+
| 2018-10-29
 
| 100
 
| 100
 
| 100
 
| 100
 
| 99
 
| 99
 
| 100
 
| 100
| 100
+
| style="background-color: red;" | 55
 
| 100
 
| 100
 
|  
 
|  
 
|-
 
|-
| 2018-10-23
+
| 2018-10-30
| 100
+
| 100
+
| 100
+
 
| 100
 
| 100
 
| 100
 
| 100
 +
| 99
 
| 100
 
| 100
 +
| style="background-color: red;" | 66
 
|  
 
|  
 
|}
 
|}
Line 525: Line 513:
 
! Day !! Atlas HC !! CMS HC !! Comment
 
! Day !! Atlas HC !! CMS HC !! Comment
 
|-
 
|-
| 2018-10-15 || 100 || 98 ||   
+
| 2018-10-22 || 100 || 98 ||   
 +
|-
 +
| 2018-10-23 || 100 || style="background-color: orange;" | 97 ||
 
|-
 
|-
| 2018-10-16 || style="background-color: red;" | 87 || 99 ||  
+
| 2018-10-24 || 100 || 99 ||  
 
|-
 
|-
| 2018-10-17 || style="background-color: red;" | 73 || 98 ||  
+
| 2018-10-25 || 100 || 99 ||  
 
|-
 
|-
| 2018-10-18 || 100 || 100 ||  
+
| 2018-10-26 || 100 || 99 ||  
 
|-
 
|-
| 2018-10-19 || 100 || 98 ||  
+
| 2018-10-27 || 100 || 99 ||  
 
|-
 
|-
| 2018-10-20 || 100 || 99 ||  
+
| 2018-10-28 || 98 || 98 ||  
 
|-
 
|-
| 2018-10-21 || 100 || 99 ||  
+
| 2018-10-29 || 100 || 98 ||  
 
|-
 
|-
| 2018-10-22 || 100 || 99 ||  
+
| 2018-10-30 || - || - ||  
 
|-
 
|-
 
|}  
 
|}  

Latest revision as of 12:39, 31 October 2018

RAL Tier1 Operations Report for 30th October 2018

Review of Issues during the week 29th October 2018 to the 30th October 2018.
  • Availability of the CMS-AAA service is improving. The machines have had their memory doubled, which has stopped them going into swap. The other problem is a bug that is fixed in a later version of XRootD than the one we are running. We will need to recompile XRootD, create an RPM and deploy as appropriate.
  • Patching of machines against CVE-2018-14634 is in progress. The WNs, which were the most at risk of being exploited by this vulnerability, were done on the 22nd - 23rd October. We still need to reboot around 500 other machines (total 1500).
  • OPS tests are finally passing for Echo, solving a ticket that was open for nearly 2 years!
Current operational status and issues
  • NTR
Resolved Castor Disk Server Issues
Machine VO DiskPool dxtx Comments
- - - - -
Ongoing Castor Disk Server Issues
Machine VO DiskPool dxtx Comments
- - - - -


Limits on concurrent batch system jobs.
  • None currently enforced.
Notable Changes made since the last meeting.
  • None.
Entries in GOC DB starting since the last report.
Service ID Scheduled? Outage/At Risk Start End Duration Reason
- - - - - - - -
Declared in the GOC DB
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-atlas | 26250 | Yes | Outage | 31/10/2018 09:00 | 31/10/2018 12:00 | - | Migration of ATLAS to wlcgTape.
  • No ongoing downtime
  • No downtime scheduled in the GOCDB for next 2 weeks
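
The Duration column for the srm-atlas outage is left as "-". As a minimal sketch, assuming the Start and End values follow the day/month/year hour:minute layout shown in the table, the length of a declared downtime could be worked out as follows. The function name `downtime_hours` and the format constant are illustrative and are not part of the GOC DB.

```python
from datetime import datetime

# Assumed layout of the Start/End columns, e.g. "31/10/2018 09:00".
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_hours(start, end, fmt=GOCDB_FORMAT):
    """Return the length of a downtime in hours (hypothetical helper)."""
    begin = datetime.strptime(start, fmt)
    finish = datetime.strptime(end, fmt)
    return (finish - begin).total_seconds() / 3600.0


# The srm-atlas outage declared for the wlcgTape migration.
print(downtime_hours("31/10/2018 09:00", "31/10/2018 12:00"))  # 3.0
```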
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Castor:
    • Disk-only storage will end in Castor. A new single tape-only Castor instance (called WLCGTape) is being tested. It uses generic Castor headnodes on SL7, configured by Quattor/Aquilon, with a slightly newer Castor version.
    • Update systems to use SL7, configured by Quattor/Aquilon. (Tape servers done)
  • Internal
    • DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot taken during morning of the meeting).

Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope
138002 | cms | in progress | top priority | 30/10/2018 | 30/10/2018 | CMS_Data Transfers | Issues with RAL FTS | WLCG
137994 | cms | in progress | urgent | 30/10/2018 | 30/10/2018 | CMS_Data Transfers | Transfers failing between RAL and T1_FR_CCIN2P3_Disk | WLCG
137942 | cms | in progress | urgent | 25/10/2018 | 31/10/2018 | CMS_Data Transfers | Failing transfers via IPv6 between T1_UK_RAL and T1_DE_KIT | WLCG
137897 | enmr.eu | in progress | urgent | 23/10/2018 | 24/10/2018 | Accounting | enmr.eu accounting at RAL | EGI
137822 | lhcb | in progress | top priority | 18/10/2018 | 22/10/2018 | File Transfer | FTS server seems in bad state | WLCG
137752 | other | in progress | less urgent | 15/10/2018 | 29/10/2018 | VO Specific Software | Replicate OSG CVMFS repositories to EGI stratum 1s | EGI
137650 | cms | in progress | very urgent | 09/10/2018 | 31/10/2018 | CMS_AAA WAN Access | Low HC xrootd success rates at T1_UK_RAL | WLCG
137153 | t2k.org | in progress | urgent | 12/09/2018 | 10/10/2018 | Data Management - generic | LFC entry has file size 0, preventsw registering of additional replicas | EGI
136701 | lhcb | in progress | very urgent | 14/08/2018 | 17/10/2018 | File Transfer | background of transfer errors | WLCG
136199 | lhcb | in progress | very urgent | 18/07/2018 | 17/10/2018 | File Transfer | Lots of submitted transfers on RAL FTS | WLCG
GGUS Tickets Closed Last week
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope
137881 | lhcb | verified | urgent | 22/10/2018 | 24/10/2018 | Other | Low level of running jobs at RAL-LCG2 | WLCG
137723 | cms | closed | urgent | 14/10/2018 | 29/10/2018 | CMS_Data Transfers | PhEDEx component Agent Watchdog donw at T1_UK-RAL | WLCG
137634 | cms | closed | urgent | 08/10/2018 | 24/10/2018 | CMS_Data Transfers | Transfers failing from T1_UK_RAL_Disk to METU | WLCG
137391 | atlas | closed | urgent | 25/09/2018 | 24/10/2018 | Network problem | UK RAL-LCG2 transfer errors with Communication error on send | WLCG
137195 | ops | verified | less urgent | 14/09/2018 | 28/10/2018 | Operations | [Rod Dashboard] Issues detected at RAL-LCG2 | EGI
124876 | ops | solved | less urgent | 07/11/2016 | 24/10/2018 | Operations | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk | EGI

Availability Report

Target availability for each site is 97.0%. Red: <90%. Orange: <97%.
Day Atlas Atlas-Echo CMS LHCB Alice OPS Comments
2018-10-22 100 100 99 100 100 100
2018-10-23 100 100 99 100 100 100
2018-10-24 100 100 98 100 100 100
2018-10-25 100 100 95 100 100 100
2018-10-26 100 100 100 100 93 100
2018-10-27 100 100 100 100 100 100
2018-10-28 100 100 100 100 100 100
2018-10-29 100 100 99 100 55 100
2018-10-30 100 100 99 100 66
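
The colour legend for this table can be expressed as a small classification rule. The following is a minimal sketch, assuming the legend is read as strict less-than comparisons (Red below 90, Orange below 97, otherwise no highlight). The function name `availability_colour` and the label "none" are illustrative and do not come from the wiki or any monitoring tool.

```python
def availability_colour(percent, red_below=90.0, orange_below=97.0):
    """Classify a daily availability figure using the report's legend.

    Assumption: thresholds are strict, so exactly 97 is not highlighted.
    """
    if percent < red_below:
        return "red"
    if percent < orange_below:
        return "orange"
    return "none"


# Alice column values taken from the table above.
alice = {"2018-10-26": 93, "2018-10-29": 55, "2018-10-30": 66}
for day, value in alice.items():
    print(day, value, availability_colour(value))
```

Under this reading, the Alice figures for 29th and 30th October fall in the red band and the figure for 26th October falls in the orange band.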
HammerCloud Test Report
Target availability for each site is 97.0%. Red: <90%. Orange: <97%.
Day Atlas HC CMS HC Comment
2018-10-22 100 98
2018-10-23 100 97
2018-10-24 100 99
2018-10-25 100 99
2018-10-26 100 99
2018-10-27 100 99
2018-10-28 98 98
2018-10-29 100 98
2018-10-30 - -

Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud

Notes from Meeting.