Difference between revisions of "Tier1 Operations Report 2018-10-23"

From GridPP Wiki

Revision as of 09:41, 23 October 2018

RAL Tier1 Operations Report for 23rd October 2018

Review of Issues during the week 15th October 2018 to the 23rd October 2018.
  • The batch farm is currently (23/10/18) running at reduced capacity (~20%) to allow critical kernel patching and/or rebooting of WNs. This is expected to be completed by midday.
  • No other major issues of note to report for this week.
Current operational status and issues
  • Advance notification of downtime for 25% of our ARC CEs. We will be draining them and migrating to a new virtualized platform.
Resolved Castor Disk Server Issues
Machine | VO | DiskPool | dxtx | Comments
gdss747 | Atlas | atlasStripInput | d1t0 | Currently in intervention.
Ongoing Castor Disk Server Issues
Machine | VO | DiskPool | dxtx | Comments
gdss743 | Atlas | ATLASDATADISK,ATLASMCDISK,ATLASGROUPDISK | d1t0 | Awaiting decommissioning.
gdss744 | Atlas | ATLASDATADISK,ATLASMCDISK,ATLASGROUPDISK | d1t0 | Awaiting decommissioning.
gdss746 | Atlas | ATLASDATADISK,ATLASMCDISK,ATLASGROUPDISK | d1t0 | Awaiting decommissioning.
gdss748 | Atlas | ATLASDATADISK,ATLASMCDISK,ATLASGROUPDISK | d1t0 | Awaiting decommissioning.
gdss749 | Atlas | ATLASDATADISK,ATLASMCDISK,ATLASGROUPDISK | d1t0 | Awaiting decommissioning.
gdss751 | Atlas | ATLASDATADISK,ATLASMCDISK,ATLASGROUPDISK | d1t0 | Awaiting decommissioning.
gdss754 | Atlas | ATLASDATADISK,ATLASMCDISK,ATLASGROUPDISK | d1t0 | Awaiting decommissioning.
gdss764 | Atlas | ATLASDATADISK,ATLASMCDISK,ATLASGROUPDISK | d1t0 | Awaiting decommissioning.
gdss767 | Atlas | ATLASDATADISK,ATLASMCDISK,ATLASGROUPDISK | d1t0 | Awaiting decommissioning.
gdss768 | Atlas | ATLASDATADISK,ATLASMCDISK,ATLASGROUPDISK | d1t0 | Awaiting decommissioning.
gdss769 | Atlas | ATLASDATADISK,ATLASMCDISK,ATLASGROUPDISK | d1t0 | Awaiting decommissioning.
gdss770 | Atlas | ATLASDATADISK,ATLASMCDISK,ATLASGROUPDISK | d1t0 | Awaiting decommissioning.
gdss781 | Atlas | ATLASDATADISK,ATLASMCDISK,ATLASGROUPDISK | d1t0 | Awaiting decommissioning.
gdss782 | Atlas | ATLASDATADISK,ATLASMCDISK,ATLASGROUPDISK | d1t0 | Awaiting decommissioning.
gdss782 | LHCb | LHCb_FAILOVER,LHCb-Disk | d1t0 | Currently in intervention.


Limits on concurrent batch system jobs.
  • None currently enforced.
Notable Changes made since the last meeting.
  • None.
Entries in GOC DB starting since the last report.
Service ID Scheduled? Outage/At Risk Start End Duration Reason
- - - - - - - -
Declared in the GOC DB
Service ID Scheduled? Outage/At Risk Start End Duration Reason
- - - - - - - -
  • No ongoing downtime
  • No downtime scheduled in the GOCDB for next 2 weeks
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Castor:
    • Disk-only storage in Castor will end. A new single tape-only Castor instance (called WLCGTape) is being tested. This uses generic Castor headnodes on SL7, configured by Quattor/Aquilon, with a slightly newer Castor version.
    • Update systems to SL7, configured by Quattor/Aquilon. (Tape servers done.)
  • Internal:
    • DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot taken during morning of the meeting).

Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope
137881 | lhcb | in progress | urgent | 22/10/2018 | 23/10/2018 | Other | Low level of running jobs at RAL-LCG2 | WLCG
137822 | lhcb | in progress | top priority | 18/10/2018 | 22/10/2018 | File Transfer | FTS server seems in bad state | WLCG
137752 | other | in progress | less urgent | 15/10/2018 | 19/10/2018 | VO Specific Software | Replicate OSG CVMFS repositories to EGI stratum 1s | EGI
137650 | cms | waiting for reply | urgent | 09/10/2018 | 22/10/2018 | CMS_AAA WAN Access | Low HC xrootd success rates at T1_UK_RAL | WLCG
137195 | ops | in progress | less urgent | 14/09/2018 | 15/10/2018 | Operations | [Rod Dashboard] Issues detected at RAL-LCG2 | EGI
137153 | t2k.org | in progress | urgent | 12/09/2018 | 10/10/2018 | Data Management - generic | LFC entry has file size 0, preventsw registering of additional replicas | EGI
136701 | lhcb | in progress | very urgent | 14/08/2018 | 17/10/2018 | File Transfer | background of transfer errors | WLCG
136199 | lhcb | in progress | very urgent | 18/07/2018 | 17/10/2018 | File Transfer | Lots of submitted transfers on RAL FTS | WLCG
124876 | ops | waiting for reply | less urgent | 07/11/2016 | 22/10/2018 | Operations | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk | EGI
GGUS Tickets Closed Last week
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope
137792 | cms | solved | urgent | 17/10/2018 | 17/10/2018 | CMS_SAM tests | Site readiness error and SAM3 SRM critical for T1_UK_RAL | WLCG
137791 | atlas | solved | urgent | 16/10/2018 | 17/10/2018 | File Transfer | RAL-LCG2-ECHO: TRANSFER globus_ftp_control: The certificate has been revoked | WLCG
137788 | cms | solved | urgent | 16/10/2018 | 17/10/2018 | CMS_Facilities | T1_UK_RAL SRM tests failing | WLCG
137723 | cms | solved | urgent | 14/10/2018 | 15/10/2018 | CMS_Data Transfers | PhEDEx component Agent Watchdog donw at T1_UK-RAL | WLCG
137699 | dteam | verified | top priority | 11/10/2018 | 15/10/2018 | Monitoring | Test of RAL-LCG2 Alarm Ticket Handling | WLCG
137619 | cms | closed | urgent | 07/10/2018 | 22/10/2018 | CMS_AAA WAN Access | T1_UK_RAL xrootd read failures | WLCG
137565 | atlas | closed | less urgent | 03/10/2018 | 19/10/2018 | Other | failing handshake for transfers from CA-VICTORIA-WESTGRID-T2_DATADISK to UK RAL-LCG2-ECHO | WLCG
137498 | cms | closed | urgent | 01/10/2018 | 22/10/2018 | CMS_AAA WAN Access | Xrootd FileOpenErrors in production jobs | WLCG
137398 | cms | closed | urgent | 26/09/2018 | 17/10/2018 | CMS_Data Transfers | Transfers failing from SPRACE to RAL - No data available | WLCG
136840 | snoplus.snolab.ca | closed | very urgent | 23/08/2018 | 17/10/2018 | Other | Cannot upload files to LFN from Storage node | EGI

Availability Report

Target availability for each site is 97.0%. Red: <90%. Orange: <97%.
Day Atlas Atlas-Echo CMS LHCB Alice OPS Comments
2018-10-15 100 100 100 100 100 100
2018-10-16 100 100 73 100 100 100
2018-10-17 100 100 65 100 100 100
2018-10-18 100 100 99 100 100 100
2018-10-19 100 100 100 100 100 100
2018-10-20 100 100 100 100 100 100
2018-10-21 100 100 100 100 100 100
2018-10-22 100 100 99 100 100 100
2018-10-23 100 100 100 100 100 100
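
The colour thresholds stated for this table can be applied mechanically. The following is a minimal sketch, not part of the report tooling: the function name `classify` and the label "ok" for values at or above target are assumptions made for illustration. The input is the CMS column from the availability table in this report.

```python
# Thresholds from the report: target 97.0%, Red < 90%, Orange < 97%.
RED_BELOW = 90.0
ORANGE_BELOW = 97.0


def classify(availability):
    """Return the colour band for a daily availability percentage.

    "ok" is a hypothetical label for values meeting the 97% target;
    the report itself only names the red and orange bands.
    """
    if availability < RED_BELOW:
        return "red"
    if availability < ORANGE_BELOW:
        return "orange"
    return "ok"


# CMS availability per day, copied from the availability table.
cms = {
    "2018-10-15": 100, "2018-10-16": 73, "2018-10-17": 65,
    "2018-10-18": 99, "2018-10-19": 100, "2018-10-20": 100,
    "2018-10-21": 100, "2018-10-22": 99, "2018-10-23": 100,
}

# Keep only the days that fall below target.
flagged = {day: classify(value) for day, value in cms.items()
           if classify(value) != "ok"}
print(flagged)
```

Under these thresholds only 2018-10-16 (73) and 2018-10-17 (65) are flagged, both as red; the two days at 99 remain above target.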
Hammercloud Test Report
Target availability for each site is 97.0%. Red: <90%. Orange: <97%.
Day Atlas HC CMS HC Comment
2018-10-15 100 98
2018-10-16 87 99
2018-10-17 73 98
2018-10-18 100 100
2018-10-19 100 98
2018-10-20 100 99
2018-10-21 100 99
2018-10-22 100 99

Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud

Notes from Meeting.