Difference between revisions of "Tier1 Operations Report 2015-12-09"

From GridPP Wiki

Revision as of 10:08, 9 December 2015

RAL Tier1 Operations Report for 9th December 2015

Review of Issues during the week 2nd to 9th December 2015.
  • We ran with a single 10Gbit link between the Tier1 core network and the UKLight router overnight Thursday-Friday 26-27 Nov. The link was saturating. The problem was fixed on the Friday morning when the second 10Gbit link was re-established.
  • Three disk servers in AtlasDataDisk crashed during last weekend. This appears to be coincidence, as this is our largest disk pool. The crashes appear similar, all originating in the CPUs. This morning these servers were briefly taken down one after the other for a BIOS update, which may improve the handling of this type of fault.
Resolved Disk Server Issues
  • None
Current operational status and issues
  • There is a problem seen by LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when these (failed) writes are attempted to storage at other sites. A recent modification has improved, but not completely fixed, this.
  • The intermittent, low-level, load-related packet loss seen over external connections is still being tracked. Likewise we have been working to understand some remaining low level of packet loss seen within a part of our Tier1 network.
Ongoing Disk Server Issues
  • GDSS687 (AtlasDataDisk - D1T0) failed on Friday (4th December) with a read-only filesystem and was taken out of production. It was returned to service in read-only mode the following day. The problem appears to have been triggered by a disk failure.
  • GDSS675 (CMSTape - D0T1) failed in the early hours of yesterday morning (8th Dec). This also reported a read-only file system.
  • GDSS620 (GenTape - D0T1) also failed during the early morning yesterday (8th Dec) - also with a read-only file system.
Notable Changes made since the last meeting.
  • Work has continued towards removing the old core network switch:
    • On Tuesday morning (8th Dec) the network link to the UKLight router was moved. As part of this the connection was increased from 2*10Gbit to 4*10Gbit. (But note one of these four has since given some problems).
    • This morning the link to the Castor headnodes was moved.
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor (All SRM endpoints) | SCHEDULED | WARNING | 09/12/2015 09:30 | 09/12/2015 10:30 | 1 hour | Warning on Castor services for short network reconfiguration.
All Castor (All SRM endpoints) | SCHEDULED | WARNING | 08/12/2015 09:00 | 08/12/2015 10:00 | 1 hour | Warning on Castor services during short network reconfiguration that affects the external data path to/from the Castor storage.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Upgrade of remaining Castor disk servers (those in tape-backed service classes) to SL6. This will be transparent to users.
  • There is one final step to be made to remove the old 'core' switch from our network. This is expected to be completed this week.
  • Roll-out of the changed algorithm for the draining of worker nodes to make space for multi-core jobs. The new version allows "pre-emptable" jobs to run in the job slots until they are needed.

Listing by category:

  • Databases:
    • Switch LFC/3D to new Database Infrastructure.
  • Castor:
    • Update SRMs to new version (includes updating to SL6).
    • Update disk servers to SL6 (ongoing).
    • Update to Castor version 2.1.15.
  • Networking:
    • Complete changes needed to remove the old core switch from the Tier1 network.
    • Make routing changes to allow the removal of the UKLight Router.
  • Fabric:
    • Firmware updates on remaining EMC disk arrays (Castor, LFC).
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor | SCHEDULED | WARNING | 09/12/2015 09:30 | 09/12/2015 10:30 | 1 hour | Warning on Castor services for short network reconfiguration.
All Castor | SCHEDULED | WARNING | 08/12/2015 09:00 | 08/12/2015 10:00 | 1 hour | Warning on Castor services during short network reconfiguration that affects the external data path to/from the Castor storage.
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
118044 | Green | Less Urgent | In Progress | 2015-11-30 | 2015-11-30 | Atlas | gLExec hammercloud jobs failing at RAL-LCG2 since October
117846 | Green | Urgent | In Progress | 2015-11-23 | 2015-11-24 | Atlas | ATLAS request- storage consistency checks
117683 | Green | Less Urgent | In Progress | 2015-11-18 | 2015-11-19 | | CASTOR at RAL not publishing GLUE 2
116866 | Yellow | Less Urgent | In Progress | 2015-10-12 | 2015-11-30 | SNO+ | snoplus support at RAL-LCG2 (pilot role)
116864 | Amber | Urgent | In Progress | 2015-10-12 | 2015-11-20 | CMS | T1_UK_RAL AAA opening and reading test failing again...
Availability Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud

Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
02/12/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
03/12/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
04/12/15 | 100 | 100 | 100 | 100 | 100 | 100 | N/A |
05/12/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
06/12/15 | 100 | 100 | 97 | 100 | 100 | 100 | 100 | Single SRM Test failure on Put: (Unable to issue PrepareToPut request to Castor)
07/12/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
08/12/15 | 100 | 100 | 100 | 100 | 96 | 100 | 100 | Single SRM Test failure on List: (No such file or directory.)