Revision as of 10:08, 9 December 2015
RAL Tier1 Operations Report for 9th December 2015
Review of Issues during the week 2nd to 9th December 2015.
- We ran with a single 10Gbit link between the Tier1 core network and the UKLight router overnight Thursday-Friday 26-27 Nov. The link was saturating. The problem was fixed during the Friday morning with the second 10Gbit link being re-established.
- Three disk servers in AtlasDataDisk crashed during last weekend. This appears to be coincidence - this is our largest disk pool. The crashes appear similar, being in the CPUs. This morning the servers were briefly taken down one after the other for a BIOS update, which may improve the handling of this type of fault.
Resolved Disk Server Issues
- None
Current operational status and issues
- There is a problem seen by LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when these (failed) writes are attempted to storage at other sites. A recent modification has improved, but not completely fixed, this.
- The intermittent, low-level, load-related packet loss seen over external connections is still being tracked. Likewise we have been working to understand some remaining low level of packet loss seen within a part of our Tier1 network.
Ongoing Disk Server Issues
- GDSS687 (AtlasDataDisk - D1T0) failed on Friday (4th December) with a read-only filesystem and was taken out of production. It was returned to service in read-only mode the following day. The problem looks to have been triggered by a disk failure.
- GDSS675 (CMSTape - D0T1) failed in the early hours of yesterday morning (8th Dec). This also reported a read-only file system.
- GDSS620 (GenTape - D0T1) also failed during the early morning yesterday (8th Dec) - also with a read-only file system.
Notable Changes made since the last meeting.
- Work has continued towards removing the old core network switch:
- On Tuesday morning (8th Dec) the network link to the UKLight router was moved. As part of this the link was increased from 2*10Gbit to 4*10Gbit. (But note one of these four has since given some problems).
- This morning the link to the Castor headnodes was moved.
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
All Castor (All SRM endpoints) | SCHEDULED | WARNING | 09/12/2015 09:30 | 09/12/2015 10:30 | 1 hour | Warning on Castor services for short network reconfiguration. |
All Castor (All SRM endpoints) | SCHEDULED | WARNING | 08/12/2015 09:00 | 08/12/2015 10:00 | 1 hour | Warning on Castor services during short network reconfiguration that affects the external data path to/from the Castor storage. |
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Upgrade of remaining Castor disk servers (those in tape-backed service classes) to SL6. This will be transparent to users.
- There is one final step to be made to remove the old 'core' switch from our network. This is expected to be completed this week.
- Roll-out of the changed algorithm for the draining of worker nodes to make space for multi-core jobs. The new version allows "pre-emptable" jobs to run in the job slots until they are needed.
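The draining change above can be illustrated with a minimal sketch. This is a hypothetical model, not the actual batch-system code: the function name and the simple slot-counting logic are assumptions. The idea is that while a node is being drained to accumulate enough free slots for a multi-core job, pre-emptable jobs backfill the idle slots instead of leaving them empty, and are evicted once the multi-core job can start.

```python
# Hypothetical sketch of the new draining policy: backfill idle slots on a
# draining worker node with pre-emptable jobs until the pending multi-core
# job has enough free slots to start. (Names and structure are illustrative.)

def schedule_on_draining_node(free_slots, multicore_need, preemptable_queue):
    """Decide what to run in the free slots of a draining node.

    free_slots       -- number of currently empty job slots on the node
    multicore_need   -- slots required by the pending multi-core job
    preemptable_queue -- candidate pre-emptable job ids, in priority order
    """
    if free_slots >= multicore_need:
        # Enough slots have drained: start the multi-core job; any running
        # pre-emptable jobs would be evicted at this point.
        return {"start_multicore": True, "backfill": []}
    # Not enough slots yet: rather than leaving them idle (the old
    # behaviour), fill them with pre-emptable jobs.
    backfill = preemptable_queue[:free_slots]
    return {"start_multicore": False, "backfill": backfill}
```

The gain over the old scheme is that slots already emptied by draining still do useful work, at the cost of the pre-emptable jobs being killed when the multi-core job is dispatched.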
Listing by category:
- Databases:
- Switch LFC/3D to new Database Infrastructure.
- Castor:
- Update SRMs to new version (includes updating to SL6).
- Update disk servers to SL6 (ongoing)
- Update to Castor version 2.1.15.
- Networking:
- Complete changes needed to remove the old core switch from the Tier1 network.
- Make routing changes to allow the removal of the UKLight Router.
- Fabric
- Firmware updates on remaining EMC disk arrays (Castor, LFC)
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
All Castor | SCHEDULED | WARNING | 09/12/2015 09:30 | 09/12/2015 10:30 | 1 hour | Warning on Castor services for short network reconfiguration. |
All Castor | SCHEDULED | WARNING | 08/12/2015 09:00 | 08/12/2015 10:00 | 1 hour | Warning on Castor services during short network reconfiguration that affects the external data path to/from the Castor storage. |
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
118044 | Green | Less Urgent | In Progress | 2015-11-30 | 2015-11-30 | Atlas | gLExec hammercloud jobs failing at RAL-LCG2 since October |
117846 | Green | Urgent | In Progress | 2015-11-23 | 2015-11-24 | Atlas | ATLAS request- storage consistency checks |
117683 | Green | Less Urgent | In Progress | 2015-11-18 | 2015-11-19 | | CASTOR at RAL not publishing GLUE 2 |
116866 | Yellow | Less Urgent | In Progress | 2015-10-12 | 2015-11-30 | SNO+ | snoplus support at RAL-LCG2 (pilot role) |
116864 | Amber | Urgent | In Progress | 2015-10-12 | 2015-11-20 | CMS | T1_UK_RAL AAA opening and reading test failing again... |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
02/12/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
03/12/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
04/12/15 | 100 | 100 | 100 | 100 | 100 | 100 | N/A | |
05/12/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
06/12/15 | 100 | 100 | 97 | 100 | 100 | 100 | 100 | Single SRM Test failure on Put: (Unable to issue PrepareToPut request to Castor) |
07/12/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
08/12/15 | 100 | 100 | 100 | 100 | 96 | 100 | 100 | Single SRM Test failure on List: (No such file or directory.) |