RAL Tier1 Operations Report for 9th December 2015
Review of Issues during the week 2nd to 9th December 2015.
- We ran with a single 10Gbit link between the Tier1 core network and the UKLight router overnight Thursday-Friday 26-27 Nov. The link was saturating. The problem was fixed on the Friday morning when the second 10Gbit link was re-established.
- Three disk servers in AtlasDataDisk crashed over last weekend. This appears to be coincidence, as this is our largest disk pool. The crashes appear similar, all being in the CPUs. This morning these servers were briefly taken down one after the other for a BIOS update, which may improve the handling of this type of fault.
Resolved Disk Server Issues
Current operational status and issues
- LHCb sees a low but persistent rate of failure when copying the results of batch jobs to Castor. A further problem sometimes occurs when these (failed) writes are then attempted to storage at other sites. A recent modification has improved, but not completely fixed, this.
- The intermittent, low-level, load-related packet loss seen over external connections is still being tracked. Likewise we have been working to understand some remaining low level of packet loss seen within a part of our Tier1 network.
Ongoing Disk Server Issues
- GDSS687 (AtlasDataDisk - D1T0) failed on Friday (4th December) with a read-only filesystem and was taken out of production. It was returned to service in read-only mode the following day. The problem appears to have been triggered by a disk failure.
- GDSS675 (CMSTape - D0T1) failed in the early hours of yesterday morning (8th Dec). This also reported a read-only file system.
- GDSS620 (GenTape - D0T1) also failed during the early morning yesterday (8th Dec) - also with a read-only file system.
Notable Changes made since the last meeting.
- Work has continued towards removing the old core network switch:
  - On Tuesday morning (8th Dec) the network link to the UKLight router was moved. As part of this the connection was increased from 2*10Gbit to 4*10Gbit. (Note that one of these four links has since given some problems.)
  - This morning the link to the Castor headnodes was moved.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor (All SRM endpoints) | SCHEDULED | WARNING | 09/12/2015 09:30 | 09/12/2015 10:30 | 1 hour | Warning on Castor services for short network reconfiguration.
All Castor (All SRM endpoints) | SCHEDULED | WARNING | 08/12/2015 09:00 | 08/12/2015 10:00 | 1 hour | Warning on Castor services during short network reconfiguration that affects the external data path to/from the Castor storage.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Upgrade of remaining Castor disk servers (those in tape-backed service classes) to SL6. This will be transparent to users.
- There is one final step to be made to remove the old 'core' switch from our network. This is expected to be completed this week.
- Roll-out of the changed algorithm for the draining of worker nodes to make space for multi-core jobs. The new version allows "pre-emptable" jobs to run in the job slots until they are needed.
Listing by category:
- Databases:
  - Switch LFC/3D to new Database Infrastructure.
- Castor:
  - Update SRMs to new version (includes updating to SL6).
  - Update disk servers to SL6 (ongoing).
  - Update to Castor version 2.1.15.
- Networking:
  - Complete changes needed to remove the old core switch from the Tier1 network.
  - Make routing changes to allow the removal of the UKLight Router.
- Fabric:
  - Firmware updates on remaining EMC disk arrays (Castor, LFC).
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor | SCHEDULED | WARNING | 09/12/2015 09:30 | 09/12/2015 10:30 | 1 hour | Warning on Castor services for short network reconfiguration.
All Castor | SCHEDULED | WARNING | 08/12/2015 09:00 | 08/12/2015 10:00 | 1 hour | Warning on Castor services during short network reconfiguration that affects the external data path to/from the Castor storage.
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
118044 | Green | Less Urgent | In Progress | 2015-11-30 | 2015-11-30 | Atlas | gLExec hammercloud jobs failing at RAL-LCG2 since October
117846 | Green | Urgent | In Progress | 2015-11-23 | 2015-11-24 | Atlas | ATLAS request- storage consistency checks
117683 | Green | Less Urgent | In Progress | 2015-11-18 | 2015-11-19 | | CASTOR at RAL not publishing GLUE 2
116866 | Yellow | Less Urgent | In Progress | 2015-10-12 | 2015-11-30 | SNO+ | snoplus support at RAL-LCG2 (pilot role)
116864 | Amber | Urgent | In Progress | 2015-10-12 | 2015-11-20 | CMS | T1_UK_RAL AAA opening and reading test failing again...
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
02/12/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
03/12/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
04/12/15 | 100 | 100 | 100 | 100 | 100 | 100 | N/A |
05/12/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
06/12/15 | 100 | 100 | 97 | 100 | 100 | 100 | 100 | Single SRM Test failure on Put: (Unable to issue PrepareToPut request to Castor)
07/12/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
08/12/15 | 100 | 100 | 100 | 100 | 96 | 100 | 100 | Single SRM Test failure on List: (No such file or directory.)