RAL Tier1 Operations Report for 19th December 2012
Review of Issues during the week 12th to 19th December 2012
- On Tuesday (18th December) there was a failure of one of the site routers that took the entire Tier1 off-air at 06:45. The router was fixed and the configuration verified around 3 hours later. Following this there was a period of verifying various systems and connections within the Tier1, and the outage was ended in the GOC DB at 10:45. Some problems were reported with the batch system after this, and these were finally resolved around 15:00.
Resolved Disk Server Issues
- GDSS443 (AtlasDataDisk - D1T0) failed with a read-only filesystem on Thursday 13th Dec. It was returned to production the next day. One disk was found to be faulty.
- GDSS449 (AtlasDataDisk - D1T0) failed with a read-only filesystem on Sunday 16th Dec. It was returned to production the next day.
Current operational status and issues
- We have seen an increasing rate of failures on one of the '08 batches of disk servers. A program of upgrading the disk controller firmware in this batch is under way.
- The batch server process sometimes consumes excessive memory, which is normally triggered by a network/communication problem with worker nodes. A test for this (with a re-starter) is in place.
- On 12th/13th June the first stage of switching in preparation for the work on the main site power supply took place. One half of the new switchboard has been refurbished and was brought into service on 17th September. The work on the second half is over-running slightly, with an estimated completion date of 13th January (the original date was 18th Dec).
- High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3PB of storage).
- Investigations are ongoing into asymmetric bandwidth to/from other sites. In particular, we are seeing some poor outbound rates, a problem which disappears when we are not loading the network.
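The memory test with a re-starter mentioned above for the batch server process could look something like the sketch below. This is only an illustration under stated assumptions: the report does not describe the actual test, so the threshold `RSS_LIMIT_KB`, the helper names, and the use of the Linux `/proc/<pid>/status` file are all hypothetical choices, and the `restart` callable stands in for whatever restart mechanism is really used.

```python
import re

# Hypothetical threshold: the report does not state the real limit.
RSS_LIMIT_KB = 4 * 1024 * 1024  # 4 GiB expressed in kB


def parse_rss_kb(status_text):
    """Return the VmRSS value (kB) from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def needs_restart(status_text, limit_kb=RSS_LIMIT_KB):
    """True when the resident memory reported in status_text exceeds limit_kb."""
    rss_kb = parse_rss_kb(status_text)
    return rss_kb is not None and rss_kb > limit_kb


def check_and_restart(pid, restart, limit_kb=RSS_LIMIT_KB):
    """Read /proc/<pid>/status (Linux only) and call restart() if over the limit.

    `restart` is a caller-supplied function; how the batch server is
    actually restarted is not described in the report.
    """
    with open(f"/proc/{pid}/status") as handle:
        status_text = handle.read()
    if needs_restart(status_text, limit_kb):
        restart()
        return True
    return False
```

Run periodically (for example from cron), a check of this kind would restart the process only when its resident memory crosses the chosen limit, leaving it untouched otherwise.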
Ongoing Disk Server Issues
- GDSS447 (AtlasDataDisk - D1T0) failed with a read-only filesystem last night and is undergoing investigation.
Notable Changes made this last week
- On Monday (17th Dec) the Castor Information provider was upgraded to fix an issue where one of LHCb's paths was showing as undefined.
- The post-mortem report for the Power Incident on 20th November has been prepared and is available at: RAL_Tier1_Incident_20121120_UPS_Over_Voltage
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Upgrade to version 2.1.13.
- Networking:
  - Network trunking change as part of investigation (& possible fix) into asymmetric data rates.
  - Improve the stack 13 uplink.
  - Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network.
  - Update Spine layer for Tier1 network.
  - Replacement of UKLight Router.
  - Addition of caching DNSs into the Tier1 network.
- Grid Services:
  - Checking VO usage of, and requirement for, AFS clients from Worker Nodes.
- Infrastructure:
  - Intervention required on the "Essential Power Board" & remedial work on three (out of four) transformers.
  - Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
Entries in GOC DB starting between 12th and 19th December 2012
There were four unscheduled outages in the GOC DB for this period. Three were for the problem with the Atlas SRMs last week (Wed 12th Dec). The other was the site outage caused by the network router failure yesterday morning (18th Dec).
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole site | UNSCHEDULED | OUTAGE | 18/12/2012 06:45 | 18/12/2012 10:45 | 4 hours | Hardware failure in core site network has taken RAL Tier1 off-air.
srm-atlas.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 12/12/2012 13:30 | 12/12/2012 14:57 | 1 hour and 27 minutes | Ongoing problem with Atlas SRM being investigated.
srm-atlas.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 12/12/2012 11:45 | 12/12/2012 13:30 | 1 hour and 45 minutes | Ongoing problems with Atlas SRM.
srm-atlas.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 12/12/2012 10:30 | 12/12/2012 11:45 | 1 hour and 15 minutes | There are problems with the Atlas srm Database.
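The Duration column in the table above follows directly from the Start and End timestamps. As a minimal sketch of that arithmetic, the snippet below recomputes each duration from the values in the table. The function names and the `OUTAGES` list are illustrative only and are not part of any GOC DB tooling.

```python
from datetime import datetime

# Day/month/year format used in the Start and End columns of the table.
GOCDB_FORMAT = "%d/%m/%Y %H:%M"

# (service, start, end) copied from the GOC DB entries listed in the table.
OUTAGES = [
    ("Whole site", "18/12/2012 06:45", "18/12/2012 10:45"),
    ("srm-atlas.gridpp.rl.ac.uk", "12/12/2012 13:30", "12/12/2012 14:57"),
    ("srm-atlas.gridpp.rl.ac.uk", "12/12/2012 11:45", "12/12/2012 13:30"),
    ("srm-atlas.gridpp.rl.ac.uk", "12/12/2012 10:30", "12/12/2012 11:45"),
]


def outage_minutes(start, end):
    """Length of an outage in whole minutes."""
    begin = datetime.strptime(start, GOCDB_FORMAT)
    finish = datetime.strptime(end, GOCDB_FORMAT)
    return int((finish - begin).total_seconds() // 60)


def describe(minutes):
    """Format minutes in the style of the Duration column."""
    hours, mins = divmod(minutes, 60)
    parts = []
    if hours:
        parts.append(f"{hours} hour" + ("" if hours == 1 else "s"))
    if mins:
        parts.append(f"{mins} minute" + ("" if mins == 1 else "s"))
    return " and ".join(parts) if parts else "0 minutes"


if __name__ == "__main__":
    for service, start, end in OUTAGES:
        print(service, "|", describe(outage_minutes(start, end)))
```

Because the three Atlas SRM entries are contiguous (each starts when the previous one ends), summing their minutes gives the length of the single underlying incident on 12th December.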
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
89733 | Red | Urgent | In Progress | 2012-12-17 | 2012-12-18 | | RAL bdii giving out incorrect information
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2012-10-31 | | correlated packet-loss on perfsonar host

Day | OPS | Alice | Atlas | CMS | LHCb | Comment
12/12/12 | 100 | 100 | 80.6 | 100 | 100 | Problems with Atlas SRM.
13/12/12 | 100 | 98.6 | 100 | 100 | 100 | Timeout for the job exceeded.
14/12/12 | 100 | 100 | 100 | 100 | 100 |
15/12/12 | 100 | 100 | 100 | 100 | 100 |
16/12/12 | 100 | 100 | 100 | 95.9 | 100 | Single SRM test failure "user timeout".
17/12/12 | 100 | 100 | 99.2 | 100 | 100 | Single error while deleting test file.
18/12/12 | 71.2 | 76.0 | 63.7 | 64.7 | 87.5 | Site Network problem (Router A failure) followed by some CE problems.