RAL Tier1 Operations Report for 5th December 2012
Review of Issues during the week 28th November to 5th December 2012
- Overnight Wed/Thu 28/29 Nov there was a problem affecting all of Castor, caused by a crash of the Castor permissions database. This was resolved by the Castor On-Call Team during the night.
- On Thursday 29th Nov there were problems with the Castor CMS instance, traced to a database problem. This was fixed at the end of the afternoon by moving the CMS Castor database to a different node.
- On Friday (30th Nov) there was a problem with the Atlas Frontier service with both nodes unavailable. Atlas raised a GGUS ticket. The problem was fixed and the monitoring of this service is being improved.
- On Monday (3rd Dec) a high rate of failures for Atlas Castor was seen. This was fixed by a bounce of the Atlas Castor database at around 13:00 that day.
- On Tuesday morning (4th Dec), around 08:00 local time, there was a transitory problem that caused a high rate of Castor SRM failures (seen in the FTS). The root of the problem has not been definitively identified but appears to be a network issue.
Resolved Disk Server Issues
- GDSS673 (CMSTape - D0T1) crashed on Tuesday morning, 27th Nov. It was returned to production on Saturday (1st Dec) following a firmware update (required to help identify a faulty disk within the array) and RAID array verification.
- GDSS647 (LHCbDst - D1T0) failed on Thursday (29th Nov) with a problem on a system partition. It was returned to service on Monday (3rd Dec).
- GDSS661 (AtlasDataDisk - D1T0) crashed on Saturday (1st Dec) - returned to service on Monday (3rd Dec).
Current operational status and issues
- Following the power incident the Tier1 has been running with reduced resilience, particularly as regards power supplies for the fibrechannel SAN switches used in the database infrastructure. This particular issue is now resolved. Work continues to replace and re-stock items such as power supplies and PDUs.
- There is an ongoing problem with the Castor Atlas and GEN stager daemons leaking memory. A regular re-starter is now in place for these daemons, and further investigations are taking place with assistance from the Castor developers. (A sketch of this kind of memory-threshold check appears after this list.)
- The batch server process sometimes consumes excessive memory, something which is normally triggered by a network/communication problem with worker nodes. A check for this (with a re-starter) is in place.
- Following checks made on 20th November (at the time of the power incident) it is believed the diesel generator should now work in the event of a further power cut. However, this has not yet been tested. A test (to be confirmed) is proposed for Tuesday 11th December.
- On 12th/13th June the first stage of switching took place in preparation for the work on the main site power supply. The work on the two transformers is expected to take until 18th December: it involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating with the other half. The work is running to schedule. One half of the new switchboard has been refurbished and was brought into service on 17 September.
- High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3PB of storage).
- Investigations are ongoing into asymmetric bandwidth to/from other sites; in particular we are seeing some poor outbound rates, a problem which disappears when we are not loading the network.
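The re-starters referred to above are, in essence, regular checks that restart a daemon once its memory use grows beyond a threshold. The sketch below illustrates the general approach only; the daemon name, memory limit and restart command are placeholders, not the actual values or scripts used for the Castor stager daemons or the batch server.

    #!/usr/bin/env python
    # Minimal sketch of a memory-threshold re-starter (illustrative only).
    # The daemon name, limit and restart command are placeholders, not the
    # actual Castor stager / batch server configuration.

    import subprocess

    PROCESS_NAME = "stagerd"                         # hypothetical daemon name
    RSS_LIMIT_KB = 4 * 1024 * 1024                   # restart above ~4 GB resident
    RESTART_CMD = ["service", "stagerd", "restart"]  # hypothetical restart command

    def total_rss_kb(name):
        """Return the summed resident set size (kB) of all processes called 'name'."""
        proc = subprocess.Popen(["ps", "-C", name, "-o", "rss="],
                                stdout=subprocess.PIPE)
        out = proc.communicate()[0].decode()
        return sum(int(field) for field in out.split())

    def check_and_restart():
        rss = total_rss_kb(PROCESS_NAME)
        if rss > RSS_LIMIT_KB:
            # Daemon has grown past the limit: report and bounce it.
            print("%s using %d kB (limit %d kB) - restarting" %
                  (PROCESS_NAME, rss, RSS_LIMIT_KB))
            subprocess.call(RESTART_CMD)

    if __name__ == "__main__":
        # Intended to be run periodically, e.g. from cron every few minutes.
        check_and_restart()

In practice such a check would normally be driven from cron or a monitoring framework and would log to a file rather than stdout.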
Ongoing Disk Server Issues
Notable Changes made this last week
- On Tuesday (4th Dec) replacement power supplies for the fibrechannel SAN switches used in the database infrastructure were obtained and installed. This removes the most significant resilience issue remaining after the power incident of 20th November.
- The final two batches of worker nodes were drained over the weekend and upgraded to EMI-2 (SL5) on Monday (3rd Dec).
- The final two batches of worker nodes had their overcommit increased yesterday (4th Dec) to make use of hyperthreading (see the illustrative note after this list).
- Thursday 6th December: 'Warning' on the Castor 'GEN' instance while debugging the Castor stager memory leak.
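For context on the overcommit change above, the figures here are purely illustrative and are not the actual RAL settings: a worker node with 8 physical cores presenting 16 logical cores via hyperthreading can be configured with more batch job slots than physical cores (for example 12 rather than 8), trading a small per-job slowdown for better overall throughput from the otherwise idle hardware threads.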
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Networking:
- Network trunking change as part of investigation (& possible fix) into asymmetric data rates.
- Improve the stack 13 uplink.
- Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network.
- Update Spine layer for Tier1 network.
- Replacement of UKLight Router.
- Addition of caching DNSs into the Tier1 network.
- Grid Services:
- Checking VO usage of, and requirement for, AFS clients from Worker Nodes.
- Infrastructure:
- Test of move to diesel power in event of power loss. (Proposed - Tuesday 11th December).
- Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
Entries in GOC DB starting between 28th November and 5th December 2012
There were no unscheduled outages in the GOC DB for this period.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgui02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 28/11/2012 10:00 | 28/11/2012 12:00 | 2 hours | Re-install with EMI software version (Upgrade postponed from last week).
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
88596 | Red | Very Urgent | In Progress | 2012-10-19 | 2012-12-01 | T2K | Jobs don't get delegated to RAL
86690 | Red | Urgent | In Progress | 2012-10-03 | 2012-12-04 | T2K | JPKEKCRC02 missing from FTS ganglia metrics
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2012-10-31 | | correlated packet-loss on perfsonar host
Daily availability (%) by VO:
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
28/11/12 | 100 | 100 | 100 | 95.7 | 100 | Single failure of SRM test "User timeout over"
29/11/12 | 96.9 | 100 | 100 | 62.5 | 91.9 | Castor permissions DB crashed; CMS also affected by a separate DB problem.
30/11/12 | 100 | 100 | 100 | 100 | 95.8 | Single SRM test failure. Probably caused by reboot of Router A.
01/12/12 | 100 | 100 | 100 | 100 | 100 |
02/12/12 | 100 | 100 | 100 | 100 | 100 |
03/12/12 | 100 | 100 | 99.1 | 100 | 100 | Single SRM test failure. Database problem.
04/12/12 | 100 | 100 | 99.5 | 95.9 | 100 | Single failures of SRM tests. Transient network problem.
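As a rough worked example of how a single test failure maps onto these figures (assuming the relevant test runs approximately hourly, which is an assumption rather than a stated fact): one failed test costs about one sample in 24, i.e. 23/24 ≈ 95.8%, which is broadly consistent with the 95.7-95.9% entries annotated as single SRM test failures. The exact value depends on the test frequency and on how the failure window is counted.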