Latest revision as of 13:16, 2 January 2013
RAL Tier1 Operations Report for 2nd January 2013
Review of Issues during the fortnight
19th December 2012 to 2nd January 2013.
This period mainly covers the Christmas & New Year Holidays (from Friday 21st Dec to Wednesday 2nd Jan). With the exception of the Atlas Castor database problem (see below) it was a fairly quiet period.
- On Christmas Day (25th Dec) a problem appeared with the Atlas Castor stager and SRM databases. This took some time to track down and resulted in intermittent performance of the Atlas Castor instance until the 27th. The cause was finally traced to a spurious error/warning returned for a database password that had not yet expired but was due to expire shortly.
- On Tuesday (1st Jan), at the end of the afternoon, one of the four Top-BDII nodes failed. The Top-BDII service ran in a degraded manner until the following morning.
- Over the holiday there were a couple of minor batch issues picked up and fixed by the on-call team although these did not significantly affect batch work.
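The password problem above suggests a simple preventive check: Oracle records each account's expiry date, so accounts approaching expiry can be flagged before warnings start surfacing as errors. A minimal sketch of that logic, with illustrative account names and cutoff (not the team's actual procedure; in practice the rows would come from a query such as `SELECT username, expiry_date FROM dba_users`):

```python
from datetime import date, timedelta

def expiring_accounts(rows, today, warn_days=14):
    """Return usernames whose password expires within warn_days of today.

    rows: iterable of (username, expiry_date) pairs, e.g. fetched from
    Oracle's DBA_USERS view. Hypothetical data below for illustration.
    """
    cutoff = today + timedelta(days=warn_days)
    return [user for user, expiry in rows
            if expiry is not None and today <= expiry <= cutoff]

# Illustrative result set (not real account names):
rows = [
    ("STAGER_DB", date(2013, 1, 4)),   # expires within 14 days -> flagged
    ("SRM_DB",    date(2013, 6, 1)),   # well in the future -> ignored
]
print(expiring_accounts(rows, today=date(2012, 12, 25)))  # -> ['STAGER_DB']
```

Run periodically (e.g. from cron), such a check would have flagged the imminent expiry days before the warning began disturbing the stager and SRM databases.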
Resolved Disk Server Issues
- GDSS447 (AtlasDataDisk - D1T0) failed with a read only filesystem overnight 18/19 Dec. It was ready to go back into production the next day. However, owing to an error this was not done fully until 24th Dec.
- GDSS449 (AtlasDataDisk - D1T0) failed with a read only filesystem on Tuesday 31st Dec. It was returned to production the next day (1st Jan).
Current operational status and issues
- The batch server process sometimes consumes excessive memory, normally triggered by a network/communication problem with worker nodes. A test for this condition (with an automatic re-starter) is in place.
- On 12th/13th June the first stage of switching, in preparation for the work on the main site power supply, took place. One half of the new switchboard has been refurbished and was brought into service on 17th September. The work on the second half is over-running slightly, with an estimated completion date of 13th January. (The original date was 18th Dec.)
- High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3PB of storage).
- Investigations are ongoing into asymmetric bandwidth to/from other sites; in particular we are seeing some poor outbound rates, a problem which disappears when the network is not under load.
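The re-starter mentioned above amounts to a memory watchdog: read the process's resident set size and restart it when it crosses a threshold. A minimal sketch of that decision logic, using Linux's /proc/&lt;pid&gt;/status VmRSS field (the process name, threshold, and sample text below are illustrative, not the actual Tier1 tooling):

```python
def rss_kb(status_text):
    """Extract VmRSS (resident set size, in kB) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])  # line looks like "VmRSS:  123456 kB"
    return None  # field absent (e.g. kernel thread)

def needs_restart(status_text, limit_kb):
    """True if the process's resident memory exceeds the configured limit."""
    rss = rss_kb(status_text)
    return rss is not None and rss > limit_kb

# In production status_text would come from open("/proc/%d/status" % pid).read();
# a hypothetical sample is used here.
sample = "Name:\tpbs_server\nVmRSS:\t 9437184 kB\nThreads:\t12\n"
print(needs_restart(sample, limit_kb=8 * 1024 * 1024))  # 8 GB limit -> True
```

Wrapped in a cron job or monitoring hook that issues a service restart when the check fires, this keeps a slow leak from exhausting the batch server's memory between manual interventions.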
Ongoing Disk Server Issues
Notable Changes made this last week
- On Wednesday/Thursday 19/20th Dec a firmware upgrade was rolled out to one batch of disk servers following a higher rate of problems in that batch.
- The post-mortem report for the power incident on 20th November has been prepared and is available at: RAL_Tier1_Incident_20121120_UPS_Over_Voltage (repeat of information in last report).
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Upgrade to version 2.1.13
- Networking:
- Network trunking change as part of investigation (& possible fix) into asymmetric data rates.
- Improve the stack 13 uplink
- Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network.
- Update Spine layer for Tier1 network.
- Replacement of UKLight Router.
- Addition of caching DNSs into the Tier1 network.
- Grid Services:
- Checking VO usage of, and requirement for, AFS clients from Worker Nodes.
- Infrastructure:
- Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
Entries in GOC DB starting between 19th December 2012 and 2nd January 2013.
There was one unscheduled outage in the GOC DB for this period which is for the Atlas Castor problems that began on Christmas Day.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-atlas | UNSCHEDULED | OUTAGE | 25/12/2012 06:00 | 25/12/2012 12:31 | 6 hours and 31 minutes | ATLAS SRM database problems
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
89733 | Red | Urgent | In Progress | 2012-12-17 | 2012-12-20 | | RAL bdii giving out incorrect information
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2012-10-31 | | correlated packet-loss on perfsonar host
|
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
19/12/12 | 100 | 100 | 100 | 100 | 100 |
20/12/12 | 100 | 100 | 100 | 100 | 100 |
21/12/12 | 100 | 100 | 100 | 100 | 100 |
22/12/12 | 100 | 89.6 | 89.5 | 86.0 | 100 | Monitoring problem affected a number of grid sites.
23/12/12 | 100 | 100 | 100 | 95.1 | 100 | Single SRM test failure "user timeout".
24/12/12 | 100 | 100 | 100 | 100 | 100 |
25/12/12 | 100 | 100 | 45.7 | 100 | 100 | Database problem with cryptic error.
26/12/12 | 100 | 98.3 | 38.1 | 99.5 | 100 | Atlas - ongoing from 25/12. Alice & CMS: Monitoring/BDII problem.
27/12/12 | 100 | 100 | 73.6 | 90.6 | 100 | Atlas - ongoing from 25/12. CMS: Monitoring/BDII problem plus a single SRM test failure.
28/12/12 | 100 | 100 | 100 | 95.9 | 100 | Single SRM test failure "user timeout".
29/12/12 | 100 | 100 | 100 | 100 | 100 |
30/12/12 | 100 | 100 | 100 | 100 | 100 |
31/12/12 | 100 | 100 | 99.1 | 100 | 100 | Single SRM Put failure.
01/01/13 | 100 | 100 | 100 | 91.8 | 100 | Two SRM test failures "user timeout".