Tier1 Operations Report 2012-12-19

From GridPP Wiki

RAL Tier1 Operations Report for 19th December 2012

Review of Issues during the week

12th to 19th December 2012

  • On Tuesday (18th December) there was a failure of one of the site routers that took the entire Tier1 off-air at 06:45. The router was fixed and its configuration verified around 3 hours later. This was followed by a period of verifying various systems and connections within the Tier1, and the outage was ended in the GOC DB at 10:45. Some problems were reported with the batch system after this; these were finally resolved around 15:00.
Resolved Disk Server Issues
  • GDSS443 (AtlasDataDisk - D1T0) failed with a read-only filesystem on Thursday 13th Dec. It was returned to production the next day. One disk was found to be faulty.
  • GDSS449 (AtlasDataDisk - D1T0) failed with a read-only filesystem on Sunday 16th Dec. It was returned to production the next day.
Current operational status and issues
  • We have seen an increasing rate of failures on one of the '08 batches of disk servers. A program of upgrading the disk controller firmware in this batch is under way.
  • The batch server process sometimes consumes excessive memory, which is normally triggered by a network/communication problem with worker nodes. A test for this (with a re-starter) is in place.
  • On 12th/13th June the first stage of switching ready for the work on the main site power supply took place. One half of the new switchboard has been refurbished and was brought into service on 17 September. The work on the second half is over-running slightly, with an estimated completion date of 13th January. (Original date was 18th Dec.)
  • High load observed on the uplink to one of the network stacks (stack 13), serving SL09 disk servers (~ 3PB of storage).
  • Investigations are ongoing into asymmetric bandwidth to/from other sites. In particular, we are seeing some poor outbound rates, a problem which disappears when we are not loading the network.
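The memory test with a re-starter mentioned above for the batch server could look roughly like the following. This is only a sketch under assumptions: the report does not say how the real check is implemented, the threshold `RSS_LIMIT_KIB` and the function names are hypothetical, and the resident memory is assumed to be read from the `VmRSS` line of a Linux `/proc/<pid>/status` file. The actual restart step is left out because the report does not name the service or its commands.

```python
import re

# Hypothetical threshold (4 GiB, expressed in KiB); the report does not state the real limit.
RSS_LIMIT_KIB = 4 * 1024 * 1024


def rss_kib_from_status(status_text):
    """Return the VmRSS value (KiB) found in /proc/<pid>/status text, or None if absent."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def needs_restart(status_text, limit_kib=RSS_LIMIT_KIB):
    """Decide whether the process has grown beyond the allowed resident memory."""
    rss = rss_kib_from_status(status_text)
    return rss is not None and rss > limit_kib
```

A periodic job (for example from cron) could read the status file of the batch server process, call `needs_restart`, and trigger the site's own restart procedure when it returns `True`.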
Ongoing Disk Server Issues
  • GDSS447 (AtlasDataDisk - D1T0) failed with a read-only filesystem last night and is undergoing investigation.
Notable Changes made this last week
  • On Monday (17th Dec) the Castor Information provider was upgraded to fix an issue where one of LHCb's paths was showing as undefined.
  • The post-mortem report for the Power Incident on 20th November has been prepared and is available at: RAL_Tier1_Incident_20121120_UPS_Over_Voltage
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Network trunking change as part of investigation (& possible fix) into asymmetric data rates.
    • Improve the stack 13 uplink
    • Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network.
    • Update Spine layer for Tier1 network.
    • Replacement of UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Checking VO usage of, and requirement for, AFS clients from Worker Nodes.
  • Infrastructure:
    • Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.


Entries in GOC DB starting between 12th and 19th December 2012

There were four unscheduled outages in the GOC DB for this period. Three were for the problem with the Atlas SRMs last week (Wed 12th Dec). The other was the site outage caused by the network router failure yesterday morning (18th Dec).

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole site | UNSCHEDULED | OUTAGE | 18/12/2012 06:45 | 18/12/2012 10:45 | 4 hours | Hardware failure in core site network has taken RAL Tier1 off-air.
srm-atlas.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 12/12/2012 13:30 | 12/12/2012 14:57 | 1 hour and 27 minutes | Ongoing problem with Atlas SRM being investigated.
srm-atlas.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 12/12/2012 11:45 | 12/12/2012 13:30 | 1 hour and 45 minutes | Ongoing problems with Atlas SRM.
srm-atlas.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 12/12/2012 10:30 | 12/12/2012 11:45 | 1 hour and 15 minutes | There are problems with the Atlas srm Database.
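The Duration column above follows directly from the Start and End timestamps. As a cross-check, a minimal sketch that recomputes the durations is given here; the function and variable names are illustrative only and are not part of the GOC DB, and the timestamps are copied from the table.

```python
from datetime import datetime

# Timestamp layout used in the table above: DD/MM/YYYY HH:MM
FMT = "%d/%m/%Y %H:%M"

# (service, start, end) copied from the GOC DB entries listed above
OUTAGES = [
    ("Whole site", "18/12/2012 06:45", "18/12/2012 10:45"),
    ("srm-atlas.gridpp.rl.ac.uk", "12/12/2012 13:30", "12/12/2012 14:57"),
    ("srm-atlas.gridpp.rl.ac.uk", "12/12/2012 11:45", "12/12/2012 13:30"),
    ("srm-atlas.gridpp.rl.ac.uk", "12/12/2012 10:30", "12/12/2012 11:45"),
]


def outage_minutes(start, end):
    """Length of an outage in whole minutes."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


def total_minutes(service):
    """Sum of all outage minutes recorded for one service."""
    return sum(outage_minutes(s, e) for name, s, e in OUTAGES if name == service)


if __name__ == "__main__":
    for name, start, end in OUTAGES:
        hours, mins = divmod(outage_minutes(start, end), 60)
        print(f"{name}: {hours} h {mins} min")
```

Because the three Atlas SRM entries are contiguous (10:30 to 14:57), their summed duration is the same as the length of the overall Atlas SRM outage on 12th December.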
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
89733 | Red | Urgent | In Progress | 2012-12-17 | 2012-12-18 | | RAL bdii giving out incorrect information
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2012-10-31 | | correlated packet-loss on perfsonar host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
12/12/12 | 100 | 100 | 80.6 | 100 | 100 | Problems with Atlas SRM.
13/12/12 | 100 | 98.6 | 100 | 100 | 100 | Timeout for the job exceeded.
14/12/12 | 100 | 100 | 100 | 100 | 100 |
15/12/12 | 100 | 100 | 100 | 100 | 100 |
16/12/12 | 100 | 100 | 100 | 95.9 | 100 | Single SRM test failure "user timeout".
17/12/12 | 100 | 100 | 99.2 | 100 | 100 | Single error while deleting test file.
18/12/12 | 71.2 | 76.0 | 63.7 | 64.7 | 87.5 | Site Network problem (Router A failure) followed by some CE problems.
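For a quick per-VO summary of the week, the daily figures above can be averaged. The sketch below uses a plain unweighted mean of the seven daily values; this is an assumption made for illustration and is not necessarily how official availability figures are aggregated. The names used are hypothetical.

```python
# Daily availability figures for 12/12/12 to 18/12/12, copied from the table above
AVAILABILITY = {
    "OPS":   [100, 100, 100, 100, 100, 100, 71.2],
    "Alice": [100, 98.6, 100, 100, 100, 100, 76.0],
    "Atlas": [80.6, 100, 100, 100, 100, 99.2, 63.7],
    "CMS":   [100, 100, 100, 100, 95.9, 100, 64.7],
    "LHCb":  [100, 100, 100, 100, 100, 100, 87.5],
}


def weekly_mean(values):
    """Unweighted mean of the daily availability values."""
    return sum(values) / len(values)


if __name__ == "__main__":
    for vo, values in AVAILABILITY.items():
        print(f"{vo}: {weekly_mean(values):.1f}")
```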