Tier1 Operations Report 2013-01-02

RAL Tier1 Operations Report for 2nd January 2013

Review of Issues during the fortnight

19th December 2012 to 2nd January 2013.

This period mainly covers the Christmas & New Year holidays (from Friday 21st Dec to Wednesday 2nd Jan). With the exception of the Atlas Castor database problem (see below), it was a fairly quiet period.

  • On Christmas Day (25th Dec) a problem appeared with the Atlas Castor stager and SRM databases. This took some time to track down and resulted in intermittent performance of the Atlas Castor instance until the 27th. The cause was eventually traced to a spurious error/warning returned because a database password, although not yet expired, was due to expire shortly (a sketch of an expiry check follows this list).
  • On Tuesday (1st Jan), at the end of the afternoon, one of the four Top-BDII nodes failed. The Top-BDII ran in a degraded manner until the following morning.
  • Over the holiday a couple of minor batch issues were picked up and fixed by the on-call team, although these did not significantly affect batch work.
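The password-expiry condition behind the Atlas Castor problem could, in principle, be caught ahead of time by routinely checking account expiry dates on the Castor databases. The following is a minimal sketch only, assuming an Oracle back end, the cx_Oracle Python client and a 30-day warning threshold; the account name and connection string are hypothetical placeholders, not the actual Tier1 configuration.

  # Minimal sketch (hypothetical): warn when a database account used by Castor
  # is close to password expiry. The account name, DSN and 30-day threshold
  # are placeholders, not the real Tier1 settings.
  import datetime
  import cx_Oracle

  WARN_DAYS = 30  # assumed warning threshold

  def check_password_expiry(user, password, dsn):
      """Report the account's own expiry date from USER_USERS."""
      with cx_Oracle.connect(user, password, dsn) as conn:
          cur = conn.cursor()
          cur.execute("SELECT username, account_status, expiry_date FROM user_users")
          for username, status, expiry in cur:
              if expiry is None:
                  print(f"{username}: no password expiry set ({status})")
                  continue
              days_left = (expiry - datetime.datetime.now()).days
              flag = "WARNING" if days_left <= WARN_DAYS else "ok"
              print(f"{username}: {status}, expires {expiry:%Y-%m-%d} "
                    f"({days_left} days left) [{flag}]")

  if __name__ == "__main__":
      # Hypothetical stager account and DSN, for illustration only.
      check_password_expiry("castor_stager", "change_me", "castordb.example.org/STAGER")

Querying USER_USERS as the application account itself avoids needing DBA privileges; the same check against DBA_USERS would cover all accounts at once.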
Resolved Disk Server Issues
  • GDSS447 (AtlasDataDisk - D1T0) failed with a read-only filesystem overnight on 18/19 Dec. It was ready to go back into production the next day; however, owing to an error, it was not fully returned until 24th Dec.
  • GDSS449 (AtlasDataDisk - D1T0) failed with a read-only filesystem on Monday 31st Dec. It was returned to production the next day (1st Jan).
Current operational status and issues
  • The batch server process sometimes consumes excessive memory, something which is normally triggered by a network/communication problem with worker nodes. A test for this condition (with a re-starter) is in place; an illustrative sketch follows this list.
  • On 12th/13th June the first stage of switching, in preparation for the work on the main site power supply, took place. One half of the new switchboard has been refurbished and was brought into service on 17th September. The work on the second half is over-running slightly, with an estimated completion date of 13th January. (The original date was 18th Dec.)
  • High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3 PB of storage).
  • Investigations are ongoing into asymmetric bandwidth to/from other sites; in particular we are seeing some poor outbound rates, a problem which disappears when we are not loading the network.
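For reference, the memory check and re-starter mentioned in the first bullet above can be pictured as a simple watchdog that restarts the batch server when its resident memory grows beyond a limit. The sketch below is illustrative only and uses the psutil library; the process name, restart command and 8 GB limit are assumed placeholders, not the actual Tier1 tooling.

  # Illustrative watchdog sketch (not the production re-starter): restart a
  # service if its resident memory exceeds a limit. Process name, restart
  # command and limit are hypothetical placeholders.
  import subprocess
  import psutil

  PROCESS_NAME = "batch-server"                          # hypothetical name
  RESTART_CMD = ["service", "batch-server", "restart"]   # hypothetical command
  RSS_LIMIT_BYTES = 8 * 1024**3                          # assumed 8 GB limit

  def total_rss(name):
      """Sum the resident memory of all processes matching the given name."""
      rss = 0
      for proc in psutil.process_iter(["name", "memory_info"]):
          if proc.info["name"] == name and proc.info["memory_info"] is not None:
              rss += proc.info["memory_info"].rss
      return rss

  def check_and_restart():
      rss = total_rss(PROCESS_NAME)
      print(f"{PROCESS_NAME}: resident memory {rss / 1024**2:.0f} MiB")
      if rss > RSS_LIMIT_BYTES:
          print("limit exceeded, restarting")
          subprocess.run(RESTART_CMD, check=True)

  if __name__ == "__main__":
      check_and_restart()   # typically run periodically, e.g. from cron

Restarting on a memory threshold is only a stop-gap; as noted above, the growth is normally triggered by a network/communication problem with the worker nodes, which remains the underlying issue.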
Ongoing Disk Server Issues
Notable Changes made this last fortnight
  • On Wednesday/Thursday 19th/20th Dec a firmware upgrade was rolled out to one batch of disk servers, following a higher rate of problems in that batch.
  • The post-mortem report for the Power Incident on 20th November has been prepared and is available at: RAL_Tier1_Incident_20121120_UPS_Over_Voltage (repeat of information in the last report).
Declared in the GOC DB
  • None
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Network trunking change as part of investigation (& possible fix) into asymmetric data rates.
    • Improve the stack 13 uplink
    • Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network.
    • Update Spine layer for Tier1 network.
    • Replacement of UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Checking VO usage of, and requirement for, AFS clients from Worker Nodes.
  • Infrastructure:
    • Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.


Entries in GOC DB starting between 19th December 2012 and 2nd January 2013.

There was one unscheduled outage in the GOC DB for this period, covering the Atlas Castor problems that began on Christmas Day.

Service Scheduled? Outage/At Risk Start End Duration Reason
srm-atlas UNSCHEDULED OUTAGE 25/12/2012 06:00 25/12/2012 12:31 6 hours and 31 minutes ATLAS SRM database problems
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
89733 Red Urgent In Progress 2012-12-17 2012-12-20 RAL bdii giving out incorrect information
86152 Red Less Urgent On Hold 2012-09-17 2012-10-31 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
19/12/12 100 100 100 100 100
20/12/12 100 100 100 100 100
21/12/12 100 100 100 100 100
22/12/12 100 89.6 89.5 86.0 100 Monitoring problem affected a number of grid sites.
23/12/12 100 100 100 95.1 100 Single SRM test failure "user timeout".
24/12/12 100 100 100 100 100
25/12/12 100 100 45.7 100 100 Database problem with cryptic error.
26/12/12 100 98.3 38.1 99.5 100 Atlas - ongoing from 25/12. Alice & CMS: Monitoring/BDII problem.
27/12/12 100 100 73.6 90.6 100 Atlas - ongoing from 25/12. CMS: Monitoring/BDII problem plus a single SRM test failure.
28/12/12 100 100 100 95.9 100 Single SRM test failure "user timeout".
29/12/12 100 100 100 100 100
30/12/12 100 100 100 100 100
31/12/12 100 100 99.1 100 100 Single SRM Put failure.
01/01/13 100 100 100 91.8 100 Two SRM test failures "user timeout".