Tier1 Operations Report 2012-12-05

RAL Tier1 Operations Report for 5th December 2012

Review of Issues during the week 28th November to 5th December 2012
  • Overnight Wed/Thu 28/29 Nov there was a problem affecting all of Castor, caused by a crash of the Castor permissions database. This was resolved by the Castor On-Call Team during the night.
  • On Thursday 29th Nov there were problems with the Castor CMS instance, traced to a database problem. This was fixed at the end of the afternoon by moving the CMS Castor database to a different node.
  • On Friday (30th Nov) there was a problem with the Atlas Frontier service with both nodes unavailable. Atlas raised a GGUS ticket. The problem was fixed and the monitoring of this service is being improved.
  • On Monday (3rd Dec) a high rate of failures for Atlas Castor was seen. This was fixed by a bounce of the Atlas Castor database at around 13:00 that day.
  • On Tuesday morning (4th Dec), around 08:00 local time, there was a transitory problem that caused a high rate of Castor SRM failures (seen in the FTS). The root cause has not been definitively identified but appears to be a network issue.
Resolved Disk Server Issues
  • GDSS673 (CMSTape - D0T1) crashed on Tuesday morning, 27th Nov. It was returned to production on Saturday (1st Dec) following a firmware update (required to help identify a faulty disk within the array) and RAID array verification.
  • GDSS647 (LHCbDst - D1T0) failed on Thursday (29th Nov) with a problem on a system partition. It was returned to service on Monday (3rd Dec).
  • GDSS661 (AtlasDataDisk - D1T0) crashed on Saturday (1st Dec) - returned to service on Monday (3rd Dec).
Current operational status and issues
  • Following the power incident the Tier1 has been running with reduced resilience, particularly with regard to the power supplies for the fibrechannel SAN switches used in the database infrastructure; this particular issue is now resolved. Work continues to replace and re-stock items such as power supplies and PDUs.
  • There is an ongoing problem with the Castor Atlas and GEN stager daemons leaking memory. A regular re-starter is now in place for these daemons (a minimal sketch of the approach appears after this list) and further investigations are taking place with assistance from the Castor developers.
  • The batch server process sometimes consumes excessive memory, which is normally triggered by a network/communication problem with worker nodes. A test for this (with a re-starter) is in place.
  • Following checks made on 20th November (at the time of the power incident) it is believed that the diesel generator should now work in the event of a further power cut. However, this has not yet been tested. A test (to be confirmed) is proposed for Tuesday 11th December.
  • On 12th/13th June the first stage of switching took place in preparation for the work on the main site power supply. The work on the two transformers is expected to last until 18th December: it involves powering off one half of the resilient supply for three months while it is overhauled, then repeating with the other half. The work is running to schedule. One half of the new switchboard has been refurbished and was brought into service on 17th September.
  • High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3 PB of storage).
  • Investigations are ongoing into asymmetric bandwidth to/from other sites; in particular we are seeing some poor outbound rates, a problem which disappears when we are not loading the network.
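
The re-starters referred to above are not described in detail in this report; the sketch below only illustrates the general approach, i.e. a periodic check of a daemon's resident memory that restarts it once a threshold is exceeded. The pidfile path, service name and threshold are hypothetical examples, not the actual RAL configuration.

    #!/usr/bin/env python
    # Minimal sketch of a memory-threshold re-starter (illustrative only).
    # Assumptions: the daemon is identified by a pidfile, is restarted via an
    # init script, and a restart is wanted once its resident memory (VmRSS)
    # exceeds a fixed limit. None of these details are taken from the report.

    import subprocess
    import sys

    PIDFILE = "/var/run/stagerdaemon.pid"                 # hypothetical pidfile
    RESTART_CMD = ["service", "stagerdaemon", "restart"]  # hypothetical init script
    RSS_LIMIT_KB = 4 * 1024 * 1024                        # 4 GB, arbitrary example


    def resident_memory_kb(pid):
        """Return the daemon's resident set size in kB, read from /proc."""
        with open("/proc/%d/status" % pid) as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
        return 0


    def main():
        try:
            pid = int(open(PIDFILE).read().strip())
        except (IOError, ValueError):
            sys.exit("cannot read pid from %s" % PIDFILE)

        rss_kb = resident_memory_kb(pid)
        if rss_kb > RSS_LIMIT_KB:
            # Daemon has grown past the limit: restart it and log the fact.
            print("VmRSS %d kB exceeds %d kB - restarting" % (rss_kb, RSS_LIMIT_KB))
            subprocess.call(RESTART_CMD)
        else:
            print("VmRSS %d kB within limit" % rss_kb)


    if __name__ == "__main__":
        main()

Run from cron every few minutes, a check of this kind approximates the "regular re-starter" behaviour described above.
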
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • On Tuesday (4th Dec) replacement power supplies for the fibrechannel SAN switches used in the database infrastructure were obtained and installed. This removes the most significant resilience issue remaining after the power incident of 20th November.
  • The final two batches of worker nodes were drained over the weekend and upgraded to EMI-2 (SL5) on Monday (3rd Dec).
  • The final two batches of worker nodes had their job-slot overcommit increased yesterday (4th Dec) to make use of hyperthreading (see the sketch after this list).
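
For background (not detailed in the report): hyperthreading presents two logical cores per physical core, and increasing the overcommit means configuring more batch job slots on a node than it has physical cores. The sketch below, using an arbitrary slots-per-core factor, only illustrates that arithmetic on a Linux worker node; the batch system and the factor actually used at RAL are not stated here.

    #!/usr/bin/env python
    # Illustrative sketch only: count physical vs logical cores on a Linux
    # worker node and derive an overcommitted job-slot count. The
    # slots-per-core factor is a hypothetical example, not the RAL value.

    PER_CORE_FACTOR = 1.5   # hypothetical: 1.5 job slots per physical core


    def core_counts():
        """Return (physical_cores, logical_cores) parsed from /proc/cpuinfo."""
        physical = set()
        logical = 0
        phys_id = None
        with open("/proc/cpuinfo") as cpuinfo:
            for line in cpuinfo:
                if line.startswith("processor"):
                    logical += 1
                elif line.startswith("physical id"):
                    phys_id = line.split(":")[1].strip()
                elif line.startswith("core id"):
                    physical.add((phys_id, line.split(":")[1].strip()))
        # Fall back to the logical count if topology fields are missing.
        return len(physical) or logical, logical


    if __name__ == "__main__":
        physical, logical = core_counts()
        slots = int(physical * PER_CORE_FACTOR)
        print("physical cores: %d, logical cores: %d, job slots: %d"
              % (physical, logical, slots))

The resulting slot count is the sort of figure that would then be set in the batch system's per-node configuration.
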
Declared in the GOC DB
  • Thursday 6th December. Warning on Castor 'GEN' instance while debugging Castor stager memory leak.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • None
  • Networking:
    • Network trunking change as part of investigation (& possible fix) into asymmetric data rates.
    • Improve the stack 13 uplink.
    • Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network.
    • Update Spine layer for Tier1 network.
    • Replacement of UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Checking VO usage of, and requirement for, AFS clients from Worker Nodes.
  • Infrastructure:
    • Test of the move to diesel power in the event of power loss. (Proposed - Tuesday 11th December).
    • Intervention required on the "Essential Power Board" & remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.


Entries in GOC DB starting between 28th November and 5th December 2012

There were no unscheduled outages in the GOC DB for this period.

Service Scheduled? Outage/At Risk Start End Duration Reason
lcgui02.gridpp.rl.ac.uk SCHEDULED OUTAGE 28/11/2012 10:00 28/11/2012 12:00 2 hours Re-install with EMI software version (Upgrade postponed from last week).
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
88596 Red Very Urgent In Progress 2012-10-19 2012-12-01 T2K Jobs don't get delegated to RAL
86690 Red Urgent In Progress 2012-10-03 2012-12-04 T2K JPKEKCRC02 missing from FTS ganglia metrics
86152 Red Less Urgent On Hold 2012-09-17 2012-10-31 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
28/11/12 100 100 100 95.7 100 Single failure of SRM test "User timeout over"
29/11/12 96.9 100 100 62.5 91.9 Castor permissions DB crash; CMS also affected by a separate DB problem.
30/11/12 100 100 100 100 95.8 Single SRM test failure. Probably caused by reboot of Router A.
01/12/12 100 100 100 100 100
02/12/12 100 100 100 100 100
03/12/12 100 100 99.1 100 100 Single SRM test failure. Database problem.
04/12/12 100 100 99.5 95.9 100 Single failures of SRM tests. Transient network problem.