Tier1 Operations Report 2012-11-14

RAL Tier1 Operations Report for 14th November 2012

Review of Issues during the fortnight 31st October to 14th November 2012
  • On Sunday 4th Nov there was a problem with the database behind the Atlas SRM that led to an outage of the Atlas SRM for around six hours during the afternoon.
  • On Sunday morning (4th Nov), at around 05:00, the OPN link to CERN flipped such that routing became asymmetric, with packets travelling one way over the primary link and the other way over the backup. This was fixed on Monday (5th): initially both directions were moved onto the backup link, then early in the afternoon the problem was fully resolved and traffic reverted to the primary link in both directions.
  • On Saturday a problem was reported with slow data export rates for Atlas from the Tier0 to RAL. The underlying cause was not found, although the problem was resolved by Monday. It is notable that this overlapped with periods of high Atlas data rates on other links, as well as with the OPN issue and the Sunday SRM database problem referred to above.
  • On Wednesday 7th November, at around 11:30, there was a power outage that affected the RAL site and for which the backup power via the diesel generator did not work. Core services (TopBDII, FTS) were returned to service by the end of that afternoon (although a subsequent problem with the FTS service meant it was down overnight). All services (including Castor & Batch) were back by around 14:00 the next day. A Post Mortem report is being prepared for this incident.
  • Batch services were affected on Saturday (10th Nov) owing to a problem updating the CERN CRLs.
  • There was an outage of the Atlas SRM on Sunday (11th Nov) caused by a problem with the Atlas SRM database.
  • Over the weekend (10/11 Nov) some Castor disk servers did not have their time correctly synchronised, which caused some Castor access failures. A short clock-check sketch follows this list.
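
The time-synchronisation problem above is the kind of fault a routine clock check on the disk servers can flag. Below is a minimal sketch (in Python) of such a check: it sends a single SNTP query and reports the offset of the local clock. The NTP server name and the one-second alarm threshold are illustrative assumptions, not the actual Tier1 configuration or monitoring.

  # Minimal clock-offset check: send one SNTP query (RFC 4330) and compare the
  # server's transmit time against the local clock. Server name and threshold
  # are placeholders, not RAL configuration.
  import socket
  import struct
  import time

  NTP_SERVER = "ntp1.example.org"   # assumption: replace with the local site NTP server
  NTP_EPOCH_OFFSET = 2208988800     # seconds between 1900-01-01 (NTP) and 1970-01-01 (Unix)

  def ntp_offset(server, timeout=5.0):
      """Return the approximate offset (seconds) of the local clock against 'server'."""
      packet = b"\x1b" + 47 * b"\0"             # LI=0, VN=3, Mode=3 (client request)
      with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
          sock.settimeout(timeout)
          t_send = time.time()
          sock.sendto(packet, (server, 123))
          data, _ = sock.recvfrom(512)
          t_recv = time.time()
      # Transmit Timestamp is the last 8 bytes of the 48-byte reply (seconds + fraction).
      secs, frac = struct.unpack("!II", data[40:48])
      server_time = secs - NTP_EPOCH_OFFSET + frac / 2**32
      # Compare against the midpoint of the request/response round trip.
      return server_time - (t_send + t_recv) / 2

  if __name__ == "__main__":
      offset = ntp_offset(NTP_SERVER)
      status = "OK" if abs(offset) < 1.0 else "WARNING: clock not synchronised"
      print(f"offset vs {NTP_SERVER}: {offset:+.3f}s  {status}")
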
Resolved Disk Server Issues
  • GDSS565 (AtlasDataDisk - D1T0) crashed on the morning of Thursday 1st Nov. It was restarted and checked out, being returned to service the following morning (2nd).
  • GDSS436 (AtlasDataDisk - D1T0) failed with a read-only file system in the early hours of Friday (2nd Nov). It was returned to service on Saturday morning (3rd). One file was found corrupted and reported to Atlas as unrecoverable.
  • GDSS443 (AtlasDataDisk - D1T0) also failed with a read-only file system early on Friday (2nd Nov) and was also returned to service on Saturday morning (3rd). Two files were found corrupted and reported to Atlas as unrecoverable.
  • GDSS462 (AliceTape - D0T1) failed with a read-only file system on Monday evening (5th Nov). It was returned to service on Wednesday morning (7th Nov).
  • GDSS420 (AliceTape - D0T1) reported a read-only file system during the afternoon of Tuesday 6th November. After RAID verification (and a delay owing to the power cut) it was returned to production on Friday 9th Nov.
  • GDSS206, GDSS229, GDSS272, GDSS273 (all AtlasScratchDisk - D1T0). These machines had problems following the power outage. They were returned to service on Friday, 9th Nov.
  • GDSS647 (LHCbDst - D1T0) reported an inaccessible disk partition on Friday 9th Nov. A disk was replaced and the system was returned to production a couple of hours later.
  • GDSS437 (AtlasDataDisk - D1T0) reported a read-only file system in the early hours of Saturday (10th Nov). It was returned to production later that day; however, it then reported several checksum errors and was taken out of production for additional checks on Tuesday (13th), before being returned to production again this morning (14th).
Current operational status and issues
  • Should there be another power outage, the backup power via the diesel generator will not work. An investigation, hopefully a fix, and a re-test are scheduled for Tuesday 20th Nov.
  • On 12th/13th June the first stage of switching, in preparation for the work on the main site power supply, took place. The work on the two transformers is expected to take until 18th December and involves powering off one half of the resilient supply for three months while it is overhauled, then repeating with the other half. The work is running to schedule. One half of the new switchboard has been refurbished and was brought into service on 17 September.
  • High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3 PB of storage).
  • Investigations are ongoing into asymmetric bandwidth to/from other sites; in particular we are seeing some poor outbound rates. The opportunity has been taken to make further measurements while the network has been quiet after the power outage and the scheduled network intervention. An illustrative throughput-test sketch follows this list.
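
For the asymmetric-bandwidth investigation mentioned in the last item, dedicated tools such as iperf are the usual way to measure memory-to-memory rates between two hosts. As a self-contained illustration only, the Python sketch below performs a simple TCP throughput test; running it once in each direction exposes any asymmetry between inbound and outbound rates. The port and transfer size are placeholder assumptions, and this is not the tooling actually used for the Tier1 measurements.

  # Illustrative memory-to-memory throughput test between two hosts.
  # Run "python throughput.py server" on the far end, then
  # "python throughput.py client HOST" on the near end; swap roles to test
  # the other direction. Port and transfer size are arbitrary placeholders.
  import socket
  import sys
  import time

  PORT = 5201                      # assumption: any free TCP port
  CHUNK = 1024 * 1024              # 1 MiB per send
  TOTAL_BYTES = 512 * 1024 * 1024  # 512 MiB per test

  def serve():
      """Accept one connection, receive data and report the achieved rate."""
      with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
          srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
          srv.bind(("", PORT))
          srv.listen(1)
          conn, addr = srv.accept()
          with conn:
              received = 0
              start = time.time()
              while True:
                  data = conn.recv(CHUNK)
                  if not data:
                      break
                  received += len(data)
              elapsed = time.time() - start
              print(f"{addr[0]}: {received / elapsed / 1e6:.1f} MB/s inbound")

  def send(host):
      """Send TOTAL_BYTES of zeros to 'host' and report the achieved rate."""
      payload = bytes(CHUNK)
      start = time.time()
      with socket.create_connection((host, PORT)) as conn:
          sent = 0
          while sent < TOTAL_BYTES:
              conn.sendall(payload)
              sent += CHUNK
      elapsed = time.time() - start
      print(f"to {host}: {sent / elapsed / 1e6:.1f} MB/s outbound")

  if __name__ == "__main__":
      if sys.argv[1] == "server":
          serve()
      else:
          send(sys.argv[2])
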
Ongoing Disk Server Issues
  • None
Notable Changes made this last fortnight
  • On Thursday (1st Nov) a change was made to the EMI CREAM CEs to increase the number of FTP connections. This resolved a problem of intermittent SUM test failures.
  • On Thursday (1st Nov) a patch was applied to the FTS service that should cure the intermittent failures of the FTS system seen in recent weeks.
  • On Tuesday (6th Nov) a start was made on rolling out the use of hyperthreading on the worker nodes. SL09 machines are now running 10 jobs each (up from 8); Dell 11 machines (the original test batch) are now running 20 jobs each (up from 18). See the sketch after this list.
  • On Tuesday morning, 13th Nov, Castor & batch services were suspended around a network interruption while a board was changed in a network router. During the Castor stop the opportunity was taken to enable some statistics gathering on the Atlas SRM database and to make a change intended to resolve the database problems behind this service.
  • On Tuesday morning, 13th Nov, there was a minor upgrade to the CIP (Castor Information Provider) to fix a problem of accounting for nearline storage (affects LHCb, T2K & SNO+).
  • CVMFS continues to be available for testing by non-LHC VOs (including "stratum 0" facilities).
  • Test instance of FTS version 3 continues to be available.
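
As a rough illustration of the hyperthreading overcommit noted above, the sketch below assumes a Torque-style batch system in which job slots per worker node are set via "np=" entries in the server's nodes file. The hostnames and inventory handling are hypothetical; only the slot counts (SL09: 10, Dell 11: 20) come from this report.

  # Minimal sketch, assuming a Torque-style batch system where job slots per worker
  # node are set with "np=" entries in the server's nodes file. The inventory and
  # hostnames below are purely illustrative; the slot counts are those quoted above.
  SLOTS_BY_GENERATION = {
      "sl09": 10,    # up from 8 with hyperthreading overcommit
      "dell11": 20,  # up from 18 with hyperthreading overcommit
  }

  def nodes_file_lines(worker_nodes):
      """Map {hostname: hardware generation} to 'hostname np=N' nodes-file lines."""
      return [
          f"{host} np={SLOTS_BY_GENERATION[generation]}"
          for host, generation in sorted(worker_nodes.items())
      ]

  if __name__ == "__main__":
      # Hypothetical worker-node inventory, for illustration only.
      inventory = {
          "wn0001.example.rl.ac.uk": "sl09",
          "wn0002.example.rl.ac.uk": "dell11",
      }
      print("\n".join(nodes_file_lines(inventory)))
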
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • 20th November: Intervention on the "Essential Power Board" and investigation into the panel that controls the diesel generator cut-in, followed by a UPS load test.
  • Continued roll-out of the use of hyperthreading on the worker nodes.
  • Plans are advanced for a migration of worker nodes to EMI-2/SL5 and this will start soon.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • None
  • Networking:
    • Network trunking change as part of investigation (& possible fix) into asymmetric data rates.
    • Improve the stack 13 uplink.
    • Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network.
    • Update Spine layer for Tier1 network.
    • Replacement of UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Enabling overcommit on WNs to make use of hyperthreading (will be implemented after the CE upgrades are complete).
    • Migration to EMI software for worker nodes.
  • Infrastructure:
    • Intervention required on the "Essential Power Board" (scheduled for 20th November) & Remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.


Entries in GOC DB starting between 31st October and 14th November 2012

There were three unscheduled outages in the GOC DB for this period: one when there were problems with the Atlas SRM database (followed by an unscheduled warning period), and two as a result of the RAL site power cut.

Service Scheduled? Outage/At Risk Start End Duration Reason
All CEs (batch) and Castor SCHEDULED OUTAGE 13/11/2012 08:30 13/11/2012 09:30 1 hour Storage (Castor) and batch paused while network router card replaced.
lcglb01.gridpp.rl.ac.uk, SCHEDULED OUTAGE 10/11/2012 12:00 30/11/2012 14:00 20 days, 2 hours host retirement
lcglb02.gridpp.rl.ac.uk, SCHEDULED OUTAGE 10/11/2012 12:00 30/11/2012 14:00 20 days, 2 hours host retirement
All CEs (batch) and Castor UNSCHEDULED OUTAGE 08/11/2012 12:00 08/11/2012 14:00 2 hours Storage (Castor) and batch services still down following yesterday's Power Outage.
Whole site UNSCHEDULED OUTAGE 07/11/2012 11:15 08/11/2012 12:00 1 day, 45 minutes Power cut at RAL
srm-atlas UNSCHEDULED WARNING 04/11/2012 18:28 05/11/2012 12:00 17 hours and 32 minutes At-risk on ATLAS SRM following the problems on Oracle DB
srm-atlas UNSCHEDULED OUTAGE 04/11/2012 12:00 04/11/2012 18:29 6 hours and 29 minutes Outage while we investigate problems on the Oracle DB behind Atlas SRM
srm-lhcb SCHEDULED WARNING 31/10/2012 11:00 31/10/2012 12:00 1 hour Swapping one of the Castor headnodes back following repair after hardware failure.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
86690 Red Urgent In Progress 2012-10-03 2012-11-06 T2K JPKEKCRC02 missing from FTS ganglia metrics
86152 Red Less Urgent On Hold 2012-09-17 2012-10-31 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
31/10/12 100 100 100 100 100
01/11/12 97.0 100 100 95.9 100 OPS - Monitoring problem in Regional Nagios; CMS - timeout.
02/11/12 100 100 95.6 87.7 91.7 Mainly "could not open connection to srm-cms..." errors that correlate with failures at other sites.
03/11/12 100 100 93.8 95.9 91.8 Almost all "could not open connection to srm-cms..." errors that correlate with failures at other sites.
04/11/12 100 100 75.6 100 100 Problem with Atlas SRM database.
05/11/12 100 100 95.8 95.9 100 All "could not open connection to srm-cms..." errors that correlate with failures at other sites.
06/11/12 100 100 100 100 100
07/11/12 60.6 95.0 57.4 57.8 55.6 Site-wide power cut.
08/11/12 39.9 47.9 39.2 34.4 46.2 Site-wide power cut.
09/11/12 100 100 100 100 100
10/11/12 85.2 94.3 100 62.7 100 Mainly CE test failures following problem updating CRLs.
11/11/12 100 100 84.0 79.4 100 Atlas: Problem with SRM Database; CMS: "user timeouts" in Castor.
12/11/12 100 100 89.1 95.9 100 Atlas: Config error stopped CRL update; CMS: "user timeout" in Castor.
13/11/12 95.8 97.5 95.8 91.4 97.5 Scheduled outage for network router board swap.