
RAL Tier1 Incident 22nd October 2011: Castor Outage with RAC Nodes Crashing

Description:

The problems were caused by instabilities in the Oracle database infrastructure behind Castor. The Castor databases are divided across two Oracle RACs, and both RACs suffered nodes crashing and, in some cases, failing to reboot. The failures of nodes to reboot were caused by corrupt areas on a disk array used to stage backups.

Impact

This problem resulted in an outage of the Castor storage system. On the morning of Saturday 22nd October there were outages of both the Atlas and CMS Castor instances, both starting at around 04:00. The Atlas outage lasted around 3.5 hours, and the CMS outage a little over 4 hours.

There were then further problems that led initially to an outage of the CMS & GEN Castor instances (which was to last around 26 hours) and then to an outage of the remaining instances (Atlas & LHCb), which lasted around 16 hours. Services were restored for all instances and the outage in the GOC DB was ended at 21:30 on Sunday 23rd October.

Basic Timeline of the Incident

A more detailed timeline, plus other information relevant to the analysis of this incident, is given at the bottom of this page.

When What
22nd October 04:04:57 SCSI errors reported on all nodes in both Oracle RACs used by Castor, followed by reboots of three of the five nodes in one of the RACs (Neptune).
22nd October 04:24:24 Primary on-call called out for a test timing out, which then cleared itself. With no other errors reported, the Primary on-call concluded the time-out was a transitory problem and took no further action.
22nd October 06:30 Call-out for failure of SAM tests on srm-Atlas. Primary On-call investigated and followed up (contacting Castor expert on-call). An alarm ticket was received from Atlas during these investigations.
22nd October 07:36 Atlas Castor local nameserver restarted and Atlas Castor service restored.
22nd October 08:19 CMS Castor service restored following CMS local nameserver restart.
22nd October 11:10 & 11:12 Crash (without reboot) of two nodes in the Pluto RAC. Database failover, Castor services continue.
22nd October 17:44 Third node out of four in Pluto RAC crashed.
22nd October 19:30 Castor On-Call stops CMS & GEN Castor instances, followed by the PLUTO databases being stopped ahead of any hardware intervention.
23rd October 04:08 Crash (with no reboot) of node in Neptune RAC that is used for backups.
23rd October 05:45 Following a decision taken by the on-call team, all of Castor, as well as the databases, was stopped. It was not possible to declare an outage in the GOC DB at the time owing to it returning errors.
23rd Oct 09:00 (approx) Fabric Team personnel logged onto affected Pluto node via IPMI. Cause of failure to reboot found. Attempt to fix (fsck) from one remaining node that was up failed (node crashed and failed to reboot).
23rd Oct 14:00 and after Failing file systems removed from /etc/fstab and able to reboot all database servers that were down.
23rd Oct 17:00 With all nodes up in both RACs, the DB On-Call restarted the databases and re-configured the backups to use available NFS-mounted partitions, avoiding those that give problems. A backup was run successfully on PLUTO and one was being tested on NEPTUNE.
23rd Oct 17:55 Castor back up but only with internal interfaces open.
23rd Oct 18:15 Contact Grid Services team to reduce FTS channels to 25% of normal values ahead of Castor restart.
23rd Oct 19:45 On-call team review the situation. With systems stable for a couple of hours and with backups running every half hour, they agree to open Castor.
23rd Oct 20:30 Castor up. Kept under observation.
23rd Oct 21:30 Outage ended in GOC DB.

Incident details

Shortly after 04:00 on Saturday 22nd October the Primary On-call was alerted to a problem. The call-out was for a timeout on a test that then cleared itself, and after some investigation the on-call concluded there was no problem. However, three nodes in one of the Oracle RACs ("Neptune") that host the Castor databases had rebooted and some parts of Castor were not working correctly. Subsequently (around 06:30) another call-out was raised following failures of SAM tests, which was followed by an alarm ticket from Atlas. Neither the Atlas nor the CMS Castor instance was working until the problems were resolved by the Castor on-call, with the Atlas instance back at 07:36 and CMS at 08:19 that morning.

Later that day two nodes in the second Oracle RAC ("Pluto") behind Castor crashed and failed to reboot. The Oracle databases failed over and services continued until a third node crashed at 17:44 (leaving only one node remaining). Following this the CMS and GEN Castor instances were stopped.

Shortly after 04:00 the following morning (Sunday 23rd October) another node in the Neptune RAC crashed and failed to reboot. This stopped the database backups from running and a decision was taken to stop the remaining (Atlas & LHCb) Castor instances.

During that morning remote access to the nodes was gained and the cause of the reboot failures understood. Corrupt areas were found on a disk array (the "bulk array") used to stage backups out of the RACs; it was the attempts to mount these areas that blocked the reboots. At the beginning of the afternoon the crashed nodes were rebooted after the mounts of the failed areas were removed from the start-up configuration. The database backups were re-configured to use other, intact partitions on the bulk array, in a simpler configuration with the array mounted on only one node in each RAC. Once this had been done Castor was started and tested. Following a couple of hours of stability, during which the backups had been running every 30 minutes, Castor access was opened up, with the outage being ended in the GOC DB for all instances at 21:30 that day.
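
For illustration only, the change described above amounts to taking the corrupt bulk-array areas out of the boot-time mounts. A minimal sketch of the kind of edit involved is shown below; the mapper device and mount-point names are assumptions based on the multipath map names in the logs, not taken from the actual configuration:

 # Comment out the bulk-array mounts in /etc/fstab so the boot no longer
 # blocks waiting for the corrupt areas (sed keeps a backup of the original).
 # A line such as
 #   /dev/mapper/Bulk-Log-Neptune-B  /backup/bulk_b  ext3  defaults  1 2
 # becomes
 #   #/dev/mapper/Bulk-Log-Neptune-B  /backup/bulk_b  ext3  defaults  1 2
 sed -i.bak 's|^/dev/mapper/Bulk-Log|#&|' /etc/fstab
 grep Bulk-Log /etc/fstab   # confirm only commented-out entries remain
 reboot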

Analysis

The initial trigger for this incident can be seen from the logs to have taken place shortly after 04:00 on Saturday 22nd October. This was a problem with access to the "bulk array" (a disk array connected over the fibrechannel SAN, used to stage files for backups), seen in the logs as SCSI errors. Following this the multipath system checked its links and reported that the route to the bulk array was down. The cause of the initial connectivity problem to the bulk array is not known; no errors were reported by the array or by the fibrechannel SAN switch.
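
For reference, the state of the multipath devices on a node can be inspected with the standard device-mapper multipath tools; a minimal sketch (the map names seen in the logs, e.g. Bulk-Log-Neptune-B, would appear in the output):

 # List all multipath maps and the state of each underlying path;
 # a lost route to the bulk array shows its paths as failed/faulty.
 multipath -ll
 # The SCSI and multipathd messages quoted in the detailed timelines
 # further down this page come from syslog, e.g.
 grep -E "SCSI error|multipathd" /var/log/messages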

Shortly after this initial event three of the nodes in the Neptune RAC crashed. All nodes in both Oracle RACs made use of this array and all were eventually to crash. The reason for the crashes is not known, as no logging information was produced, but the nodes would have continued trying to do I/O to devices that were no longer available.

The three nodes that crashed at the start of the incident were able to reboot successfully and rejoin the Oracle RAC (Neptune). The rebooting of these nodes led to failovers of the Oracle databases, as expected. However, some of the Castor clients (NameServers in this case) were not able to cope with this, and outages resulted for both Atlas and CMS until resolved by the Castor expert on-call.

Some seven hours after the first event, just after 11:00, two nodes in the second Oracle RAC (Pluto) crashed and failed to reboot. The Oracle databases failed over correctly and Castor services carried on running. On-call staff re-configured the backups to work around the lack of availability of some areas on the bulk array and services continued. However, later that day, at 17:44, a third node of the four-node Pluto RAC crashed. With only one node remaining, the Castor services that require this RAC (the CMS & GEN instances) were stopped.

At 04:08 the following morning (Sunday 23rd) another node in the Neptune RAC crashed and failed to reboot. Following this crash the Oracle database backups could not run, and with this degradation it was decided to stop the remaining Castor services (the Atlas & LHCb instances).

Following some initial difficulty, remote access (using IPMI) was gained to the systems during Sunday morning. The cause of the reboot failures was understood to be a failure to mount the corrupt areas on the bulk array. The mounting of these areas was removed from the configuration and all the failed nodes booted. During the afternoon the backup scheme was modified to use uncorrupted areas on the bulk array that were still available, and this was tested. Castor services were then restarted, initially with only internal interfaces enabled. After the systems had been running for a couple of hours, with the half-hourly backups also running successfully, the Castor services were made available. As a precaution the number of channels on the RAL FTS server and the number of batch jobs at RAL were held down, being ramped back up to full production values by Tuesday lunchtime (25th).
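
The report does not state which backup tool was used; assuming the usual Oracle RMAN disk backups, redirecting them to an intact area mounted on a single node in each RAC might look something like the following sketch (the backup path is illustrative, not the real one):

 # Run on the one node per RAC that still mounts an intact backup area.
 rman target / <<'EOF'
 CONFIGURE CHANNEL DEVICE TYPE DISK FORMAT '/backup/intact_area/%d_%U';
 CONFIGURE CONTROLFILE AUTOBACKUP ON;
 BACKUP ARCHIVELOG ALL NOT BACKED UP 1 TIMES;
 EOF

The half-hourly backups referred to above would then be the regular archive-log backups run against this single destination.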

Follow Up

Issue: The initial failure was not recognized, which caused a delay in the response: database nodes had rebooted and errors were being recorded in the logs, but this went unnoticed.
Response: Investigate adding an alarm on the reboot of key systems, not just on their failure, along with a check of the log files (a sketch of such a check is given after this table).
Done: yes

Issue: Castor services not running, and SAM tests failing, for some time before call-out, resulting in a slower response.
Response: The current call-out system waits for multiple failures of SAM tests before calling out. This should be reviewed.
Done: yes

Issue: Delay in gaining IPMI access to the nodes. Although IPMI access is enabled on these systems, on-call staff did not have all the required information readily to hand.
Response: Documentation on IPMI access should be improved and tested.
Done: yes

Issue: The arrangement of the mounting of areas on the bulk array should be simplified so as to reduce the impact of this type of failure.
Response: The arrangement of disk mounts within each RAC was complex and may have exacerbated the problem of nodes failing to reboot. A simpler arrangement, with the bulk array mounted on only one node in each RAC, is preferable and this was put in place as the backups were re-configured.
Done: yes

Issue: Remove dependence on the bulk array.
Response: Preparations were already well advanced to migrate the databases onto newer hardware and provide resilience using Oracle Data Guard. These preparations are to be pushed ahead so that alternative hardware is available should there be another similar failure, and the databases can be migrated to the new infrastructure at the earliest suitable opportunity.
Done: yes
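
As a sketch of the first follow-up item, an alarm on unexpected reboots of key systems can be as simple as a check that flags a low uptime; the threshold and the Nagios-style exit codes below are assumptions, not the monitoring configuration actually deployed:

 #!/bin/sh
 # check_recent_reboot.sh - raise an alarm if this host rebooted recently.
 THRESHOLD=3600                               # seconds; illustrative value
 UPTIME_SECS=$(cut -d. -f1 /proc/uptime)      # whole seconds since boot
 if [ "$UPTIME_SECS" -lt "$THRESHOLD" ]; then
     echo "CRITICAL: host rebooted ${UPTIME_SECS}s ago"
     exit 2
 fi
 echo "OK: uptime ${UPTIME_SECS}s"
 exit 0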

Reported by: Gareth Smith. 30th October 2011.

Summary Table

Start Date 22nd October 2011
Impact >80%
Duration of Outage For the worst affected Castor instance (CMS): 30 hours
Status Closed
Root Cause Unknown
Data Loss No


DETAILED ANALYSIS

Oracle Databases to hosts mapping

Neptune

DB instance Host
neptune1 cdbc08.gridpp.rl.ac.uk
neptune2 cdbc09.gridpp.rl.ac.uk
neptune3 cdbc13.gridpp.rl.ac.uk
neptune4 castor151.ads.rl.ac.uk
neptune5 cdbc04.gridpp.rl.ac.uk

Pluto

DB instance Host
pluto1 cdbd01.gridpp.rl.ac.uk
pluto2 cdbe02.gridpp.rl.ac.uk
pluto4 cdbd02.gridpp.rl.ac.uk
pluto5 cdbc14.gridpp.rl.ac.uk

Neptune Timeline of the Incident

When What
22nd October 04:04:57 /var/log/messages on all Neptune nodes reports SCSI errors at around 04:05. This triggers multipath to scan all its paths; it reports that the path to the bulk array is lost and configures it down. SCSI errors continue on sdl and dm-9 (on cdbc09; device numbers differ slightly on other nodes) until the node restarts. There is no indication of why the nodes restarted. Only castor151 stays up (generating errors) until 04:07 the next day, when it re-boots. Extracts from the log on cdbc09 -
Oct 22 04:04:57 cdbc09 kernel: SCSI error : <7 0 3 0> return code = 0x10000
Oct 22 04:04:57 cdbc09 kernel: end_request: I/O error, dev sdl, sector 0
.
Oct 22 04:04:57 cdbc09 multipathd: Bulk-Log-Neptune-D: event checker started
Oct 22 04:04:57 cdbc09 multipathd: Bulk-Log-Neptune-B: event checker started
.
Oct 22 04:04:57 cdbc09 multipathd: 8:176: readsector0 checker reports path is down
Oct 22 04:04:57 cdbc09 multipathd: checker failed path 8:176 in map Bulk-Log-Neptune-B
Oct 22 04:04:57 cdbc09 multipathd: Bulk-Log-Neptune-B: Entering recovery mode: max_retries=5
Oct 22 04:04:57 cdbc09 multipathd: Bulk-Log-Neptune-B: remaining active paths: 0
Oct 22 04:04:57 cdbc09 multipathd: Bulk-Log-Neptune-B: Entering recovery mode: max_retries=5
.
Oct 22 04:05:22 cdbc09 multipathd: Bulk-Log-Neptune-B: Disable queueing
Oct 22 04:05:22 cdbc09 kernel: Buffer I/O error on device dm-9, logical block 0
.
Oct 22 04:05:30 cdbc09 kernel: Buffer I/O error on device sdm, logical block 1
Oct 22 04:05:30 cdbc09 kernel: SCSI error : <7 0 3 1> return code = 0x10000
Oct 22 04:10:59 cdbc09 syslogd 1.4.1: restart.
Oct 22 04:10:59 cdbc09 syslog: syslogd startup succeeded
Oct 22 04:10:59 cdbc09 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Oct 22 04:10:59 cdbc09 kernel: Linux version 2.6.9-89.0.9.ELsmp (mockbuild@x86-007.build.bos.redhat.com)
22nd October 04:06:04 EMC arrays report that nodes have been lost -
Event 720e has occurred on storage system FCNMM084700137 for device N/A at 10/22/11 03:04:04 
the event description is: Initiator (20:00:00:E0:8B:94:B0:E0:21:00:00:E0:8B:94:B0:E0) on Server
(cdbc09.gridpp.rl.ac.uk) registered with the storage system is now inactive. See the storage system Attention 
Required page for details. [note GMT][Note 0137 2min slow]
22nd October 04:06:11.472 cdbc13 (Neptune3) has lost contact with cdbc09:
 [cssd(13512)]CRS-1612:node cdbc09 (2) at 50% heartbeat fatal, eviction in 29.094 seconds
22nd October 04:06:40.901 cdbc09 is evicted from the RAC:
 [cssd(13512)]CRS-1601:CSSD Reconfiguration complete. Active nodes are cdbc08 cdbc13 castor151 cdbc04
22nd October 04:07:12.896 cdbc13 (Neptune3) has lost contact with cdbc04:
 [cssd(13512)]CRS-1612:node cdbc04 (5) at 50% heartbeat fatal, eviction in 29.104 seconds
22nd October 04:07:42.280 cdbc04 is evicted from the RAC
 [cssd(13512)]CRS-1601:CSSD Reconfiguration complete. Active nodes are cdbc08 cdbc13 castor151
22nd October 04:10:57.101 cdbc13 (Neptune3) has lost contact with cdbc08
 [cssd(13512)]CRS-1612:node cdbc08 (1) at 50% heartbeat fatal, eviction in 29.008 seconds
22nd October 04:11:26.673 cdbc08 is evicted from the RAC. Note at this stage Neptune is running only on two nodes:
 [cssd(13512)]CRS-1601:CSSD Reconfiguration complete. Active nodes are cdbc13 castor151
22nd October 04:12 Call-out generated (relating to cdbc09 not being available (a test timeout) which cleared).
22nd October 04:12:03.345 cdbc09 (Neptune2) can't read OCR partitions (/dev/raw/raw4 /dev/raw/raw3):
  [  OCROSD][3086927552]utopen:7:failed to open OCR file/disk /dev/raw/raw4 /dev/raw/raw3, errno=13, os err string=Permission denied
22nd October 04:12:36.892 cdbc09 is up and joins the RAC:
 [cssd(13512)]CRS-1601:CSSD Reconfiguration complete. Active nodes are cdbc09 cdbc13 castor151
22nd October 04:12:48.040 cdbc04 (Neptune5) can't read OCR partitions (/dev/raw/raw4 /dev/raw/raw3):
 [  OCROSD][3086902976]utopen:7:failed to open OCR file/disk /dev/raw/raw4 /dev/raw/raw3, errno=13, os err string=Permission denied
22nd October 04:13:29.214 cdbc04 is up and joins the RAC:
 [cssd(13512)]CRS-1601:CSSD Reconfiguration complete. Active nodes are cdbc09 cdbc13 castor151 cdbc04
22nd October 04:15:35.111 cdbc08 (neptune1) can't read OCR partitions (/dev/raw/raw4 /dev/raw/raw3):
 [  OCROSD][3086927552]utopen:7:failed to open OCR file/disk /dev/raw/raw4 /dev/raw/raw3, errno=13, os err string=Permission denied 
22nd October 04:16:14.201 cdbc08 is up and joins the RAC:
 [cssd(13512)]CRS-1601:CSSD Reconfiguration complete. Active nodes are cdbc08 cdbc09 cdbc13 castor151 cdbc04
22nd October 04:24:24 Primary on-call investigating.
22nd October 04:29:40 Primary on-call notes "Machine appears fine now. Was so busy that ganglia has no data for about 5 minutes around 04:05-04:10."

22nd October 06:30 Call-out for failure of SAM tests on srm-Atlas.
22nd October 06:42 Primary on-call Investigating
22nd October 07:00 Alarm ticket from Atlas
22nd October 07:08 CASTOR On-call Investigating
22nd October 07:23 GGUS Alarm ticket acknowledged by POC
22nd October 07:25 Looks like ATLAS has been dead since 04:13. Looks like the ATLAS name server client on catlasdlf was not working. I have restarted it and everything has sprung into life again. SAM test submitted - waiting for green light.
22nd October 07:36 Atlas Castor service restored.
22nd October 08:31:16 Neptune backup failed. One of the two backup partitions is lost.
22nd October 14:00 Declared 'Warning' in GOC DB for all Castor end points until Monday (25th October).
22nd October 14:10 Production Manager and Tier-1 Manager discuss situation - merits of at risk versus halting service. Production Manager holds token to decide.
23rd October 04:08:15.408(*) castor151 crash. EMC arrays lose contact with castor151. Rebooted at 15:05 on the 23rd (Tim).
23rd October 04:40 Alarms as node Castor151 has crashed and not rebooted. Remaining Castor instances (Atlas & LHCb) seem to be running. Contact database team (Carmine) who report that the database backups are not running. Following discussion decide to stop Atlas & LHCb Castor instances.
23rd October 05:45 Castor all down. Carmine then put databases in a safe configuration (down). Could not declare an outage in the GOC DB as that returned errors. Severely reduced numbers of batch jobs the LHC VOs could run to stop new work starting (and failing).
23rd Oct 14:00 and after After discussion with Gareth and Carmine, Fabric Team (Tim) edited /etc/fstab to remove the duff filesystem and rebooted. castor151 came up OK.
23rd Oct 17:00 With all nodes up in both RACs, DB On-Call (Carmine) has restarted the databases and reconfigured the backups to use available NFS-mounted partitions, avoiding those that give problems. The backup on NEPTUNE is running.
23rd Oct 17:35 Production Manager and Tier-1 Manager discuss merits of going live versus waiting until Monday. Production Manager holds the token to decide.
23rd Oct 17:55 Castor back up but only with internal interfaces open.
23rd Oct 18:15 Contact Grid Services team to reduce FTS channels to 25% of normal values ahead of Castor restart.
23rd Oct 19:45 Review of Situation between Primary, Castor & Database On-Call people (Gareth, Matthew, Carmine). System has been stable for couple of hours with backups running every half hour. Agree to open Castor.
23rd Oct 20:30 Castor up. Kept under observation.
23rd Oct 21:30 Outage ended in GOC DB.
23rd Oct 22:00 Batch limits for LHC VOs partially lifted.
27th Oct 04:14 cdbc13 crash. It had been generating SCSI errors since the initial errors on the bulk array. It comes up OK.

(*) Time as detected by an alive node in the Oracle RAC

Pluto Timeline of the Incident

When What
22nd October 04:05 /var/log/messages on all Pluto nodes reports SCSI errors at 04:04:57. This triggers multipath to scan all its paths; it reports that the path to the bulk array is lost and configures it down. SCSI errors continue on sdl and dm-13 (on cdbc14; device numbers differ slightly on other nodes) until the node restarts. There is no indication of why the nodes restarted. Unlike Neptune, the Pluto nodes did not reboot at the time of the initial errors.
22nd October 07:40 Call-out on failure of tests on srm-cms.
22nd October 08:19 CMS Castor service restored.
22nd October 11:10:52.657 (*) cdbc14 (pluto5) crashed. The service failed over and Castor services carried on uninterrupted.
22nd October 11:12:48.349 (*) cdbd02 (pluto4) crashed. The service failed over and Castor services carried on uninterrupted. DB Team On-Call (Carmine) confirmed that the database was up OK and systems were writing to both EMC disk arrays. He was able to re-configure the backups to get round a problem of one of the partitions used being read-only.
22nd October 14:00 Declared 'Warning' in GOC DB for all Castor end points until Monday (25th October).
22nd October 17:44:31.966 (*) cdbd01 (pluto1) crashed.
22nd October 17:54 Alarm following crash of cdbd01. This is one of the two remaining nodes in the PLUTO RAC.
22nd October 19:30 Castor on Call stops CMS & GEN Castor instances, followed by the PLUTO databases being stopped ahead of any hardware intervention.
22nd October 21:40 Fabric Team (Tim) has checked - the database disk arrays (EMCs) look OK as does the fibrechannel switch.
23rd October 05:45 Castor all down. Could not declare an outage in the GOC DB as that returned errors. Severely reduced numbers of batch jobs the LHC VOs could run to stop new work starting (and failing).
23rd Oct 09:00 (approx) Logged onto affected node via IPMI. Servers would not boot due to a filesystem not being available. Tried fsck'ing from cdbe02 but that caused the server to re-boot and not come up.
23rd Oct 14:00 and after After discussion with Gareth and Carmine, Fabric Team (Tim) edited /etc/fstab to remove the duff filesystem and rebooted. All servers in the PLUTO RAC came up OK. The same was done to castor151 (NEPTUNE RAC), which had similar problems.
23rd Oct 17:00 With all nodes up in both RACs, DB On-Call (Carmine) has restarted the databases and reconfigured the backups to use available NFS-mounted partitions, avoiding those that give problems. A backup has been run successfully on PLUTO and one is being tested on NEPTUNE.
23rd Oct 17:55 Castor back up but only with internal interfaces open.
23rd Oct 18:15 Contact Grid Services team to reduce FTS channels to 25% of normal values ahead of Castor restart.
23rd Oct 19:45 Review of Situation between Primary, Castor & Database On-Call people (Gareth, Matthew, Carmine). System has been stable for couple of hours with backups running every half hour. Agree to open Castor.
23rd Oct 20:30 Castor up. Kept under observation.
23rd Oct 21:30 Outage ended in GOC DB.
23rd Oct 22:00 Batch limits for LHC VOs partially lifted.

(*) Time as detected by an alive node in the Oracle RAC

Further incident details

File system errors were reported. The first occurrences are shown here; many more followed.

Oct 22 04:05:05 cdbd02 kernel: EXT3-fs error (device dm-14): ext3_journal_start_sb: Detected aborted journal
Oct 22 04:06:40 cdbe02 kernel: EXT3-fs error (device dm-14): ext3_journal_start_sb: Detected aborted journal
Oct 22 04:12:29 cdbd01 kernel: EXT3-fs error (device dm-16): ext3_find_entry: reading directory #2 offset 1
Oct 22 04:12:29 cdbc14 kernel: EXT3-fs error (device dm-13): ext3_find_entry: reading directory #2 offset 1
Oct 22 08:09:49 castor151 kernel: EXT3-fs error (device dm-19): ext3_journal_start_sb: Detected aborted journal

The EMC arrays reported that servers disappeared at about the same time (note: times in GMT):

Event 720e has occurred on storage system FCNMM084700137 for device N/A at 10/22/11 03:04:04 The event description is: Initiator (20:00:00:E0:8B:94:B0:E0:21:00:00:E0:8B:94:B0:E0) on Server (cdbc09.gridpp.rl.ac.uk) registered with the storage system is now inactive. See the storage system Attention Required page for details.

IPMI details for the DB servers were found. Rebooting the servers had no effect: logs showed that the servers would not reboot as one of the file-systems was not available. An attempt to fsck the affected file-system was unsuccessful and caused cdbe02 to go off-line. The dead file-system was then commented out of fstab, which allowed the servers to be re-booted and to come up.
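
For reference, the IPMI access described here is typically driven with the standard ipmitool client; a minimal sketch (the BMC host name, user and password are placeholders):

 # Check the power state of a hung DB node out-of-band.
 ipmitool -I lanplus -H cdbd01-bmc.example -U admin -P 'secret' chassis power status
 # Open a serial-over-LAN console to watch where the boot blocks.
 ipmitool -I lanplus -H cdbd01-bmc.example -U admin -P 'secret' sol activate
 # Power-cycle the node if it is wedged.
 ipmitool -I lanplus -H cdbd01-bmc.example -U admin -P 'secret' chassis power cycle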

The same was done for castor151, which had lost one of its two logging areas. All the other Neptune nodes were OK.

The databases were restarted without the bulk logging areas available.

The Neptune nodes cdbc08, cdbc09 and cdbc04 were rebooted by Oracle because they could not read the OCR partitions. All the nodes showed the same error message:

 Oracle Database 10g CRS Release 10.2.0.5.0 Production Copyright 1996, 2008 Oracle.  All rights reserved.
 2011-10-22 04:15:35.111: [  OCROSD][3086927552]utopen:7:failed to open OCR file/disk /dev/raw/raw4 /dev/raw/raw3, errno=13, os err string=Permission denied
 2011-10-22 04:15:35.111: [  OCRRAW][3086927552]proprinit: Could not open raw device
 2011-10-22 04:15:35.111: [ default][3086927552]a_init:7!: Backend init unsuccessful : [26]
 2011-10-22 04:15:35.111: [ CSSCLNT][3086927552]clsssinit: Unable to access OCR device in OCR init.PROC-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]

NOTE: this error appears after the server has been rebooted, so it is an effect, not a cause.
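
Since the OCR errors point at the raw device bindings rather than the underlying storage, a quick post-reboot check on a node would be along these lines (the expected ownership by the Oracle software owner is an assumption about the local setup):

 # List the configured raw device bindings (the OCR uses /dev/raw/raw3 and raw4 here).
 raw -qa
 # Check that the bindings exist and are readable by the Oracle software owner;
 # errno=13 (Permission denied) would correspond to wrong ownership or mode,
 # for example if the step that resets raw-device permissions had not yet run.
 ls -l /dev/raw/raw3 /dev/raw/raw4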

The crashes of castor151 (Neptune) and of cdbd01, cdbd02 and cdbc14 (Pluto) did not generate any logging on the Oracle side.