RAL Tier1 Incident 20100129 Extended outage migrating Castor databases

Problems Migrating Castor Databases led to Extended Outage

Site: RAL-LCG2

Incident Date: 2010-01-29

Severity: Severe

Service: CASTOR - all instances

Impacted: All VOs.

Incident Summary: A scheduled outage to migrate the Castor Databases back to their original disk arrays encountered significant problems resulting in an extended outage.

Type of Impact: Loss of Service

Incident duration: 5 days overrun on a scheduled two-day outage.

Report date: 2010-02-01. Updated 2010-02-09.

Reported by: Gareth Smith

Related URLs: Post Mortem for Castor Disk Array Failure: http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20091004

Incident details:

A planned intervention on Wednesday & Thursday 27/28 January overran significantly. The relevant item of work was the migration of the Castor Oracle databases back onto the disk arrays intended for that purpose. The databases had been moved elsewhere temporarily following problems in October 2009, an incident detailed in a previous Post Mortem. The end point of the migration was to have the databases running on a hardware configuration (Oracle RAC, Storage Area Network, disk array pair) essentially the same as that in use before October 2009. This configuration has two databases (called Neptune and Pluto), each with a five-node Oracle RAC, using a common SAN and pair of disk arrays.
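
For reference, the state of a configuration like the one described above can be inspected with the standard Oracle Clusterware command-line tools. The sketch below is purely illustrative and was not part of this incident: the database names Neptune and Pluto come from this report, while everything else (the helper function, and the assumption that crsctl and srvctl are on the PATH of a RAC node) is an assumption.

#!/usr/bin/env python3
"""Illustrative status check for the two-database (Neptune/Pluto) RAC setup.

A minimal sketch, assuming the standard Oracle Clusterware tools (crsctl,
srvctl) are available on a RAC node; not a tool used in this incident.
"""
import subprocess

DATABASES = ["neptune", "pluto"]  # two databases, each on a five-node RAC (per the report)


def run(cmd):
    # Run a command and return its stdout, raising if the command fails.
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


if __name__ == "__main__":
    # Local Clusterware stack health (CRS, CSS and EVM daemons).
    print(run(["crsctl", "check", "crs"]))
    # Which instances of each database are running, and on which nodes.
    for db in DATABASES:
        print(run(["srvctl", "status", "database", "-d", db]))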

There was an initial delay in commencing the database migration during the scheduled outage: the configuration of the disk area mounts on the Oracle RAC nodes had not been completed on schedule. Rather than starting at around 10:00, the data migration of the first of the pair of databases began at around 14:30 on the first day, and that of the second at 17:00.

The data migration was successful and was completed by the morning of Thursday 28th January. Following a reboot, there was a problem accessing the database on Neptune that took several hours to resolve; this was traced to an undocumented Oracle issue.

The planned resilience testing was then carried out and showed unexpected results. Removal of one of the RAC nodes from Pluto caused the whole RAC to crash. On Neptune the same test caused some, but not all, of the other RAC nodes to crash. During the following night the SAN multipath started 'flapping' between routes. The following day (Friday 29th) this was traced to a hardware fault that had developed on one of the RAC nodes. The faulty node was then removed from the RAC.
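
The 'flapping' described above shows up as repeated path state changes in the Linux device-mapper multipath listing. As a purely illustrative sketch (not the monitoring actually used at RAL), something along the following lines could poll the output of multipath -ll and flag paths whose state keeps changing; the polling interval, threshold and output parsing are all assumptions.

#!/usr/bin/env python3
"""Rough sketch of a multipath 'flapping' detector.

Assumes Linux dm-multipath and its 'multipath -ll' listing; polls the
output and reports paths whose state keeps changing. Illustrative only.
"""
import re
import subprocess
import time
from collections import defaultdict

POLL_SECONDS = 30     # how often to sample the path states (assumed value)
FLAP_THRESHOLD = 3    # state changes before a path is reported as flapping

# Individual path lines in 'multipath -ll' output look roughly like:
#   |- 3:0:0:1 sdc 8:32 active ready running
PATH_RE = re.compile(r"\d+:\d+:\d+:\d+\s+(\S+)\s+\S+\s+(active|failed)")


def path_states():
    out = subprocess.run(["multipath", "-ll"], capture_output=True, text=True).stdout
    return dict(PATH_RE.findall(out))  # device name -> current state


def monitor():
    changes = defaultdict(int)
    previous = path_states()
    while True:
        time.sleep(POLL_SECONDS)
        current = path_states()
        for dev, state in current.items():
            if dev in previous and previous[dev] != state:
                changes[dev] += 1
                if changes[dev] >= FLAP_THRESHOLD:
                    print(f"WARNING: path {dev} has changed state {changes[dev]} times")
        previous = current


if __name__ == "__main__":
    monitor()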

During Friday and over the weekend of 30/31 January investigations took place to try to understand the causes of the lack of resilience. These were largely overcome by turning off the multipath capabilities of the SAN. Nevertheless a failure of one of the pair of disk arrays still caused the Pluto RAC to fail, and on some occasions Neptune would also fail. The lack of multipath means there is only limited resilience to failures in the fibrechannel SAN. (These limitations remain in the configuration returned to production on 2nd February.)

Following the re-stabilisation of the system, final checks were made. The databases were tested and verified to be synchronised with Castor. The Castor Nameserver was moved from Pluto to Neptune as that system shows better resilience. Castor services were resumed at around 11:00 on Tuesday 2nd February. During that afternoon FTS channels were re-opened and the batch service was restarted.

Since services were restarted, work has been ongoing to find the cause of the instabilities. A review of the configuration of the production system is taking place. In addition, a separate test instance, as similar to production as available hardware will allow, is being created to try to replicate the problem.

Future mitigation:

The RAL Tier1 has recently introduced a formal change control procedure. However, this database migration, which had been planned for some time, pre-dated that process and was not reviewed by it. One component of the change, relating to the configuration for mounting disk areas, had been reviewed; despite this, some aspects of this part of the change had not been sufficiently resolved ahead of the intervention. It is essential that all changes are effectively reviewed by the change control process.

A significant amount of resilience testing had taken place ahead of the intervention, driven by the problems of last October. Those tests did show the systems had the expected resilience. However, the tests used the same disk arrays as in the final configuration but a different test RAC with fewer nodes. The ability of those tests to replicate issues in the production system needs to be reviewed.
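
One way to make such resilience tests directly comparable between the test RAC and the production RACs would be to script the node-removal test so that it can be repeated identically on both. The outline below is hypothetical and is not the procedure that was used: the instance names, timings and srvctl-based checks are assumptions, with only the database name and the five-node RAC size taken from this report.

#!/usr/bin/env python3
"""Hypothetical outline of a repeatable RAC node-removal resilience test.

Stops one instance at a time and checks that the remaining instances stay
up. Assumes standard srvctl tooling and illustrative instance names; not
the test procedure actually used at RAL.
"""
import subprocess
import time

DB = "neptune"                                   # database name from the report
INSTANCES = [f"{DB}{i}" for i in range(1, 6)]    # assumed naming for the five-node RAC


def srvctl(*args):
    return subprocess.run(["srvctl", *args], capture_output=True, text=True).stdout


def running_instances():
    # 'srvctl status database -d <db>' prints one line per instance.
    out = srvctl("status", "database", "-d", DB)
    return {line.split()[1] for line in out.splitlines() if " is running" in line}


def test_node_removal(instance):
    print(f"Stopping {instance} ...")
    srvctl("stop", "instance", "-d", DB, "-i", instance)
    time.sleep(60)                               # allow cluster reconfiguration to settle
    survivors = running_instances()
    expected = {i for i in INSTANCES if i != instance}
    ok = expected <= survivors
    print(f"{instance} removed: running = {sorted(survivors)} -> {'PASS' if ok else 'FAIL'}")
    srvctl("start", "instance", "-d", DB, "-i", instance)
    return ok


if __name__ == "__main__":
    results = [test_node_removal(inst) for inst in INSTANCES]
    print("All node-removal tests passed" if all(results) else "Some node-removal tests FAILED")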

Related issues:

  • The databases for other similar services (LFC, FTS and 3D) were migrated back to virtually identical hardware without significant issues.
  • The outage of RAL Castor meant that the Tier1 batch services were also down.

Timeline

Date Time Comment [Who]
2010-01-27 08:00 Start of original scheduled outage of Castor as well as LFC & FTS plus other services.
2010-01-27 10:30 Start of data migration for Neptune database. [Database Team]
2010-01-27 17:00 Start of data migration for Pluto database. [Database Team]
2010-01-27 18:12 Other planned work had continued. LFC and FTS returned to service following database migration. [Database Team]
2010-01-28 09:00 Following migration of the Neptune database, upon a reboot of the RAC a problem was found with the disk headers that took about 3 to 4 hours to resolve. CERN & Oracle contacted; traced to an undocumented Oracle issue. (In the meantime work continued on the Pluto database.) Following this, resilience testing was carried out on both databases. [Keir Hawker & Database Team]
2010-01-28 10:30 Advisory issued that the outage could overrun. Outage in the GOC DB extended to 14:00 on 29/1/10. Notifications sent. [Gareth Smith]
2010-01-28 15:00 Resilience test failures. [Database & Castor Teams]
2010-01-28 17:00 End of original scheduled outage.
2010-01-29 03:00 'Flapping' of multipath routes started.
2010-01-29 10:30 Review of status. 'Pluto' will only run with one database node; the RAC cannot be brought into production. Multipath 'flapping'. Also an issue of Oracle losing contact with ASM on boot of the third node in the RAC. Downtimes extended to after the weekend (Monday 1st February). [Andrew Sansum & team]
2010-01-29 13:50 Review of status. 'Pluto' still the same. 'Neptune' down but operable. [Andrew Sansum & team]
2010-01-29 15:00 Multipath 'flapping' fixed by removal of the faulty node ("D03") from the Pluto RAC. Review of status. Attempts to resolve the issue on Pluto unsuccessful. A final test on Neptune shows the same behaviour (RAC does not survive restart of a single node). Downtime extended to Wednesday (3 Feb) 14:00. [Andrew Sansum & team]
2010-01-30 21:00 From information provided by CERN it was discovered that a stable configuration (resilient to removal of a RAC node) could be maintained using a single path instead of multipath over the SAN. Following various investigations into this, one of the pair of fibrechannel switches was turned off. This gave the systems resilience against the failure of a RAC node. [Castor Team]
2010-01-31 18:00 During the day various configuration issues were explored, further resilience tests were carried out and appropriate re-synchronisation of the databases was performed. [Castor Team]
2010-01-31 22:00 Databases tested and found to be in sync with Castor. [Matthew Viljoen & Castor Team]
2010-02-01 10:00 Review of status. Significant progress over the weekend. Both the 'Pluto' and 'Neptune' RACs stable against nodes restarting. Multipath disabled on the SAN. Some resilience issues remain if either disk array fails. Castor checks all OK for all instances except LHCb (connection issues to database). [Gareth Smith & team]
2010-02-01 15:30 Review of status. The Castor Nameserver has been migrated from Pluto to Neptune as that shows somewhat better behaviour. [Gareth Smith & team]
2010-02-02 09:00 Review of status. Final issues (location of Oracle archive logs outside ASM, connection issues) resolved. Castor testing OK. [Gareth Smith & team]
2010-02-02 11:00 Castor restart. [Castor Team]