RAL Tier1 Incident 20110927 Atlas Castor Outage DB Inconsistent


RAL Tier1 Incident 27th September 2011: Atlas Castor Outage Caused by Database Inconsistency.

Description:

The Atlas Castor instance stopped working at around 04:00 local time on 27th September. A callout was raised, but as this came at the time staff were about to travel to work, investigation of the problem only began at the start of normal working hours. The problem was traced to an inconsistency in the Oracle database behind Castor. Once this was resolved the service was restarted and the Outage was ended in the GOC DB at 15:30 that day.

Impact

Atlas Castor instance down for 11 hours 30 minutes.

Timeline of the Incident

When What
All times are for Tuesday 27th September and are given in BST.
03:38 Stager started processing request ad8519c1-4920-04b6-e040-f6824eb46c3f
03:40 Stager picked up same request again
03:41 First jobs start failing to be scheduled in ATLAS job manager: MSG="Exception caught selecting a new job to schedule in DispatchThread::select" Type="Internal error" Message="Error caught in jobToSchedule. ORA-01403: no data found"
04:00 FTS transfer failures seen in Atlas channels. (Seen later in FTS ganglia plots).
04:57 Errors seen on Atlas Stager Instance (Neptune3): Tue Sep 27 04:57:13 BST 2011 ORA-01555 caused by SQL statement below (SQL ID: bzwjdmpu2wxk5, Query Duration=2785 sec, SCN: 0x000a.72413bd6): Tue Sep 27 04:57:13 BST 2011 SELECT id FROM SubRequest WHERE request = :1 FOR UPDATE. This is not a problem in itself and has been seen before on other systems, but it is not very common. The database team see this but do not notify the Castor team as it is not a serious error.
05:33 GGUS Team Ticket raised by ATLAS (#74686)
06:30 RT On-call ticket raised due to SAM tests failing (#88931) - On-call paged. The person on-call decides to postpone investigations until staff arrive at work.
07:45 GGUS ticket escalated to an ALARM ticket.
08:23 GGUS ticket acknowledged by RAL Tier1 On-call.
08:25 CASTOR staff member (Matthew) arrives at work, starts investigations and realizes that no jobs are being processed within CASTOR Atlas. Restarting the head node services does not help.
09:06 GOC DB Outage declared on srm-atlas.gridpp.rl.ac.uk. Start time set to 04:00 (BST), end time 11:00.
09:30 Shaun tells Keir that there is a serious issue with logical corruption in the Atlas Stager Database. Castor Atlas is now in downtime. There are missing entries in id2type that should correspond to sub-requests. More investigation is needed.
09:45 Decision to populate missing sub-request entries in id2type is taken. Keir locks the stager atlas account and backs up the castor_stager_atlas user ahead of doing this.
09:49 GOC DB Outage extended (by adding another entry) until 17:00 (BST).
09:50 Shaun De Witt/Chris Kruk bring down the Atlas SRM front nodes as they cannot use the stager.
10:00 Brian Davies stops all Castor services on the DLF, LSF and stager machines.
10:10 Export completed and the script to repopulate id2type with data from SubRequest begins (a sketch of this kind of repair is given after the timeline).
10:13 Re-population completed. Brian Davies told to restart the service (as Shaun is in an interview and Chris is in a meeting).
~10:30 Contact Castor experts at CERN for advice.
10:45 Brian reports memory issues and that more experienced help is needed to restart: the shared memory segment only had one attachment. The rmmasterd on the LSF machine is found to have failed to restart, and successive stop/starts fail. Eventually the rmmasterd restart (and a subsequent LSF restart) is successful. Keir mentions many SRM DB deadlocks are still in place which could do with clearing. An lcg-cp test by Brian Davies still fails.
10:55 Alastair Dewhurst pulls Chris Kruk out of the meeting to restart the Castor service. It is unclear whether the problem was described to CERN at this point.
11:00 Service restarted but problems persist. More investigation done by Chris Kruk
11:37 Decision taken to rebuild id2type from scratch using the upgrade script (2.1.8-4 to 2.1.8-6).
12:15 Rebuild complete, with no more orphaned subrequests in the id2type table. Chris Kruk notified and Castor restarted. The problem is still seen; Chris Kruk investigates.
13:20 Shaun pulled out of interview to advise.
14:30 Shaun has received no update from CERN by this time and decides to fail all pending sub-requests and restart Castor. This is done and testing shows the problem has gone.
15:00 Testing of Castor ongoing. SAM tests start to be successful.
15:30 GOC DB Outage ended. The GOC DB interface defaults then set the site to Warning.
15:55 GGUS ticket set to solved, with the text: "Atlas castor services are now back up. SAM tests are passing. The outage has been ended in the GOC DB. Setting ticket solved."
16:08 Atlas restart their requests in the system. High load seen but no other problems.
16:55 ELOG entry indicating Atlas RAL FTS channels set up to 50% of nominal values.
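
The repopulation step at 10:10 refilled id2type entries for SubRequest rows that had lost them. The exact script run on the day is not reproduced here; the following is a minimal sketch of that kind of repair, assuming an id2type table with (id, type) columns and that the stager account has already been locked and exported, as was done at 09:45. The numeric type code used for SubRequest rows is an illustrative placeholder and would need to be taken from the live castor_stager_atlas schema.

 -- Hedged sketch only, not the script actually used during the incident.
 -- The value 27 is an assumed type code for SubRequest objects; check the
 -- live id2type contents for the real value before running anything like this.
 INSERT INTO id2type (id, type)
 SELECT sr.id, 27
   FROM SubRequest sr
  WHERE NOT EXISTS (SELECT 1 FROM id2type i WHERE i.id = sr.id);
 COMMIT;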

Incident details

The incident started shortly before 04:00 on Tuesday 27th September when the Atlas Castor stager picked up the same request twice. The monitoring system detected failing SAM tests and alerted the Primary On-call person at around 06:30. There were no other alarms apart from the failing SAM tests. As staff would be on their way to work soon, the Primary On-call decided to wait until staff were at work to commence investigations. Atlas submitted a GGUS Team ticket at around 05:30; however, this does not alert the out-of-hours on-call. Atlas converted the ticket to an Alarm ticket at 07:45, which does page the on-call person, although they were already aware of the problem by then. During the morning the Castor and Database teams investigated the problem and produced a solution. However, this work was made more difficult as some of the Castor team were interviewing candidates for new staff posts that day.

A number of database operations were carried out with the aim of clearing the problem. In the end the affected database table (id2type) was rebuilt from scratch. This still did not resolve the problem, and it was necessary to fail all pending sub-requests and restart the Atlas Castor instance before the system would run.
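
The final clearing action amounted to marking every still-pending sub-request as failed in the stager database before restarting Castor. A minimal sketch of what such an update might look like is given below. The CASTOR stager stores sub-request states as numeric codes; the values used here are illustrative assumptions, not the confirmed codes, and the exact commands used on the day are not recorded in this report.

 -- Hedged sketch of "fail all pending sub-requests"; not the exact commands used.
 -- Status codes 5 and 6 stand in for the waiting/pending states and 7 for the
 -- failed state; the real numeric values must be checked against the live schema.
 UPDATE SubRequest sr
    SET sr.status = 7
  WHERE sr.status IN (5, 6);
 COMMIT;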

Analysis

A GET request was submitted for a file on a draining disk server. This creates two subrequests in CASTOR: one for the disk-2-disk copy and one for the actual get. The second subrequest is put into the state WAITSUBREQ, waiting for the disk copy subrequest. In this case (and on post-analysis of a similar incident in July), the disk copy subrequest completed successfully. Before the second subrequest was started, the user issued an ABORT request for the transfer. The processing of the ABORT removed the entry in the id2type table, but did not correctly update the status of the second subrequest to indicate it had been aborted. Hence the CASTOR JobManager correctly identified this as a pending subrequest and tried to process it, but the abort had deleted the id2type entry. What is not currently clear is why the pending subrequest was not correctly updated to indicate it had been aborted, but the suspicion is a race condition occurring ONLY when an abort arrives for a disk-2-disk copy and an associated GET request.

While it has not been proven, there exists a suspicion that ABORTs can leave subrequests with missing entries in id2type which never get cleared. In most cases this is not an issue; it only matters when the aborted subrequest has another request pending upon it. This is normally only the case for disk-2-disk copies or tape recalls, and the handling of abort requests for the latter is significantly different and may not be subject to the same problem.
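
This suspicion motivates the monitoring follow-up action listed below. A periodic query of roughly the following form would expose the inconsistency seen in this incident: sub-requests that the JobManager will still try to schedule but whose id2type entry has been deleted. The table and column names match those quoted earlier in this report; the status codes are again illustrative assumptions rather than confirmed values.

 -- Hedged sketch of a consistency check: count sub-requests in a schedulable
 -- state that have no id2type entry. A non-zero result corresponds to the
 -- "ORA-01403: no data found" errors seen in the JobManager at 03:41.
 SELECT COUNT(*) AS orphaned_pending
   FROM SubRequest sr
  WHERE sr.status IN (5, 6)
    AND NOT EXISTS (SELECT 1 FROM id2type i WHERE i.id = sr.id);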


There was some delay in getting staff fully working on this issue. This was caused by a number of factors:

  • In order to reduce false alarms, the existing tests wait for multiple SAM test failures before calling out.
  • The arrival time of the first alarm to the on-call person was shortly before staff would depart for work anyway.
  • Some of the key staff were interviewing that morning.


Follow Up

Issue: Root Cause
Response: The root cause of the incident needs to be understood so that a fix can be implemented. (Note, September 2012: this problem has not been seen since this incident. Castor will soon be upgraded to version 2.1.12, which removes the "id2type" table involved in this incident. Further investigation is not worthwhile.)
Done: N/A

Issue: Time to understand and resolve the issue.
Response: Document how to identify this type of problem and the procedure to resolve it.
Done: Yes

Issue: Time taken to identify a complete failure of the Atlas Castor instance.
Response: Set up a Nagios test to monitor for this type of failure and call out.
Done: Yes

Issue: Reduce the chance of this happening out of hours.
Response: Since this incident appears to have happened due to a disk server in Draining, until the root problem is fixed we will change our procedure to reduce the number of production disk servers in draining out of hours.
Done: Yes

Reported by: Gareth Smith 28th September 2011

Summary Table

Start Date 27 September 2011
Impact >80%
Duration of Outage 11 hours 30 minutes
Status Closed
Root Cause Software Bug (not yet confirmed)
Data Loss No