RAL Tier1 Incident 20130628 Atlas Castor Outage

Description

The ATLAS CASTOR instance encountered a problem where large numbers of invalid subrequests got stuck in the stager database, causing the instance to become blocked. After a few hours of worsening service, the instance became completely inaccessible around lunchtime on Friday 28th June. This outage continued until Saturday evening, when a fix was found and service restored. The instance was left active but in downtime overnight for observation, and the downtime was ended on Sunday morning.

The problem was preceded by a large increase in memory usage by the stagerd daemon. This symptom was initially dealt with by the on-call team by restarting the daemon, but while this alleviated the symptoms, it did not clear the root cause (now known to be invalid subrequests), and so the daemon's memory usage continued to increase after being reset to zero by the restart.
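
For illustration, a periodic check of the daemon's resident memory would make this kind of anomalous growth visible well before the instance blocks. The Python sketch below is a minimal example only, assuming the psutil library, a process named 'stagerd' and a hypothetical 10 GB alarm threshold; it is not the monitoring actually deployed at RAL.

import psutil

RSS_LIMIT_BYTES = 10 * 1024 ** 3  # hypothetical alarm threshold (10 GB)

def check_stagerd_memory():
    # Walk the process table looking for the stager daemon.
    for proc in psutil.process_iter(['name', 'memory_info']):
        if proc.info['name'] != 'stagerd':
            continue
        mem = proc.info['memory_info']
        if mem is None:
            continue
        state = "WARNING" if mem.rss > RSS_LIMIT_BYTES else "OK"
        print("%s: stagerd rss=%.1f GB" % (state, mem.rss / float(1024 ** 3)))

if __name__ == '__main__':
    check_stagerd_memory()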

Problems then resurfaced early on Friday afternoon, and the staff present were unable to resolve the issue. After several downtime extensions, Shaun found a fix at around 23:00 on Saturday evening by deleting all invalid subrequests from the ATLAS stager database. This restored service, and the downtime was ended at 11:36 on Sunday morning.

Impact

The ATLAS CASTOR instance was completely unavailable from 14:30 on 28th June to 23:00 on 29th June, with a period of degraded service beforehand. The outage was not ended until 11:36 on Sunday 30th June, after on-call staff had checked that the Atlas Castor instance had run OK overnight.

Timeline of the Incident

When What
27th June 01:00 Steady increase in memory usage on the ATLAS stager machine turns into a sharp spike. Historically, stagerd's memory usage increases at roughly 3 GB/month, but over the next 24 hours the memory usage on lcgcstg01 increases from 2 GB to 12 GB.
28th June 01:47 Test for swapping on ATLAS stager fires and reports that the machine has a problem.
28th June 05:45 Primary on-call (Ian) acknowledges alarm and calls CASTOR on-call (Rob) as per spreadsheet. After a brief discussion Ian restarts the stagerd process on lcgcstg01. Swap frees up, the alarm clears. However, shortly after the restart the machine's memory usage starts to rapidly increase once again.
28th June 11:44 First appearance of the following error message in transfermanager logs on lcgclsf01 (staff unaware of this at this point):

LVL=Error TID=6528 MSG="Exception caught in Dispatcher thread" Message="ORA-01403: no data found ORA-06512: at 'CASTOR_STAGER_ATLAS.TRANSFERTOSCHEDULE', line 93 ORA-06512: at line 1 " Type="cx_Oracle.DatabaseError"

28th June 12:00 Failing ATLAS tests put the first of the ATLAS batch queues offline.
28th June 12:57 GGUS ticket 95160 arrives from ATLAS reporting large numbers of transfer errors starting mid-morning.
28th June 13:05 ATLAS experiment rep (Alastair) reports failing ATLAS jobs to Rob, starting at around 12:20.
28th June 14:23 Internal functional Nagios test starts failing and logs a ticket.
28th June 15:00 Rob attempts the documented fix for stuck subrequests in the TM (ORA-01403/ORA-06512) by repeatedly setting the status of the oldest subrequest to status 9 (FAILED_FINISHED) in the ATLAS stager DB (a rough sketch of this kind of intervention follows the timeline). Attempt halted after ~10 iterations due to unexpectedly early timestamps on some of the requests.
28th June 15:30 Outage declared for srm-atlas in the GOC DB, backdated to start at 14:30 and initially set to end at 18:00.

Rob informs Matt and Shaun of the situation, both of whom were attending a conference.

28th June 15:30-17:45 Staff (Primarily Rob, Eddy and Carmine) continue to investigate problem and attempt a variety of fixes to little effect. Checks/attempts included:
- Reboot of ATLAS stager machine (15:40).
- Simultaneous restart of the SRM daemons on all four ATLAS SRM machines.
- Complete restart of ATLAS-specific database instances (17:00).
- Checks for interference by recently upgraded Puppet clients and master.
28th June 17:45 Outage extended in GOC DB until 13:00 the next day (29th June).
28th June 21:00 Relevant staff (on-call & Castor teams) discuss and agree to hold a phone conference at 10:15 the following morning.
29th June 10:15 Meeting of on-call and CASTOR staff. Attendees: Rob, Ian, Carmine, Shaun. Staff agree that with no imminent fix in sight, downtime should be extended.
29th June 11:05 Outage extended in GOC DB until 14:00 on Monday (1st July).
29th June 23:00 Shaun deletes subrequests of type 'DISKCOPYSUBREQUEST' from the stager DB and restarts services. Instance comes up again and tests pass. Ad hoc CASTOR team meeting (Shaun, Rob, Matt) agrees to leave instance as-is until Sunday morning and then independently test again.
30th June 10:30 All three CASTOR team members independently test instance and agree to end the downtime. Rob contacts primary on-call to confirm (Ian).
30th June 11:36 Outage ended in GOC DB.
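
For context, the documented mitigation attempted at 15:00 on 28th June amounts to repeatedly marking the oldest stuck subrequest as FAILED_FINISHED (status 9) directly in the stager database. The Python/cx_Oracle sketch below is illustrative only: the connection details, the SUBREQUEST table and column names and the 'stuck' status code are assumptions rather than the actual RAL schema or procedure, and any such change would only ever be made by CASTOR experts against a backed-up database.

import cx_Oracle

STUCK_STATUS = 6      # placeholder 'stuck' status code, not the real CASTOR value
FAILED_FINISHED = 9   # status the documented fix sets (from this report)

# Placeholder credentials and DSN for illustration only.
conn = cx_Oracle.connect("stager", "password", "atlas-stager-db")
cur = conn.cursor()

# Find the oldest subrequest in the (assumed) stuck state.
cur.execute(
    "SELECT id, creationtime FROM subrequest "
    "WHERE status = :s ORDER BY creationtime",
    s=STUCK_STATUS)
row = cur.fetchone()

if row is not None:
    sub_id, created = row
    # The incident showed that unexpectedly old creation times are a warning
    # sign; a real intervention would stop here and escalate in that case.
    print("Oldest stuck subrequest: id=%s created=%s" % (sub_id, created))
    cur.execute(
        "UPDATE subrequest SET status = :new WHERE id = :sid",
        new=FAILED_FINISHED, sid=sub_id)
    conn.commit()

cur.close()
conn.close()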

Incident details

Following a call-out overnight Thursday/Friday 27/28 June, the Atlas stagerd process was restarted. Two regular members of Castor staff were out of the office on the Friday, and the one remaining member of staff was unaware of the developing problems during Friday morning.

Around lunchtime on Friday 28th June, Atlas reports made it clear there was a problem with the Atlas Castor instance. Initial thoughts were that the problem was in the Atlas SRM database, as a problem in this area had been seen earlier in the week. During the afternoon investigations continued by both the Castor staff present and the database team. The error messages found correlated with a known problem that was documented in the local Wiki. However, applying the documented fix failed to resolve the problem. Furthermore, this fix entailed updating old stuck subrequests, and a check showed that these were very old. At that point further use of this 'fix' was stopped.

Staff on site continued investigations throughout the afternoon. This included verifying updates applied by the 'puppet' configuration tool that had been updated in the days beforehand. However, no significant changes were found (and indeed the cause of the problem was subsequently traced to be elsewhere). It was not possible to get assistance from those members of the RAL Castor team who were away from RAL at this point.

During Friday evening one member of the Castor staff investigated the problem, but without success. It was agreed to review the situation the following morning. By the time of the review on Saturday morning there had been no further progress and it was agreed to extend the outage in the GOC DB until the Monday (1st July). On the Saturday evening another member of the Castor team (Shaun) was able to investigate more thoroughly and resolved the problem by deleting all invalid sub-requests from the ATLAS stager database. At around 11pm that evening the Castor instance started working again. An ad-hoc meeting of the Castor team decided to leave the instance running with the GOC DB Outage in place overnight to confirm that all was working OK. The following morning the instance was checked and the GOC DB outage ended.
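
For illustration, the eventual fix amounted to bulk-deleting the invalid subrequests rather than failing them one at a time. The sketch below shows roughly what such a clean-up could look like; the connection details and the 'reqtype' column name are assumptions (only the subrequest type 'DISKCOPYSUBREQUEST' comes from the timeline), and the actual intervention was performed directly by a CASTOR expert against the stager schema.

import cx_Oracle

# Placeholder credentials and DSN for illustration only.
conn = cx_Oracle.connect("stager", "password", "atlas-stager-db")
cur = conn.cursor()

# Count the offending rows first so the scale of the clean-up is visible.
cur.execute("SELECT COUNT(*) FROM subrequest WHERE reqtype = :t",
            t='DISKCOPYSUBREQUEST')
count, = cur.fetchone()
print("Invalid subrequests to remove: %d" % count)

# Remove them in one transaction; in practice this was done with services
# stopped and the database backed up beforehand.
cur.execute("DELETE FROM subrequest WHERE reqtype = :t",
            t='DISKCOPYSUBREQUEST')
conn.commit()

cur.close()
conn.close()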

Analysis

Although there was an underlying problem, the staff present on Friday morning were not aware of it. The problem that triggered the overnight call-out was not followed up during Friday morning. During that day the two more experienced members of the Castor Team were out of the office. An earlier follow-up might have identified the seriousness of the problem sooner and given more opportunity to understand it and find a fix during the working day on Friday.

The problem was finally resolved when the one member of staff with the required expertise was able to investigate thoroughly on the Saturday evening. The problem was intricate, with two separate problems compounded. It is noted that had local RAL staff not been able to fix the problem, the Atlas Castor instance would have stayed down until assistance could be obtained from experts at CERN, probably on the Monday. This highlights a strategic problem of having only a small number of staff with deep Castor experience.

Subsequent checking with the developers at CERN showed that the second problem was known and was the subject of a Castor hotfix. Castor hotfixes were not regularly applied at RAL. The cause of the first problem is not understood. Similar problems have, however, occasionally been encountered before, and in those cases a procedure was documented. However, the specifics of that procedure did not work in this case.

The known problem had been identified (by RAL at the end of 2012) and was fixed in the 2_1_12_4_3 hotfix. A later Castor hotfix (2_1_12_4_5) had been applied, although this is not an official hotfix. However, hotfixes 2_1_12_4_[1-3], which are official hotfixes, were not applied. Furthermore, the RAL Castor team was unaware that hotfixes are NOT cumulative (i.e. 2_1_12_1_2 would not include the fix for 2_1_12_1_1). These fixes are included in Castor version 2.1.13, to which an upgrade was planned (and carried out, on 10th July) shortly after this incident.

Follow Up

Issue: The bug that triggered the second problem was fixed in a hotfix that had not been applied at RAL; practice at the time was not to apply hotfixes.
Response: Formalise the policy for the application of Castor hotfixes.
Done: No

Issue: Lack of staff expertise available at the time to quickly resolve the problem.
Response: Review Castor Team complement/skills.
Done: No

Related issues

As noted in the analysis, the underlying problem had been seen before at RAL.

Reported by: Shaun de Witt (2 July 2013)

Summary Table

Start Date 28 June 2013
Impact >80%
Duration of Outage 45 hours (~2 days)
Status Open
Root Cause Software Bug
Data Loss No