RAL Tier1 Incident 20081018 Castor Atlas outage


Site: RAL-LCG2

Incident Date: 18/10/08

Severity: Field not defined yet

Service: CASTOR

Impacted: ATLAS

Incident Summary: High network and CPU load, probably caused by cache fusion on the Oracle back end, degraded the response of the SRM front ends. Owing to the high ATLAS load at the time, the front ends were unable to keep up. An incorrect initial diagnosis of the problem caused further complications when the network configuration was wrongly modified. We are unable to explain why cache fusion was triggered in the first place, but it was presumably caused by the increased transaction rate as ATLAS increased the number of transfers. A manual recompute of the Oracle statistics resolved the problem.
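
As an illustration of the fix described above, the following is a minimal sketch of a manual statistics recompute against the SRM schema using the cx_Oracle client. The connection string and the schema name SRM_ATLAS are placeholders, not the actual Tier-1 values.

 # Minimal sketch of a manual recompute of the Oracle optimiser statistics.
 # The DSN and schema name below are placeholders, not the Tier-1 values.
 import cx_Oracle
 
 conn = cx_Oracle.connect("srm_admin/password@castor-srm-db")  # placeholder DSN
 cur = conn.cursor()
 
 # Regather statistics for the whole SRM schema so that the cost-based
 # optimiser stops choosing plans based on the stale statistics.
 cur.callproc("DBMS_STATS.GATHER_SCHEMA_STATS", ["SRM_ATLAS"])
 
 conn.close()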

Type of Impact: Downtime

Incident duration: 55 hours (approx)

Report date: 24/10/08

Reported by: Andrew Sansum

Related URLs: https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=36079 (internal access only)

Incident details:

On 18/10/2008 at 02:40 High network rate on CASTOR SRM database RAC private network. Subsequently attributed to cache fusion

On 18/10/2008 at 03:08 First SAM test failure on srm-atlas

On 18/10/2008 at 04:10 Pager call to on-call

On 18/10/2008 at 04:30 Initial srm restart failed

On 18/10/2008 at 04:50 CASTOR on-call called

On 18/10/2008 at 06:37 CASTOR on-call proposes a possible hardware fault (or system configuration error) on one SRM front end (srm0384), owing to unbalanced load between the two front ends and an asymmetric count of established connections; srm0385 apparently OK.
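
The asymmetry was judged from the number of established client connections on each front end. The following is a rough sketch of that comparison; the hostnames are those in this report, while the SRM port and the use of ssh/netstat are assumptions about the tooling, not a record of what on-call actually ran.

 # Compare the number of established SRM connections on the two front ends.
 # Hostnames are from the report; the port and remote-command method are assumed.
 import subprocess
 
 FRONT_ENDS = ["srm0384", "srm0385"]
 SRM_PORT = ":8443"   # assumed SRM port
 
 def established_connections(host):
     """Count ESTABLISHED TCP connections to the SRM port on one front end."""
     out = subprocess.check_output(["ssh", host, "netstat", "-tan"]).decode()
     return sum(1 for line in out.splitlines()
                if "ESTABLISHED" in line and SRM_PORT in line)
 
 for fe in FRONT_ENDS:
     print(fe, established_connections(fe))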

On 18/10/2008 at 10:30 Proposal made to remove srm0384 from DNS round robin and continue with srm0385

On 18/10/2008 at 10:40 Call-out of the Tier-1 deputy "network expert" to discuss whether we can get site networking to update DNS; the conclusion is that this is not possible out of hours. Plan B is agreed: move the srm0384 IP address to srm0385 and terminate srm0384. Primary on-call proceeds to implement.

On 18/10/2008 at 11:40 Network expert signs back in and says that on reflection IP address stealing will not work because of certificate matching problems.
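
The certificate problem is that clients check the host certificate presented by the server against the hostname they contacted, so srm0385 answering on srm0384's IP address would present the wrong certificate. Below is a minimal illustration of that class of failure using standard TLS hostname verification rather than the grid (GSI) machinery; the FQDN and port are assumptions.

 # Illustration only: a client connecting to the name srm0384 rejects the
 # session if the server presents srm0385's host certificate.  The FQDN and
 # port are assumptions; real SRM clients use GSI/X.509, but the hostname
 # matching requirement is the same.
 import socket, ssl
 
 HOST = "srm0384.gridpp.rl.ac.uk"   # assumed FQDN
 PORT = 8443                        # assumed SRM port
 
 ctx = ssl.create_default_context()
 ctx.check_hostname = True          # enforce certificate/hostname matching
 
 try:
     with socket.create_connection((HOST, PORT)) as sock:
         with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
             print("certificate accepted:", tls.getpeercert()["subject"])
 except (ssl.CertificateError, ssl.SSLError) as exc:
     # The failure mode that made IP-address stealing unworkable.
     print("certificate rejected:", exc)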

On 18/10/2008 at 11:50 SRMs apparently returned to the standard configuration. High load on the database back end noted as possibly relevant to the problem: cdbc09 is pulling pages from castor151 at 60MB/s. No agreement within the team that a front-end hardware configuration error or fault has occurred.

On 18/10/2008 at 13:25 Database admin call-out issued - admin responds that he will have a look when he gets back

On 18/10/2008 at 14:53 apparently OK at our end but ATLAS transfers still failing

On 18/10/2008 at 20:09 Errors still continuing - transfers failing. CASTOR expert reports a network problem reaching the front-end SRM servers and is unable to carry out further diagnosis.

On 19/10/2008 at 08:03 Database admin reports that the back end is busy but he has made tuning changes and it seems better. There seem to be network connectivity problems with the front end that need to be resolved.

On 19/10/2008 19:10 Primary on-call gives up, concluding that we have a network fault that cannot be resolved until primary Tier-1 network expert is available.

On 20/10/2008 09:30 Incorrect host routing table (caused by IP address stealing effort) identified as cause of problems.
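
The report does not record exactly which routing entry was wrong, only that the host routing table had been left inconsistent by the aborted IP move. The following sketch shows the kind of check that would have spotted a leftover host-specific route on the front ends; the hostnames are from the report, while the use of ssh and "ip route" is an assumption about the tooling.

 # List any host-specific (/32) routes left behind on the SRM front ends.
 # Hostnames are from the report; ssh and "ip route" are assumed tooling.
 import subprocess
 
 for fe in ("srm0384", "srm0385"):
     out = subprocess.check_output(["ssh", fe, "ip", "route", "show"]).decode()
     for line in out.splitlines():
         if "/32" in line:
             print("%s has host route: %s" % (fe, line))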

On 20/10/2008 10:30 Service resumed

We have reviewed srm0384 and srm0385 and found no differences in hardware or OS configuration. We reviewed the request transfer load on both front ends during the period and noted that srm0384 was handling significantly fewer requests than srm0385 and had a high failure rate. We do not understand the difference between the way the two SRMs reacted to the high load on the back-end database. We have investigated what may have caused the back-end database server cdbc09 to suddenly start transferring data from another server (castor151) in the same Oracle RAC, and have been unable to reach a conclusion. It is likely that the change in access pattern and use of the SRM database eventually invalidated the statistics, which may have caused database inefficiency, but it is implausible that castor151 held any data relevant to the SRM database.
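
Cache fusion traffic of the kind seen between cdbc09 and castor151 shows up in the RAC-wide gv$sysstat view as "gc" block-transfer counters. The sketch below shows how those counters can be read per instance; the statistic names are the Oracle 10g ones and the connection string is a placeholder.

 # Read the cache fusion (global cache) block-transfer counters per RAC instance.
 # The DSN is a placeholder; statistic names are the Oracle 10g ones.
 import cx_Oracle
 
 conn = cx_Oracle.connect("monitor/password@castor-srm-db")  # placeholder DSN
 cur = conn.cursor()
 cur.execute("""
     SELECT inst_id, name, value
       FROM gv$sysstat
      WHERE name IN ('gc cr blocks received', 'gc current blocks received')
      ORDER BY inst_id, name
 """)
 for inst_id, name, value in cur:
     # Rapid growth of these counters on one instance corresponds to the
     # interconnect traffic observed on 18/10.
     print(inst_id, name, value)
 conn.close()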


Future mitigation:

- Do not plan to investigate the cache fusion problem further unless we see a recurrence, but we have added a high-CPU-load trigger to the Oracle monitoring in order to increase the likelihood of detecting the cause more quickly (a sketch of such a check follows this list).

- Have agreement in principle that our site network team will respond to out-of-hours calls to make emergency DNS changes. We still have to implement a process for doing this.

- Will not try IP address transfer again for certificated hosts!

- Need remote access to APC controllers/IPMI to allow remote power cycle
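
As referenced in the first mitigation point, the following is a minimal sketch of a high-CPU-load trigger; the real trigger lives in the Tier-1 Oracle monitoring, and the threshold and exit-code convention used here are only illustrative.

 # Illustrative high-load check; the threshold and exit-code convention are
 # assumptions, not the actual Tier-1 Oracle monitoring configuration.
 import os, sys
 
 LOAD_WARN = 8.0   # assumed 1-minute load-average threshold for the DB nodes
 
 def check_load():
     one_min, _, _ = os.getloadavg()
     if one_min > LOAD_WARN:
         print("WARNING: 1-minute load %.1f exceeds %.1f" % (one_min, LOAD_WARN))
         return 1   # non-zero exit so the monitoring framework raises an alarm
     print("OK: 1-minute load %.1f" % one_min)
     return 0
 
 if __name__ == "__main__":
     sys.exit(check_load())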

Related issues:

None

Timeline

Event | Date | Time | Comment
Actually Started | 18/10/08 | 02:40 | Cache fusion load started
Fault First Detected | 18/10/08 | 04:10 | Multiple successive SAM test failures on srm-atlas caused call-out of primary on-call
First Advisory Issued | 18/10/08 | 08:06 | Downtime announced in GOCDB
First Intervention | 18/10/08 | 04:40 | High error rate on srm0384; on-call tried restarting the SRM server but errors continued
Fault Fixed | 20/10/08 | 10:30 | After host networking problem resolved
Announced as Fixed | 20/10/08 | 10:25 | Downtime ended in GOCDB
Downtime(s) Logged in GOCDB | 18/10/08 - 20/10/08 | 07:06 UTC - 09:25 UTC | Unscheduled downtime on ATLAS instance
Other Advisories Issued | 18/10/08 | 08:17 | ATLAS UK computer operations list notified
Other Advisories Issued | 18/10/08 | 08:34 | GRIDPP-Users list notified
Other Advisories Issued | 20/10/08 | 10:26 | ATLAS UK computer operations list advised problem was resolved
Other Advisories Issued | 20/10/08 | 10:36 | GRIDPP-Users notified that the instance is back up

