RAL Tier1 Incident 20081018 Castor Atlas outage
Site: RAL-LCG2
Incident Date: 18/10/08
Severity: Field not defined yet
Service: CASTOR
Impacted: ATLAS
Incident Summary: High network and CPU load, probably caused by cache fusion on the Oracle back end, degraded the response of the SRM front ends. Owing to the high ATLAS load at the time, these were unable to keep up. An incorrect initial diagnosis of the problem caused further complications, as the network configuration was wrongly modified. We are unable to explain why cache fusion was triggered in the first place, but it was presumably caused by the increased transaction rate as ATLAS increased its number of transfers. A manual recompute of the Oracle statistics resolved the problem.
Type of Impact: Downtime
Incident duration: 55 hours (approx)
Report date: 24/10/08
Reported by: Andrew Sansum
Related URLs: https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=36079 (internal access only)
Incident details:
On 18/10/2008 at 02:40 High network traffic rate seen on the CASTOR SRM database RAC private network; subsequently attributed to cache fusion.
On 18/10/2008 at 03:08 First SAM test failure on srm-atlas
On 18/10/2008 at 04:10 Pager call to on-call
On 18/10/2008 at 04:30 Initial srm restart failed
On 18/10/2008 at 04:50 CASTOR on-call called
On 18/10/2008 at 06:37 CASTOR on-call proposes a possible hardware fault (or system configuration error) on one SRM (srm0384), owing to unbalanced load between the two front ends and an asymmetric count of established connections. srm0385 apparently OK.
On 18/10/2008 at 10:30 Proposal made to remove srm0384 from DNS round robin and continue with srm0385
On 18/10/2008 at 10:40 Called out the deputy Tier-1 "network expert" to discuss whether we can get site networking to update the DNS; conclusion is that this is not possible out of hours. Plan B agreed: move srm0384's IP address to srm0385 and shut srm0384 down. Primary on-call proceeds to implement.
On 18/10/2008 at 11:40 Network expert signs back in and says that on reflection IP address stealing will not work because of certificate matching problems.
On 18/10/2008 at 11:50 SRMs apparently returned to the standard configuration. High load on the database back end noted as possibly relevant to the problem: cdbc09 is pulling pages from castor151 at 60MB/s. No agreement within the team that a front-end hardware fault or configuration error has occurred.
On 18/10/2008 at 13:25 Database admin call-out issued - admin responds that he will have a look when he gets back
On 18/10/2008 at 14:53 apparently OK at our end but ATLAS transfers still failing
On 18/10/2008 20:09 Errors still continuing - transfers failing. CASTOR expert reports a network problem with the front-end SRM servers and is unable to carry out further diagnosis.
On 19/10/2008 08:03 Database admin reports that the back end was busy but that he has made tuning changes and it seems better. There appear to be network connectivity problems with the front end that still need to be resolved.
On 19/10/2008 19:10 Primary on-call gives up, concluding that we have a network fault that cannot be resolved until the primary Tier-1 network expert is available.
On 20/10/2008 09:30 Incorrect host routing table (caused by IP address stealing effort) identified as cause of problems.
On 20/10/2008 10:30 Service resumed
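The certificate problem that killed the IP-takeover plan (noted at 11:40 on 18/10) can be sketched as follows. This is an illustrative Python fragment, not the actual SRM/GSI verification code: the domain suffix is assumed from the helpdesk URL, and real TLS/GSI checking also handles subjectAltName entries and wildcards. The point is only that a client which dialled srm0384's name, but reaches srm0385 after an IP move, receives srm0385's host certificate and rejects it.

```python
# Illustrative sketch (not the actual SRM/GSI code) of why moving
# srm0384's IP address to srm0385 cannot work for certificated hosts.

def hostname_matches_cert(requested_host: str, cert_cn: str) -> bool:
    """Minimal hostname check against a certificate Common Name.

    Real verification also inspects subjectAltName entries and wildcard
    patterns; this sketch covers only the exact-match case.
    """
    return requested_host.lower() == cert_cn.lower()

# Normal operation: the name the client asked for matches the certificate.
assert hostname_matches_cert("srm0384.gridpp.rl.ac.uk", "srm0384.gridpp.rl.ac.uk")

# After the IP takeover: the client still asked for srm0384, but the host
# now answering on that address presents srm0385's certificate.
assert not hostname_matches_cert("srm0384.gridpp.rl.ac.uk", "srm0385.gridpp.rl.ac.uk")
```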
We have reviewed srm0384 and srm0385 and found no differences in hardware or o/s configuration. We reviewed the request transfer load on both front ends during the period and noted that srm0384 was handling significantly fewer requests than srm0385 and had a high failure rate. We do not understand the difference between the way the two SRMs reacted to the high load on the back-end database. We have also investigated what may have caused the back-end database server cdbc09 to suddenly start transferring data from another server (castor151) in the same Oracle RAC, and have been unable to reach a conclusion. It is likely that the change in access pattern and use of the SRM database eventually invalidated the statistics, which may have caused database inefficiency, but it is implausible that castor151 held any data relevant to the SRM database.
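The routing fault identified on 20/10 is the classic longest-prefix-match trap: a stale host route left behind by the IP-takeover attempt is more specific than the normal subnet route, so the kernel silently prefers it. A minimal sketch of the route-selection logic, with addresses invented for the example:

```python
# Sketch of why a leftover host route breaks connectivity: the routing
# decision picks the most specific (longest-prefix) matching route, so a
# stale /32 entry added during the IP-takeover attempt wins over the
# normal subnet route. All addresses below are invented for illustration.
import ipaddress

def pick_route(dest: str, routes: dict) -> str:
    """Return the next hop of the longest-prefix route matching dest."""
    addr = ipaddress.ip_address(dest)
    matching = [(ipaddress.ip_network(net), via) for net, via in routes.items()
                if addr in ipaddress.ip_network(net)]
    # Longest prefix wins, as in the kernel's routing decision.
    return max(matching, key=lambda item: item[0].prefixlen)[1]

routes = {
    "130.246.0.0/16": "via site router",    # normal subnet route
    "130.246.1.84/32": "dev lo (stale)",    # leftover host route from takeover
}

# Traffic for the affected SRM address is swallowed by the stale route...
assert pick_route("130.246.1.84", routes) == "dev lo (stale)"
# ...while every other host on the subnet still routes normally.
assert pick_route("130.246.1.85", routes) == "via site router"
```

Because only the one stolen address misbehaves while everything else routes normally, this kind of fault is easy to misread as a hardware or network problem, as happened here.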
Future mitigation:
- We do not plan to investigate the cache fusion problem further unless we see a recurrence, but we have added a high-CPU-load trigger to the Oracle monitoring to increase the likelihood of identifying the cause more quickly next time.
- We have agreement in principle that our site network team will respond to out-of-hours calls to make emergency DNS changes; a process for doing this still has to be put in place.
- We will not attempt IP address transfer again for hosts with host certificates.
- We need remote access to the APC controllers/IPMI to allow remote power cycling.
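The high-CPU-load trigger added to the Oracle monitoring might look like the following sketch. The threshold, sample count, and function name are invented for illustration; the real check would read load figures from the database servers. Requiring several consecutive high samples filters out brief spikes while still catching a sustained episode like the cache-fusion load.

```python
# Sketch of a sustained-high-load trigger (threshold and sample count are
# invented; a production check would feed in measured load averages).

def should_alert(load_samples: list, threshold: float = 8.0,
                 consecutive: int = 3) -> bool:
    """Alert when the load average stays above threshold for
    `consecutive` successive samples, ignoring brief spikes."""
    if len(load_samples) < consecutive:
        return False
    return all(s > threshold for s in load_samples[-consecutive:])

# A brief spike does not page anyone...
assert not should_alert([2.1, 9.5, 3.0])
# ...but sustained high load, as seen during the incident, does.
assert should_alert([2.0, 9.1, 10.4, 12.8])
```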
Related issues:
None
Timeline
| Milestone | Date | Time | Comment |
|---|---|---|---|
| Actually Started | 18/10/08 | 02:40 | Cache fusion load started |
| Fault first detected | 18/10/08 | 04:10 | Multiple successive SAM test failures on srm-atlas caused call-out of primary on-call |
| First Advisory Issued | 18/10/08 | 08:06 | Downtime announced in GOCDB |
| First Intervention | 18/10/08 | 04:40 | High error rate on srm0384; on-call tried restarting the SRM server but errors continued |
| Fault Fixed | 20/10/08 | 10:30 | After host networking problem resolved |
| Announced as Fixed | 20/10/08 | 10:25 | Downtime ended in GOCDB |
| Downtime(s) Logged in GOCDB | 18/10/08 07:06 UTC | 20/10/08 09:25 UTC | Unscheduled downtime on ATLAS instance (start and end times) |
| Other Advisories Issued | 18/10/08 | 08:17 | ATLAS UK computer operations list notified |
| Other Advisories Issued | 18/10/08 | 08:34 | GRIDPP-Users list notified |
| Other Advisories Issued | 20/10/08 | 10:26 | ATLAS UK computer operations list advised that the problem was resolved |
| Other Advisories Issued | 20/10/08 | 10:36 | GRIDPP-Users notified that the instance is back up |