RAL Tier1 Incident 20111215 Network Break Atlas SRM DB

RAL Tier1 Incident 15th December 2011: Network Break Followed by DNS Problems then Atlas SRM Database Problem

Description:

During the late evening of Wednesday 14th December there were two breaks in the external connectivity to RAL. These were fixed promptly but resulted in very slow response from one of the DNS servers used by some of the Tier1 nodes. Despite efforts during the night it was not possible to resolve all of the resulting problems, notably on the Atlas SRM. The following morning this was investigated by the Database Team and an underlying ORACLE bug was found to be the problem. After significant investigation and failed attempts to work around the bug, it was concluded that the best solution was to migrate the Atlas SRM database to new hardware. This migration was part of a planned move already scheduled to take place shortly afterwards.

Impact

The two separate network breaks, each lasting around an hour, late in the evening of Wednesday 14th December, resulted in all RAL Tier1 services being unavailable for the duration of the breaks. All services recovered from these breaks with the exception of the Atlas SRM, which was unavailable for a total of 18 hours, from 21:15 on Wednesday 14th to 15:15 on Thursday 15th December.

Timeline of the Incident

When What
14/12/2011 21:18 First ticket (alarm) noting network problems (RT #91761). Link to RAL broken.
14/12/2011 21:42 DNS problems reported (RT #91764)
14/12/2011 22:00 (approx) Networking Emergency Number called to ensure their team is aware of problem. Confirmation received of staff attending site.
14/12/2011 22:05 (approx) Network connectivity restored by Networking Team.
14/12/2011 22:50 DB On-Call (Carmine) called due to ATLAS SRM problems. Confirms there are no performance problems on the ATLAS SRM database machine (cdbc13). A drop in load is noticed, assumed to be due to the network problems.
14/12/2011 23:03 Castor On-Call (Shaun) logs that all Castor instances are up but there are problems with the Atlas SRM. All SRMs except ATLAS are passing SAM tests. On ATLAS, worker node jobs are running fine, but transfers through the SRM are failing with a time-out message. Has asked Carmine (DB On-Call) to check the load on the SRM-ATLAS database.
14/12/2011 23:15 Start of second Network outage. (From checking network monitoring later).
14/12/2011 23:45 (approx) Primary On-Call & Production Manager have ensured the Networking team is aware of the second failure. The pager is repeatedly calling as alarms cannot be acknowledged. Agree to switch off the pager. Production Manager will check status later.
15/12/2011 00:15 End of second Network outage following fix by networking team who were again called in. (From checking network monitoring later).
15/12/2011 03:30 - 04:30 Following checks, the Production Manager can again view systems. However, there are problems with one of the DNS nodes ("Chilton"), which has a very slow response. Atlas is still failing SAM tests; CMS tests cleared during the checks. Re-contacted the Primary On-Call, who acknowledged the alarms, and the Castor On-Call to investigate.
15/12/2011 05:30 Castor On-Call has updated resolv.conf on the SRM nodes to remove the faulty DNS server but has not managed to resolve the issue with the Atlas SRMs. Agree with the Production Manager to involve the Database team in the morning.
15/12/2011 07:45 Rich picks up the issue when arriving at work. The ATLAS SRM database is accepting connections; however, little work is arriving as ATLAS is no longer submitting jobs.
15/12/2011 09:20 Rich notices that a SQL statement in the ATLAS SRM is consuming more resources than usual (SELECT castorFileName, fileSize, fileId, nsHost, id, status FROM UserFile WHERE id = :1;). Rich runs the SQL Tuning Advisor, which recommends assigning a SQL profile to the statement. SQL profile implemented (a sketch of this workflow is given after the timeline).
15/12/2011 09:30 DB Team receive alert from cdbc13 indicating extremely heavy load.
15/12/2011 09:30 As a precaution Rich drops the SQL profile created earlier.
15/12/2011 09:55 AWR reports show the top wait event on cdbc13 is “Resmgr:Cpu Quantum”. This is part of Resource Manager, used to throttle CPU given to sessions, and is not used on the CASTOR databases. Initial findings on Oracle Support suggest a bug (8221960) related to severe throttling of CPU resources for sessions (apparently fixed in the 10.2.0.5 patchset, which Neptune is already running). No one-off patch is available for 10.2.0.5.
15/12/2011 10:00 DB Team decide to shut down the instance on cdbc13 to try and remove the wait event.
15/12/2011 10:15 The instance is restarted; the “Resmgr:Cpu Quantum” wait event immediately occurs again and load on cdbc13 rises to extreme levels within minutes.
15/12/2011 10:20 Atlas batch work throttled back. (No new jobs to start.)
15/12/2011 10:40 Attempts are made to disable Resource Manager completely within the database (via a system parameter change; a sketch is given after the timeline). The instance is restarted again; however, the change has no effect: the Resource Manager wait event reappears as soon as connections are re-established and load continues to rise. Repeated attempts to manage the load via tuning and disabling of Resource Manager fail.
15/12/2011 10:55 Extended outage on srm-atlas to 16:00.
15/12/2011 11:20 Atlas FTS channels to/from RAL turned right down.
15/12/2011 11:30 Atlas SRMs stopped and running Atlas batch jobs paused.
15/12/2011 11:44 DB Team reboot cdbc13 as a final attempt to solve the problem. The instance is restarted and the problem remains.
15/12/2011 11:45 After discussions between the CASTOR and Database teams, the decision is taken to move the ATLAS SRM service to new hardware (due to have been put into production on 5th January).
15/12/2011 12:03 ATLAS SRM data exported from Neptune.
15/12/2011 12:40 ATLAS SRM successfully imported on Neptr26. CASTOR team change connection settings to point to new database.
15/12/2011 13:00 (approx) Shaun starts the ATLAS SRM service.
15/12/2011 13:15 Shaun confirms that ATLAS SRM is working. DB Team confirm there has been no re-occurrence of earlier bug.
15/12/2011 13:30 DB Team continue to tune queries which are taking more resources than necessary. SQL profiles applied to two queries; one query against the StageRequest table is still heavy.
15/12/2011 13:30 Start to un-pause Atlas batch work.
15/12/2011 14:45 Allow new Atlas batch jobs to start.
15/12/2011 15:15 End Outage for srm-atlas in GOC DB. Set GGUS ticket (#77470) as solved. Batch limits for Atlas returned to normal values.
15/12/2011 15:30 Shaun suggests removing entries from the StageRequest table which are stuck in status=4 and handler=1 (see the sketch after the timeline); these are believed to be left over from the DNS problems the previous night. Performance immediately improves.
15/12/2011 15:38 Atlas FTS channels to/from RAL set at 50% of nominal values. (But no transfers pending.)
15/12/2011 16:15 Atlas ramp up the FTS channels to/from the RAL Tier1 and transfer rates rise.
16/12/2011 09:16 Atlas FTS channels to/from RAL returned to 100% of nominal values.
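
For reference, the SQL tuning steps recorded at 09:20, 09:30 and 13:30 above would normally be carried out with the Oracle DBMS_SQLTUNE package, roughly as follows. This is a sketch only: the SQL_ID, task name and profile name are illustrative and are not taken from the incident records.

 -- Sketch of the SQL Tuning Advisor / SQL profile workflow (illustrative names).
 -- 1. Create and run a tuning task for the suspect statement, identified here
 --    by a hypothetical SQL_ID.
 DECLARE
   l_task VARCHAR2(64);
 BEGIN
   l_task := DBMS_SQLTUNE.CREATE_TUNING_TASK(
               sql_id    => 'abcd1234efgh5',
               task_name => 'srm_userfile_tune');
   DBMS_SQLTUNE.EXECUTE_TUNING_TASK(task_name => 'srm_userfile_tune');
 END;
 /
 -- 2. Review the recommendations (SET LONG 100000 in SQL*Plus to see the full report).
 SELECT DBMS_SQLTUNE.REPORT_TUNING_TASK('srm_userfile_tune') FROM dual;
 -- 3. Accept the recommended SQL profile (as at 09:20) ...
 DECLARE
   l_profile VARCHAR2(64);
 BEGIN
   l_profile := DBMS_SQLTUNE.ACCEPT_SQL_PROFILE(
                  task_name => 'srm_userfile_tune',
                  name      => 'srm_userfile_profile');
 END;
 /
 -- 4. ... and drop it again if it is suspected of causing trouble (as at 09:30).
 BEGIN
   DBMS_SQLTUNE.DROP_SQL_PROFILE(name => 'srm_userfile_profile');
 END;
 /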
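
The “Resmgr:Cpu Quantum” investigation (09:55) and the attempt to disable Resource Manager via a parameter change (10:40) would typically look something like the following. Again this is a sketch: the report does not record the exact queries or commands used.

 -- Which sessions are stuck waiting on Resource Manager CPU throttling?
 -- (gv$session rather than v$session shows all instances of a RAC.)
 SELECT inst_id, sid, username, event, seconds_in_wait
   FROM gv$session
  WHERE event = 'resmgr:cpu quantum';

 -- Clear the active resource plan on all instances, in memory and in the spfile.
 -- Note that scheduled maintenance windows can switch a plan back on, so this alone
 -- is not guaranteed to keep Resource Manager out of the picture.
 ALTER SYSTEM SET resource_manager_plan = '' SCOPE=BOTH SID='*';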
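
The clean-up suggested at 15:30 might look something like the following. The table and column names are taken directly from the wording above (StageRequest, status=4, handler=1); the real SRM schema and the meaning of these values would need to be checked before running anything similar.

 -- Count, then remove, SRM requests left stuck by the DNS break (illustrative only).
 SELECT COUNT(*) FROM StageRequest WHERE status = 4 AND handler = 1;
 DELETE FROM StageRequest WHERE status = 4 AND handler = 1;
 COMMIT;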

Incident details

On the evening of the 14th December there were two separate breaks in the RAL site networking, both of which disconnected RAL from the outside world. These in turn triggered problems for the DNS servers, one of which in particular then gave very slow responses to DNS queries. This appeared to result in problems for the ATLAS SRM interacting with the database (Neptune3). Database Services (Carmine) was called at around 22:40 and confirmed that database performance was normal. On the morning of the 15th problems were still being observed, resulting in very poor performance of the ATLAS SRM database node (cdbc13). Despite repeated attempts to bring the load down (via SQL tuning and parameter changes), it was eventually decided to move the ATLAS SRM service to the new database hardware (which had been scheduled as part of an intervention to take place in January 2012). This poor performance was due to an ORACLE component that we do not normally rely upon (the Resource Manager) rather than the software itself. Moving the ATLAS SRM data/service to the new database (Neptr26) resolved the performance problem that was being experienced; a sketch of how such a schema move might be carried out is given below.
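
The report does not record exactly how the ATLAS SRM data was exported from Neptune and imported on Neptr26. One plausible approach on Oracle 10.2 is a Data Pump schema export followed by an import on the new database, sketched below; the schema name, job name, dump file and directory object are illustrative assumptions, not values from the incident.

 -- Sketch of a Data Pump schema export (illustrative names throughout).
 DECLARE
   h     NUMBER;
   state VARCHAR2(30);
 BEGIN
   h := DBMS_DATAPUMP.OPEN(operation => 'EXPORT',
                           job_mode  => 'SCHEMA',
                           job_name  => 'SRM_ATLAS_EXPORT');
   DBMS_DATAPUMP.ADD_FILE(handle    => h,
                          filename  => 'srm_atlas.dmp',
                          directory => 'DATA_PUMP_DIR');
   DBMS_DATAPUMP.METADATA_FILTER(handle => h,
                                 name   => 'SCHEMA_EXPR',
                                 value  => q'[IN ('SRM_ATLAS')]');
   DBMS_DATAPUMP.START_JOB(h);
   DBMS_DATAPUMP.WAIT_FOR_JOB(h, state);
 END;
 /
 -- The matching import on the new database uses operation => 'IMPORT' with the dump
 -- file made visible via a directory object there, after which the CASTOR SRM
 -- configuration is repointed at the new database service.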

Analysis

There are multiple issues at play here. The initial DNS problems are clearly the cause of some disruption in the ATLAS SRM application and resulted in the problems experienced late on 14/12/2011 and into the morning of 15/12/2011. The DNS issue is also likely to have been the cause of the entries in the ATLAS SRM database which were not cleaned up and which contributed to sub-optimal performance (from the DB Team perspective) after the migration. The issue relating to database performance and the Resource Manager is more difficult to diagnose. The coincidence with the initial SQL tuning (and the profile being added) is suspicious, but this is a routine operation that has been carried out many times across multiple systems and has never had this kind of effect.

It may be noted that a number of separate issues were uncovered during this incident. Specifically, there was no way of acknowledging the alarms to prevent the repeated paging during the network breaks. As a result it was necessary to turn the pager off completely, which meant we could not have been alerted to other problems. Also, before this incident it was known that one of the DNS servers was prone to problems. It would have been preferable to remove this server from the DNS configuration of all critical services in advance. This had been done for many Tier1 systems (those that are configured via Quattor, and the Castor head nodes). However, it had not been done for some servers, including the SRMs and the database servers.

Follow Up

Issue Response Done
Root Cause The root cause is not understood, although it appears to be a bug in ORACLE which is listed as fixed in the version that we run. We cannot be certain that it will not reappear. However, moving the database schema to the new hardware fixed the problem, and we plan to move all other databases off the RAC that exhibited the problem as soon as possible. In the meantime we have changed our emergency procedures so that this move can be made in advance of the planned date should the problem recur. Yes
DNS Performance Issues after Network Break Firstly, all Tier1 systems will be configured not to use the problematic DNS server. Secondly, plans are underway for the problematic DNS server to be replaced. Yes
Repeat Pager Alarms A method to be able to acknowledge alarms if the primary RAL network is down should be investigated. (This has been implemented - September 2012) Yes


Reported by: Richard Sinclair, Matthew Viljoen, Gareth Smith on 21 December

Summary Table

Start Date 14 December 2011
Impact >80%
Duration of Outage 18 hours
Status Closed
Root Cause Software Bug
Data Loss No