RAL Tier1 Incident 20111031 Castor ATLAS Outage


RAL Tier1 Incident 29th October 2011: Castor Atlas Outage Caused by Bad Execution Plans

Description:

The Atlas Castor instance stopped working at 21:37 local time on Saturday 29th October. The CASTOR and Database on-call teams were able to get the instance working again by midday on Sunday. The instance experienced further problems at 23:00 on the same day (Sunday), which were finally fixed by applying a new hint to the database, provided by CERN at 14:15 on Monday 31 October.

Impact

An outage on the Atlas Castor instance was declared, backdated to start at 19:50 on 29 October, and ended at 15:26 on 31 October. The total outage time was 1 day 18 hours 36 minutes, split over two separate outages.

Timeline of the Incident

Saturday 29 October 21:37. All ATLAS transfers to RAL start failing.
21:57. GGUS Alarm ticket sent to RAL saying that transfers are failing. Primary on-call and CASTOR on-call (Matthew Viljoen) are called but are unable to fix the problem. The Castor on-call recommends that the FTS channels be throttled back.
22:58. FTS channels are throttled back to 25%.
Sunday 30 October 02:00 Note change of local clocks from British Summer Time to Greenwich Mean Time (=UTC).
05:24. CASTOR LSF jobs are still failing. Both atlas_pilot and atlas_prod batch limits are reduced from 3500 to 500.
08:30. Castor on-call contacted Chris Kruk and Shaun de Witt to discuss the problem. SdW responded, checked the disk servers, and saw a large number of connections on the disk servers in CLOSE_WAIT status connected to the SRMs. The SRM back-end daemon was rebalanced to try to distribute the load.
09:00. No improvement. SdW called MV and suggested that the database on-call be contacted.
09:30. Keir logs in and looks at the Neptune database. He confirms that poor performance is seen on the Atlas stager, caused by two separate queries: one an update statement, the other a delete.
10:00. Keir sees that the delete statement is taking by far the most resources and focuses on this. He confirms that it is using a poor execution plan that takes too many resources and slows the system down. The problem is that the statistics on the table are stale, causing Oracle to choose a poor execution plan. The fix is to gather statistics on the castor_stager_atlas.newrequests table and flush the shared pool on that node (a sketch of these commands is given under Incident details below). The performance difference is seen immediately and the node comes back down to normal running levels.

Keir then looks at the other query, which is an update statement. This again looks to take more resources than needed, but it runs spasmodically and from the DB perspective does not appear to interfere with the running of the system. Keir investigates further and sees that it involves hot blocks, i.e. multiple sessions are trying to hit exactly the same filename in castor_stager_atlas.castorfile.

12:00. Keir sent an email regarding the two queries which seemed to be running sub-optimally. On analysis, one of these was from the CASTOR putStart procedure and one was from the CASTOR requestToDo procedure. Shaun called Keir and then killed all running and pending LSF jobs. This, combined with Keir's changes, seemed to improve things.
Monday 31 October 03:18. Primary on-call updates the GGUS ticket noting another callout and reporting a further reduction in the number of jobs Atlas may run on the RAL batch farm. Also notes that the site will not be put into downtime again, so that the Castor team can investigate while the system is under load.
08:37. Atlas update the GGUS ticket to say they have set the UK cloud to "brokeroff".
09:00. Investigations point to very poor database performance on PUTs, leading to time-outs in LSF; GETs are OK. The database team investigate and confirm that a major update query is doing a full table scan when it should not be (the general technique for checking this is sketched after this timeline). Attempts are made to force Oracle to generate a better execution plan, but these are unsuccessful.
12:00. Database team contact CERN to see if they're able to provide a solution to force Oracle into using a better execution plan.
14:20. Nilo replies with several options. The easiest is a simple procedural code update adding a lengthy hint to the query. This is implemented and the query runs much more efficiently. The database team inform Castor operations that a fix is in place and invite them to test it within CASTOR.
15:26. Internal testing by the CASTOR team is successful; the public interface and SRM daemons are started and the outage is ended.
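
The full table scan identified at 09:00 on the Monday would have been visible in the statement's execution plan. The following SQL*Plus sketch shows the general technique for inspecting the plan of a statement already in the cursor cache; the V$SQL filter and the &sql_id substitution variable are illustrative assumptions, and the real CASTOR SQL text is not reproduced in this report.

 -- Locate candidate statements in the cursor cache (the LIKE filter below is
 -- an illustrative assumption, not the actual CASTOR statement text).
 SELECT sql_id, plan_hash_value, executions, elapsed_time
   FROM v$sql
  WHERE UPPER(sql_text) LIKE '%CASTORFILE%';

 -- Display the execution plan for a chosen cursor; a "TABLE ACCESS FULL" step
 -- against a large table is the signature of the problem seen in this incident.
 SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('&sql_id', NULL, 'TYPICAL'));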

Incident details

The Atlas Castor instance stopped working at 21:37 local time on Saturday 29th October. At first the problem was thought to be load related and action was taken to reduce the load, initially by reducing the number of FTS channels and then, during the night, by reducing the number of Atlas batch jobs running at the RAL Tier1. However, these actions were not successful and the problem was only resolved on the Sunday morning, when the database load issues were understood and database performance was improved by gathering statistics and flushing some internal database memory (the shared pool).
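
The statistics-gathering and shared pool flush described above are standard Oracle administration steps. A minimal SQL*Plus sketch of that kind of intervention, assuming DBA privileges on the stager database (the exact commands run on the night are not recorded in this report):

 -- Check whether the optimizer statistics on the request table are stale.
 SELECT owner, table_name, last_analyzed, stale_stats
   FROM dba_tab_statistics
  WHERE owner = 'CASTOR_STAGER_ATLAS'
    AND table_name = 'NEWREQUESTS';

 -- Re-gather statistics so the optimizer sees current row counts and histograms.
 EXEC DBMS_STATS.GATHER_TABLE_STATS('CASTOR_STAGER_ATLAS', 'NEWREQUESTS');

 -- Flush the shared pool so the cached (bad) plan is discarded and the
 -- statement is re-parsed against the fresh statistics.
 ALTER SYSTEM FLUSH SHARED_POOL;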

The Atlas Castor instance experienced further problems at 23:00 that same day (Sunday), which persisted through the night. Investigations continued the following (Monday) morning as staff arrived at work. The issues were understood to be caused by a poor execution plan being used by Oracle, but attempts during the morning to get the database to use a better plan were unsuccessful. At the end of the morning a request for assistance was made to CERN; the reply, containing a new database 'hint', was received at around 14:15 and enabled the database performance to be fully restored and the outage to be ended that afternoon.
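
The actual hint supplied by CERN is not reproduced here. By way of illustration only, such a fix embeds an optimizer hint directly in the SQL text of the procedural code, along the following hypothetical lines; the index name, column name and bind variables below are placeholders, not the real CASTOR hint.

 -- Hypothetical hinted UPDATE inside the stager PL/SQL; the hint forces an
 -- index access path instead of the full table scan the optimizer had chosen.
 -- I_CASTORFILE_ID, some_column, :new_value and :file_id are placeholders.
 UPDATE /*+ INDEX(cf I_CASTORFILE_ID) */ castor_stager_atlas.castorfile cf
    SET cf.some_column = :new_value
  WHERE cf.id = :file_id;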

Analysis

The root cause of the performance problem was the adoption of a bad execution plan by the Oracle database. Initial attempts to work around the problem by reducing load were unsuccessful. However, once the problem had been traced to a bad execution plan there was difficulty in forcing Oracle to adopt a better one until an appropriate database hint was received from CERN.

Follow Up

Issue: Understand the root cause. Response: The root cause of the problem was a bad execution plan being adopted by the Oracle database. Oracle automatically modifies these execution plans with the aim of optimising database performance; in this case, however, the plan adopted was very poor. A future move to Oracle version 11g will allow better control over freezing execution plans, and/or triggering an action when an execution plan is changed (see the sketch below). Done: Yes
Issue: During this problem the database server was showing very high load but this did not trigger an alarm. Response: Check the monitoring of the database servers to flag load issues. Done: No
Issue: Some delays in dealing with the issue. Response: We note the additional difficulty caused by the timing of this problem, with the first call coming late on a Saturday evening. However, there were some communication difficulties within the team. These would be improved by ensuring actions are tracked within a single ticket and that the handover to day-time staff brings all relevant people up to date with events quickly. Done: Yes
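
For the first follow-up item, the "freezing" of execution plans referred to above corresponds to SQL Plan Management in Oracle 11g. A sketch of the general approach, assuming the problem statement's SQL_ID and known-good PLAN_HASH_VALUE have already been identified (both are supplied here as SQL*Plus substitution variables):

 -- Capture the known-good plan from the cursor cache as an accepted SQL plan
 -- baseline, so the optimizer will not silently switch to a different plan.
 DECLARE
   loaded PLS_INTEGER;
 BEGIN
   loaded := DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE(
               sql_id          => '&sql_id',
               plan_hash_value => &plan_hash_value);
 END;
 /

 -- Review the captured baselines (enabled/accepted flags) at any time.
 SELECT sql_handle, plan_name, enabled, accepted
   FROM dba_sql_plan_baselines;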

Summary Table

Start Date: 29 October 2011
Impact: >80%
Duration of Outage: 1 day 18 hours 36 minutes
Status: OPEN
Root Cause: Software Bug
Data Loss: No