RAL Tier1 Incident 20080814 Castor unavailable following upgrade


Site: RAL-LCG2

Incident Date: 14 August 2008

Severity: <not defined yet>

Service: CASTOR

Impacted: Mainly ATLAS; CMS and LHCb were also affected at times.

Incident Summary: Following an upgrade to version 2.1.7, CASTOR became unusable for extended periods between 14th August and 27th August. The problem manifested itself as a unique constraint violation in the CASTOR request handler. No data loss or corruption occurred, but data could not be stored or retrieved. The exact cause is still unknown, but a plausible theory is that a change in usage introduced in CASTOR release 2.1.7, coupled with a reconfiguration of the Oracle service to a RAC, uncovered an Oracle bug that had not been experienced before. The problem was eventually resolved by a parameter change, suggested by CERN, to the Oracle configuration. RAL occasionally sees other Oracle errors which appear to be unrelated to this fault but impact the service for short periods.

Type of Impact: Down (either unscheduled down or badly degraded)

Incident duration: Approximately 14 days

Report date: 12 September 2008

Related URL:

http://www.gridpp.rl.ac.uk/blog/2008/08/20/problems-on-atlas-castor-instance/

http://atlasuk.blogspot.com/2008/08/ral-castor-down.html

http://atlasuk.blogspot.com/2008/08/ral-castor-restart-on-tuesday.html

http://atlasuk.blogspot.com/2008/08/castor-returns-but-at-risk.html

http://atlasuk.blogspot.com/2008/09/castor-stable-again.html

Incident details:

The incident was first detected by ATLAS and was also seen in OPS tests on 2008-08-14, following an upgrade to CASTOR 2.1.7 and migration to a new Oracle RAC configuration on 13th August. It manifested itself as a unique constraint violation in the CASTOR request handler. We analysed the database at this point, found that the violation came from one specific table, and discovered anomalously large values in that table. Since we were unsure whether other tables were affected, we took the decision to close the ATLAS instance while we looked into the impact on other tables, because propagating the corruption would have led to a worse situation. Analysis determined that the large values were only being inserted into one table (id2type). At the same time the database team investigated the Oracle redo logs and discovered that the unique constraint violations (which came from the same table) were being caused by attempts to insert a value of ‘00’ (zero-zero) into the id field (a number column in the table). It is unclear why this led to a constraint violation, since the id value 0 does not exist in the table. We surmised that the large values and the constraint violations were related, since all of the large numbers ended in three or more zeros. We reported this fault on the CASTOR external operations mailing list at this time.
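
For illustration only, the sketch below (Python with cx_Oracle) shows the kind of check involved in probing the id2type table: counting anomalously large id values and confirming that an id of 0 is not present. The connection string, the threshold and everything other than the id2type table and its numeric id column are assumptions, not the actual CASTOR schema or the scripts that were run.

    # Illustrative sketch only: probe the id2type table for the anomalously
    # large id values described above. Credentials, DSN and the threshold
    # are placeholders/assumptions.
    import cx_Oracle

    THRESHOLD = 10 ** 12  # arbitrary cut-off for "anomalously large"

    conn = cx_Oracle.connect("castor_ro/password@castor-rac")  # placeholder
    cur = conn.cursor()

    # Count rows whose id is suspiciously large; the observed bad values
    # all ended in three or more zeros, hence the MOD test.
    cur.execute(
        "SELECT COUNT(*) FROM id2type "
        "WHERE id > :threshold AND MOD(id, 1000) = 0",
        threshold=THRESHOLD,
    )
    print("suspect rows:", cur.fetchone()[0])

    # The constraint violations were raised for inserts of '00', so check
    # whether an id of 0 is actually present in the table.
    cur.execute("SELECT COUNT(*) FROM id2type WHERE id = 0")
    print("rows with id = 0:", cur.fetchone()[0])

    cur.close()
    conn.close()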

Once it was determined that the problem was localised, we performed additional tests to ensure the system was working correctly before reopening the instance to ATLAS. The system then ran fine for a few hours before the unique constraint violations recurred. During this time a number of staff undertook a review of the CASTOR code to investigate possible causes of the error, since at the time we believed it was a problem in 2.1.7-15 which had not been caught during testing. Once the error recurred we took the decision to take down the ATLAS instance again, since all requests were failing with this problem, and we requested additional effort from CERN. On further analysis it was found that the GEN and PreProduction instances had also been affected by the problem. At this point it became clear that analysis needed to be performed in situ, so we developed a tool to restart the request handler; restarting the request handler seemed to clear the problem temporarily, for periods ranging from a few hours to a few days. The tool ran overnight on all instances and allowed in-situ analysis of the problem. All experiments were made aware of this policy and it was agreed this was the best way forward.
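
The restart tool itself is not reproduced here; the sketch below is a minimal watchdog along the same lines, assuming a hypothetical request handler log path, the ORA-00001 ("unique constraint violated") error string and a service name, none of which are taken from the actual tool.

    # Minimal watchdog sketch, NOT the tool that was deployed. Log path,
    # error marker and restart command are illustrative assumptions.
    import subprocess
    import time

    LOG_FILE = "/var/spool/castor/rhserver.log"  # hypothetical path
    ERROR_MARK = "ORA-00001"                     # unique constraint violated
    CHECK_INTERVAL = 60                          # seconds between checks

    def log_shows_constraint_violation():
        """Return True if the recent log tail contains the Oracle error."""
        tail = subprocess.run(
            ["tail", "-n", "200", LOG_FILE],
            capture_output=True, text=True, check=False,
        )
        return ERROR_MARK in tail.stdout

    def restart_request_handler():
        """Restart the CASTOR request handler daemon (service name assumed)."""
        subprocess.run(["service", "rhserver", "restart"], check=False)

    while True:
        if log_shows_constraint_violation():
            restart_request_handler()
        time.sleep(CHECK_INTERVAL)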

The problem next occurred on LHCb, and CERN were able to perform some analysis during working hours. It was determined that the correct values were being passed into the Oracle layer, and at this point the investigation moved to the Oracle layer, and the RAC configuration in particular.

The next occurrence which we were able to investigate (on the GEN instance) showed that killing the Oracle sessions associated with the request handler, forcing new connections to be made, also cleared the problem. This again implicated the Oracle layer, although the 2.1.7 upgrade could not be eliminated as a possible source at this time.
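
For reference, the sketch below illustrates the general technique of killing the Oracle sessions belonging to a given client so that it is forced to reconnect. The v$session filter (matching on the PROGRAM column against an assumed 'rhserver' name) and the DBA credentials are assumptions; the actual intervention may have been performed differently.

    # Illustrative sketch: kill the Oracle sessions opened by the request
    # handler so that fresh connections are made. The PROGRAM filter and
    # the credentials are assumptions.
    import cx_Oracle

    conn = cx_Oracle.connect("sys/password@castor-rac", mode=cx_Oracle.SYSDBA)
    cur = conn.cursor()

    cur.execute(
        "SELECT sid, serial# FROM v$session WHERE program LIKE 'rhserver%'"
    )
    for sid, serial in cur.fetchall():
        # Terminate each session; the request handler then has to reconnect.
        cur.execute("ALTER SYSTEM KILL SESSION '%d,%d' IMMEDIATE" % (sid, serial))

    cur.close()
    conn.close()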

On 26th August, Nilo Segura, the database expert at CERN, suggested modifying a configuration parameter (_kks_use_mutex_pin=false). In the absence of any other clear path to progress the problem, this was applied by the database team at RAL to one of the two RACs (the one hosting LHCb and ATLAS) on 27th August. After this the problem was not seen again on that RAC. However, the problem did occur on the CMS instance at this time, and additional analysis showed that the CMS SRM database was affected by a similar problem. This last piece of evidence localised the problem to the Oracle layer, since the SRM had not been upgraded recently. We took the decision to apply the same configuration change to the second RAC, and this was done on 2nd September. Since then we have been monitoring the situation and have not seen a recurrence.
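
The parameter change was made by the database team on the RAC itself; the sketch below only illustrates how an underscore ("hidden") Oracle parameter such as _kks_use_mutex_pin is typically set cluster-wide, and should not be read as the exact command, scope or restart procedure that was used at RAL.

    # Illustrative only: setting a hidden Oracle parameter across a RAC.
    # Credentials and scope are assumptions, not the change as applied.
    import cx_Oracle

    conn = cx_Oracle.connect("sys/password@castor-rac", mode=cx_Oracle.SYSDBA)
    cur = conn.cursor()

    # Hidden parameters must be double-quoted. SCOPE=SPFILE means the new
    # value takes effect when the instances are restarted; SID='*' applies
    # it to all instances in the cluster.
    cur.execute(
        "ALTER SYSTEM SET \"_kks_use_mutex_pin\" = FALSE SCOPE=SPFILE SID='*'"
    )

    cur.close()
    conn.close()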

In summary, we still do not know the underlying cause of the problem, but it appears to be related to the reconfiguration of the databases to a RAC. A plausible theory is that a change in usage introduced in CASTOR 2.1.7 uncovered an Oracle bug. Based on current operational experience, the configuration parameter change appears to have resolved the problem.

Future mitigation:

Following the May CCRC08, an upgrade to CASTOR 2.1.7 was considered by RAL to be essential, both to ensure that we would receive continued support from CERN (which covers the current and previous release) during data taking and to satisfy the strong desire expressed by the experiments for a close match in versions. The upgrade to CASTOR 2.1.7 would address various performance problems we had experienced during the CCRC (although in general CASTOR had performed adequately) and was also expected to provide a working tape repack capability, which was becoming essential. Prior experience of upgrades to 2.1.4 and 2.1.6 was excellent, and the product had (eventually) passed certification and had been in operation at CERN for some time. We believe we correctly concluded that release 2.1.7 was ready to be deployed into production when we did.

A cut-off date of 11th August had been set at the Tier-1, beyond which we would avoid major upgrades (to any software component) prior to data taking. Meeting this deadline was a major driver for the CASTOR team, and the situation might have been much worse if the upgrade had been allowed to slip any further. The WLCG project should consider whether constraints should be put on the software release cycle to ensure that major upgrades only happen at safe points during the year.

A substantial period of testing (two months) of CASTOR 2.1.7 was carried out at RAL prior to release into production, and several bugs were identified over a period of weeks before a release was certified. This was the most systematically tested software product that we have ever released into production. Only a substantially enhanced testbed, more closely matching the full multi-VO production configuration with Oracle RACs, would have had any likelihood of catching this bug, and only then if the VOs were prepared to take part in extended periods of load testing. We do not have the staff effort available to carry out testing on this scale, and do not believe that even then it would eliminate all problems of this kind. We will make minor changes to the testing strategy based on our experience of other, more tractable bugs that emerged in 2.1.7 and could be caught by modest enhancements to certification.

Once we ran into problems we did not believe that we could roll back to 2.1.6, or that it was desirable to do so. We will investigate what can be done to allow rollback, and what constraints that would place on the service offered immediately after an upgrade. We would also have had no means of investigating the problem had we been able to roll back to 2.1.6.

Although we had the full CASTOR team available during the week of the upgrade and the following week, we were very short-staffed in the subsequent two weeks, which impaired our ability to handle and debug the extended problem. An upgrade at such a moment was clearly not ideal, but the alternatives appeared worse. Staff leave had been partly constrained, but had also been planned in the expectation that 2.1.7 would have been certified and in production much sooner. Support from CERN was very good once they appreciated the seriousness of the problem and were able to free up staff effort (they too had staff on leave).

The Oracle RAC upgrades were needed to improve resilience and availability and were considered highly desirable prior to data taking in order to minimise downtime and callouts. However, the close coincidence of the database upgrades with the CASTOR upgrade added further complication. Most of the initial debugging activity focused on the CASTOR upgrade, and vital time was lost before attention moved to the Oracle infrastructure. The database upgrade had been scheduled to happen several months previously, but had been delayed initially by a shortage of staff and subsequently by technical problems in the RAC configuration. Nevertheless, it is clear that co-scheduling multiple related changes carries additional risk. It may be that we should have cancelled the RAC upgrades as not vital and an unnecessary complication at a difficult time; however, our experience of upgrades of this kind had previously been very good. In future we should be wary of such multiple changes in high-risk areas, although this stance is in tension with the project's desire to minimise scheduled downtime by scheduling as much as possible during a single "super downtime".

Review of the progress towards certification and the evolving schedule was carried out on a weekly basis in liaison with the UK representatives of all four LHC experiments. The evolving plan is captured in the history log at: http://www.gridpp.ac.uk/wiki/RAL_Tier1_CASTOR_planning. It was also overseen by the weekly Tier-1 operations meeting, where the planned schedule of changes is reviewed. The GridPP Project Management Board (PMB) was also aware of the plans and the tight upgrade schedule. Although various bodies considered the schedule to be tight, they all believed that there was an adequate margin of time between it and the commencement of data taking.

Although staff at all levels (including, eventually, the GridPP project leader) became involved at various times once the fault was detected, no formal escalation process took place beyond the CASTOR team and the liaison meeting with the UK experiment contacts at the weekly CASTOR meeting. We will put in place a formal escalation process covering, in several stages, everything from a slight snag to a major disaster.

Once things started to go wrong, communication with the experiments and the project did not work as well as it should have. Communication regarding the initial problem was late and updates were not as frequent as required. However, early on the situation was confused, with uncertainty as to the degree and cause of the problem(s). RAL has recently appointed a "Production Manager" whose responsibilities include communication as incidents progress. We have highlighted this as a priority area for improvement and expect to see substantial improvement over the next two months.

In retrospect it is clear that the upgrade schedule became increasingly compressed against our own planned drop-dead date, staff leave and experiment data taking. This led to increased risk and to complications when debugging became necessary. It may be that we should have taken a strategic decision to open negotiations with CERN in early August to discuss the possibility of ongoing support for CASTOR 2.1.6 throughout the remainder of 2008; however, this was a highly undesirable option for the reasons discussed, and overall we still believe that the right decision was made to proceed with the upgrade as planned.

Related issues:

We are aware that the ASGC site has subsequently experienced similar problems after upgrading to CASTOR 2.1.7 on top of a similar Oracle RAC configuration.

Timeline

Date/Time Actually Started: 14/08/2008 10:00

Date/Time First Detected: A problem was detected on the 14th when SAM tests started failing, but it was unclear at that time exactly what the problem was, nor was it clear that the service was entirely unusable.

Detected by: NAGIOS

Date/Time First Announced: The first formal announcement was on 20th August, although informal discussion had been going on for several days with the UK ATLAS production team.

Downtimes Logged: Downtimes logged in GOCDB relating to this fault

16/08/08 07:30 to 18/08/08 11:00 ATLAS CASTOR at Risk

17/08/08 11:55 to 18/08/08 14:00 ATLAS/LHCB CASTOR unscheduled down

18/08/08 16:08 to 19/08/08 09:00 ATLAS/LHCB CASTOR unscheduled down

19/08/08 09:30 to 20/08/08 10:00 ATLAS CASTOR unscheduled down

20/08/08 10:40 to 27/08/08 13:00 ATLAS CASTOR unscheduled down

21/08/08 01:44 to 26/08/08 11:00 LHCB CASTOR unscheduled down

Date/Time of Other Advisories: See blog at: http://www.gridpp.rl.ac.uk/blog/2008/08/20/problems-on-atlas-castor-instance/

Date/Time of First Intervention: Hard to tell; various interventions were carried out as early as 14th August in an attempt to resolve various perceived problems.

Date/Time of Other Important Events/Decisions:

14/08/08 onward: Various related and unrelated problems impacted the service in the days just after the 13th August upgrade, and various interventions were carried out which may have resolved some issues, but the unique constraint violation remained masked by much other activity.

19/08/08 Identified that the problem relates to a unique constraint violation in Oracle.

20/08/08 Experiments/CASTOR team liaison meeting concluded that no resolution was likely soon and that a 7-day downtime should be set.

27/08/08 Concluded that database (or data) corruption is unlikely and that simply restarting the request handler clears the problem. Commenced a new policy of manual/automated restart of CASTOR when automation detects the problem, and opened the CASTOR instances.

27/08/08 Applied _kks_use_mutex_pin=false. The fault was effectively resolved from this point.

1/09/08 RAL revalidated as an ATLAS Tier-1


Date/Time problem was fixed: 27/08/08 (at least, a work-around was applied from this point)

Date/Time problem was declared fixed: ATLAS revalidated us on 1st September, although we have not yet formally declared the problem fixed.