RAL Tier1 Incident 20090915 Castor failures after Nameserver upgrade

Disk to Disk (D2D) Transfer Failures after Castor 2.1.8 Nameserver (NS) Upgrade

Site: RAL-LCG2

Incident Date: 2009-09-15

Severity: Tier 1 Disaster Management Process Not Triggered.

Service: Storage (Castor)

Impacted: All local VOs

Incident Summary: Disk to Disk (D2D) transfers started failing during a planned upgrade to the NS.

Type of Impact: Down

Incident duration: 44 hours

Report date: 2009-09-16

Reported by: Matthew Viljoen

Status of this Report: In preparation

Related URLs:

Incident Overview:

After a scheduled upgrade of the Castor NS from version 2.1.7-27 to version 2.1.8, testing showed that Disk to Disk (D2D) copies were failing across all instances. Any link with the NS upgrade itself was subsequently ruled out. Investigations carried out with the assistance of CASTOR developers at CERN revealed an LSF job scheduler problem that was causing D2D transfer jobs to fail. After LSF was restarted, both on the central servers and on all disk servers, the D2D transfer problem disappeared.
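
The report does not record the exact commands used for the LSF restart. Purely as a minimal sketch, assuming the script runs on the LSF master host, that the standard LSF administration commands (badmin, lsadmin) are on the PATH, and that the disk server names are placeholders, the restart could look something like this:

#!/usr/bin/env python3
"""Sketch of restarting the LSF daemons implicated in the D2D failures.

Illustrative only, not the procedure actually used at RAL: the disk
server names are placeholders and the exact LSF commands and options
may differ between LSF versions.
"""
import subprocess
import sys

DISK_SERVERS = ["gdss139", "gdss140"]  # placeholder disk server host names

def run(cmd):
    """Run an LSF administration command and stop if it fails."""
    print("+", " ".join(cmd))
    if subprocess.run(cmd).returncode != 0:
        sys.exit("command failed: " + " ".join(cmd))

# Restart the batch daemon (sbatchd), load information manager (LIM)
# and remote execution server (RES) on the master host itself...
run(["badmin", "hrestart"])
run(["lsadmin", "limrestart"])
run(["lsadmin", "resrestart"])

# ...and then on every disk server, so that all hosts rejoin the cluster
# with a clean daemon state before D2D transfer jobs are retried.
for host in DISK_SERVERS:
    run(["badmin", "hrestart", host])
    run(["lsadmin", "limrestart", host])
    run(["lsadmin", "resrestart", host])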

Why this problem started affecting services only after the NS upgrade is so far unknown, as it has not been possible to reproduce it. The problem was initially believed to be the result of an incorrect LSF setup on the disk servers, and an alternative configuration was deployed. However, this was later ruled out as a cause, since the disk servers were unmodified during the upgrade and their setup had not changed for some months prior to it, during which time they had been rebooted a number of times. Furthermore, since the incident the original LSF setup has been retested and found to work on non-production CASTOR instances.

We now believe that the problem was caused by an incorrect procedure for stopping CASTOR services and LSF prior to the NS upgrade and bringing them back up afterwards.

The D2D transfer problem had also been detected on the certification instance immediately prior to this upgrade during NS testing. The problem could have been localized and fixed at the time, but doing so would have delayed the upgrade. Since D2D transfers are unrelated to the NS, it was decided to proceed with the upgrade.

Future mitigation:

Issue: The certification instance was not in an optimal state prior to the upgrade.
Response: The certification instance must always be fully functional, and no future upgrade should proceed if anything is broken.

Issue: The shutdown and startup sequence for LSF and other services was neither standardized nor fully tested.
Response: A clearly defined startup and shutdown sequence for the LSF scheduler (both the central LSF master and the daemons on all disk servers) and the other CASTOR services needs to be written and tested, and should be used during future upgrades (an illustrative sketch follows below).
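
The report does not specify what that sequence should contain. Purely as an illustration of what a written, repeatable sequence might look like, the sketch below encodes the ordering as data so that the same list drives both shutdown and startup (startup being the exact reverse). The host names, init-script names and the use of "service ... stop/start" over SSH are assumptions for the sake of the example, not the actual CASTOR or LSF init scripts used at RAL.

#!/usr/bin/env python3
"""Illustrative sketch of a documented shutdown/startup sequence.

All host and service names are placeholders; they do not reproduce the
real RAL CASTOR configuration. The point is that the sequence lives in
one reviewed place and is executed the same way for every intervention.
"""
import subprocess

# (host, init script) pairs in SHUTDOWN order; startup is the reverse.
SEQUENCE = [
    ("castor-headnode", "castor-stager"),      # placeholder stager daemon
    ("castor-headnode", "castor-jobmanager"),  # placeholder job manager
    ("lsf-master",      "lsf"),                # central LSF master daemons
    ("gdss139",         "lsf"),                # LSF daemons on each disk server
    ("ns-headnode",     "castor-nameserver"),  # placeholder nameserver daemon
]

def service(host, name, action):
    """Run 'service <name> <action>' on <host> over SSH; fail loudly."""
    cmd = ["ssh", host, f"service {name} {action}"]
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def shutdown():
    for host, name in SEQUENCE:
        service(host, name, "stop")

def startup():
    for host, name in reversed(SEQUENCE):
        service(host, name, "start")

if __name__ == "__main__":
    shutdown()
    # ... the NS database and RPM upgrade would happen here ...
    startup()

Keeping the sequence in a single reviewed script would also make it straightforward to rehearse on a fully functional certification instance before a production upgrade, addressing both mitigation items above.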

Related issues: None.

Timeline

Milestone Date Time Comment
Actually Started 2009-09-15 09:00 Start of outage for NS upgrade
First Realisation of a problem 2009-09-15 09:30 Disk to disk copies start failing during testing
First announcement of Problem 2009-09-15 12:26 Extension of downtime is announced as the problem has not been resolved
Problem resolved 2009-09-16 17:45 All instances appear to be functioning correctly but further testing is needed
Announced as Fixed 2009-09-17 09:11 Unscheduled downtime finishes and problem announced as fixed

Incident details Timeline:

Date Time Who/What Entry
2009-09-15 09:01 Tier1 castor team Services brought down for planned outage
2009-09-15 09:10 Tier1 Database team NS database upgraded to 2.1.8-3
2009-09-15 09:15 Tier1 castor team NS RPMs upgraded to 2.1.8-3
2009-09-15 09:20 Tier1 castor team Finished NS upgrade and commenced testing on the Gen instance. Staging files in and out works, but D2D transfers between dteamTest and genTape fail with LSF error 127 (an exit code that typically indicates a command could not be found). The CMS instance is brought up and the same problems are found.
2009-09-15 10:30 Tier1 castor team The cause of the D2D transfer failures is not yet determined. In an attempt to roll back all changes, the NS software is rolled back to 2.1.7-27, but the problem remains on both Gen and CMS.
2009-09-15 11:05 Tier1 castor team Still no progress in tracking down the cause of the problem. It is agreed to declare an unscheduled downtime from 12:00 until 2009-09-16.
2009-09-15 13:30 Tier1 castor team Sebastien Ponce from CERN suggests that lsgrun is not being found, indicating that the problem is due to an incorrect LSF environment. .bash_profile on the disk servers is modified to source an LSF configuration file, but this does not fix the issue.
2009-09-15 16:30 Tier1 castor team The NS database schema is reverted to the previous version. Since this does not fix the issue either, any possible link between the NS upgrade and the continuing problems is ruled out.
2009-09-16 11:05 Tier1 castor team Downtime is extended to 2009-09-16 at 16:00
2009-09-16 14:00 Tier1 castor team D2D transfers are discovered to be consistently working on CMS disk server gdss139. The problem is identified as the LSF setup on the disk servers, which is causing the D2D transfer failures.
2009-09-16 14:30 Tier1 castor team An alternative LSF startup script is rolled out across all disk servers. LSF is restarted on the central server and on all disk servers, which fixes the problem.
2009-09-16 14:45 Tier1 castor team Having fixed the problem, other scheduled interventions are carried out (database kernel updates)
2009-09-16 14:55 Tier1 castor team NS is upgraded to 2.1.8-3 once again
2009-09-16 16:45 Tier1 castor team Database kernel upgrades are finished and testing resumes
2009-09-16 17:45 Tier1 castor team Initial tests show that all instances are working apart from Gen D2D transfers
2009-09-17 09:00 Tier1 castor team Further testing successful. Public ports are opened and SAM tests pass.
2009-09-17 09:11 Tier1 Production team Unscheduled downtime is ended and an At-Risk is created until 2009-09-17 at 12:00.
2009-09-17 09:40 Tier1 Production team Gareth Smith announces that CASTOR is back in production to the GRIDPP-USERS mailing list.