RAL Tier1 Incident 20111202 VO Software Server

RAL Tier1 Incident 2nd December 2011: VO Software Server Turned Off in Error

Description:

The hardware used for the software server for the non-LHC VOs (gdss142) was believed to be unused, and decommissioning of the hardware was started. The system was taken down and erasure of its disk drives begun, ready for disposal. It was then realised that the system was still required; fortunately the disk erasing had not succeeded. However, the server was unavailable from Friday afternoon until lunchtime the following Monday.

Impact

The software server is used by a number of VOs. Some of these were running batch work at the time. The problem was announced on the Friday (2nd December) and no GGUS tickets were received, so the operational impact on the affected VOs is unknown.

Timeline of the Incident

When What
Before mid-2010: Disk server re-allocated as a software server for non-LHC VOs.
28th November 2011: Fabric Team checked with other teams whether the systems on the decommissioning list were still in use.
2nd December 2011 15:00 (approx): Fabric Team shut down the system and initiated disk erasing.
2nd December 2011 15:23: Alarm received from Nagios for system gdss142.
2nd December 2011 15:33: RT ticket updated with a summary of the situation, including the understanding that this system was a software server.
2nd December 2011 16:30 to 17:30: Attempts made to verify whether the system was OK (i.e. whether the disks had been erased or not). On discovering the disks had not been erased, attempts were made (unsuccessfully) to bring the system back up.
5th December 2011 14:43: System back up and available for use.

Incident details

As part of the decommissioning of a block of old disk servers, gdss142 was taken down by the Fabric Team during the afternoon of Friday 2nd December so that its disks could be erased. The erasure failed because the CD drive was faulty; some 5 to 10 other disk servers from this batch (Viglen 2006) were successfully decommissioned on the same day. At around 15:30 that day other staff recognised an operational problem and realised that gdss142 was a VO Software Server. The system was immediately turned back on, although it did not boot. A specific note was made that this disk server should not be decommissioned.

The failure of the CD drive was very fortunate: the disks were not erased and the system could be put back into production. However, rebooting the system was problematic until it was understood that the boot order had changed. The system was unavailable over the weekend.

Analysis

At some earlier time (early 2010 at the latest) disk server gdss142 was re-allocated from use within Castor to use as a software server for the non-LHC VOs. However, the disk server tracking database ("Overwatch") was not updated with this change.

As part of a subsequent migration of the "Overwatch" tracking database, this disk server, which was marked as "decommissioned", was dropped from the database.

As part of the disposal process all disks within decommissioned systems are erased. At the point of finally switching off and erasing the disks on a batch of servers, including gdss142, there was no entry in the "Overwatch" tracking database and no comment or ticket indicating that gdss142 was still in production as a software server. Reasonable attempts were made to check whether any of these disk servers (including gdss142) were still in use. Kashif circulated the list of remaining Viglen 2006 disk servers (for decommissioning) via RT ticket #88153 to all appropriate teams. Some other systems within this batch were correctly identified as still being in use and were not decommissioned.
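
As a purely illustrative sketch (not a procedure described in this report), a simple cross-check of the decommissioning candidates against hosts still registered as active in monitoring or configuration management could catch a case like gdss142 before any disks are touched; gdss142 was, after all, still raising Nagios alarms. The file names and plain-text host exports below are assumptions.

 # Hypothetical sketch: before erasing disks, cross-check the decommissioning
 # candidates against hosts still registered as active in monitoring
 # (e.g. an export of Nagios hosts). File names and formats are assumptions.
 
 def read_hosts(path):
     """Read one hostname per line, ignoring blank lines and '#' comments."""
     hosts = set()
     with open(path) as f:
         for line in f:
             name = line.strip()
             if name and not name.startswith("#"):
                 hosts.add(name)
     return hosts
 
 def still_in_use(decommission_candidates, active_hosts):
     """Return hosts that are listed for decommissioning but still active."""
     return sorted(decommission_candidates & active_hosts)
 
 if __name__ == "__main__":
     to_decommission = read_hosts("decommission_candidates.txt")  # assumed file
     active = read_hosts("nagios_active_hosts.txt")               # assumed file
     for host in still_in_use(to_decommission, active):
         print("DO NOT ERASE: %s is still registered as active" % host)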

On Friday 2nd December action was taken to decommission the server by booting from a CD and running disk-erasing software. However, a faulty CD reader on the system meant that this operation failed.

It was quickly realised that the system was a production server and attempts were made to reboot it immediately. However, that reboot failed and the server was down over the weekend. On Monday, staff investigated and found that the boot order had changed, possibly as a result of trying to boot from a faulty device (the CD reader), which may have caused the BIOS to re-configure the boot order.

The subsequent analysis of this system showed that, although its software was installed and configured using Quattor, Quattor had not been deploying updates (errata etc.) to it.
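
As an illustration of the kind of check proposed in the Follow Up section below, the sketch that follows (not part of the original report) flags a host with pending updates, which may indicate it is not being patched automatically. It assumes a Red Hat / Scientific Linux style host where "yum check-update" exits with status 100 when updates are available; the reporting format is hypothetical.

 #!/usr/bin/env python
 # Hypothetical sketch: report whether a host has pending package updates
 # (errata), which may indicate it is not being patched automatically.
 # Assumes a RHEL/Scientific Linux style host where "yum check-update"
 # exits with status 100 when updates are available, 0 when none, 1 on error.
 import subprocess
 import sys
 
 def pending_updates():
     result = subprocess.run(["yum", "-q", "check-update"],
                             capture_output=True, text=True)
     if result.returncode == 0:
         return []
     if result.returncode == 100:
         # Roughly one non-blank output line per pending package update.
         return [line for line in result.stdout.splitlines() if line.strip()]
     raise RuntimeError("yum check-update failed: " + result.stderr.strip())
 
 if __name__ == "__main__":
     updates = pending_updates()
     if updates:
         print("WARNING: %d pending updates - host may not be receiving errata"
               % len(updates))
         sys.exit(1)
     print("OK: no pending updates")
     sys.exit(0)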

Follow Up

Issue: Understand root cause.
Response: The root cause has been understood to be the failure to update the tracking database when the role of this server was changed. It is part of our procedures that this should be done.
Done: Yes

Issue: There was no process to flag up a system that was not configured to receive updates (errata). Although not a direct cause of this problem, such a check would have helped flag up an incorrectly managed system.
Response: Review, with the aim of implementing, a possible test for incorrectly managed systems, for example by checking that errata are regularly applied.
Done: No

Issue: Production server stopped and disks blanked without giving an opportunity to flush out any problems.
Response: The procedures should be changed such that systems are turned off for a length of time (two working weeks is proposed) before their disks are blanked. Furthermore, machines should not be switched off in this way towards the end of the day (especially not on a Friday).
Done: Yes

Reported by: Gareth Smith. 13th December 2011

Summary Table

Start Date: Friday 2nd December 2011
Impact: <20%
Duration of Outage: 3 days
Status: Open
Root Cause: Human Error
Data Loss: No