RAL Tier1 Incident 20090422

From GridPP Wiki

Site: RAL-LCG2

Incident Date: 2009-04-22

Severity: Field not defined yet

Service: Home File System

Impacted: Small number of users

Incident Summary: Part of the home file system (/home/csf) disappeared.

Type of Impact: Degraded

Incident duration: 24 hours (approx.)

Report date: 2009-05-06

Reported by: Gareth Smith

Related URLs: None

Incident details:


Date Time Who/What Entry
2009-04-22 18:35 M.Bly Picked up ticket above and discovered own home file system missing. Created ticket RT#46444.
2009-04-22 20:17 T.Adye Sent in ticket (RT#46447) reporting loss of files from home file system.
2009-04-22
2009-04-22
2009-04-22
2009-04-23 9:20 J.F.Wheeler Restored home directory for userid bly from most recent full backup (completed 7:10, 20th April) and last incremental backup before problem (completed 4:50, 22nd April)
2009-04-23 10:33 G.Smith Notification to gridpp_users
2009-04-23 11:30 J.F.Wheeler Restore of /home/csf into directory /home/csf/RESTORED-230409 from full backup (completed 7:10, 20th April) and incremental backup (completed 4:50, 22nd April)
2009-04-23 11:45 J.F.Wheeler Compared output of commands "ls /home/csf" and "ls /home/csf/RESTORED-230409" to determine which home directories had been completely removed. Restored directories for bfactory and jpk
2009-04-23 15:40 J.F.Wheeler For all userids who had direct access to Tier1 (that is, those whose shell was not "/sbin/nologin"), ran command "ls -AR" on original and restored home directories and saved output into separate files (2 per userid). Used "diff" command to compare files for each userid and investigated differences. Userids with significant differences were: adye, minosmc, nwest, olaiya, gtf. Removed all restored directories except these and informed the affected users individually.
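
The two comparison steps above (top-level "ls" against the restored copy, then a recursive "ls -AR" and "diff" per userid) can be sketched in Python as follows. This is a minimal illustration, not the procedure actually run at RAL: the function names and the example userids "alice" and "bob" are hypothetical, and it compares paths only, not file contents or timestamps.

```python
import os


def listing(root):
    """Return the relative paths of all files and directories under root.

    Hidden entries (dotfiles) are included, roughly like "ls -AR".
    """
    entries = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            entries.add(os.path.relpath(full, root))
    return entries


def missing_home_dirs(live_root, restored_root):
    """Top-level entries present in the restore but absent from the live tree."""
    live = set(os.listdir(live_root))
    return sorted(n for n in os.listdir(restored_root) if n not in live)


def missing_entries(live_home, restored_home):
    """Paths present in a restored home directory but absent from the live one."""
    return sorted(listing(restored_home) - listing(live_home))


if __name__ == "__main__":
    import tempfile

    # Hypothetical layout: "bob" was removed entirely, "alice" lost ~/.ssh.
    with tempfile.TemporaryDirectory() as tmp:
        live = os.path.join(tmp, "live")
        restored = os.path.join(tmp, "restored")
        os.makedirs(os.path.join(live, "alice"))
        os.makedirs(os.path.join(restored, "alice", ".ssh"))
        os.makedirs(os.path.join(restored, "bob"))
        print(missing_home_dirs(live, restored))
        print(missing_entries(os.path.join(live, "alice"),
                              os.path.join(restored, "alice")))
```

Comparing only in the direction "restored minus live" reports what was lost and ignores files created after the last backup, which would otherwise appear as noise in a plain two-way diff.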


Future mitigation:

The loss is believed to be a consequence of the installation procedure for worker nodes (and other systems) in the Tier1.

 Details here.....

A restore was carried out to a parallel directory structure, which was then compared with the live directory tree. It was concluded that files had been removed for only a small number of users. Note that the problem affected only those users who still have direct (non-grid) access to the RAL Tier1, which is restricted to a limited set of users.


Related issues:

None.

Timeline

Date Time Comment
Actually Started 2009-04-22 Afternoon.
Fault first detected 2009-04-22 17:09 Ticket from user.
First Advisory Issued 2009-04-23 10:33 E-mail to gridpp_users
First Intervention 2009-04-23 11:30 Restore file system from backup into parallel area
Fault Fixed 2009-04-24
Announced as Fixed 2009-04-24 17:30 Specific mail sent to those users affected. Notification to gridpp_users.
Downtime(s) Logged in GOCDB None.


Other Advisories Issued 2009-04-23 & 24 17:22 E-mails to gridpp_users at 2009-04-23 10:33; 17:40; 2009-04-24 17:30.