RAL Tier1 Incident 20090422
Site: RAL-LCG2
Incident Date: 2009-04-22
Severity: Field not defined yet
Service: Home File System
Impacted: Small number of users
Incident Summary: Part of the home file system (/home/csf) disappeared.
Type of Impact: Degraded
Incident duration: 24 hours (approx)
Report date: 2009-05-06
Reported by: Gareth Smith
Related URLs: None
Incident details:
Date | Time | Who/What | Entry |
---|---|---|---|
2009-04-22 | 18:35 | M.Bly | Picked up ticket above and discovered own home file system missing. Created ticket RT#46444. |
2009-04-22 | 20:17 | T.Adye | Sent in ticket (RT#46447) reporting loss of files from home file system. |
2009-04-23 | 9:20 | J.F.Wheeler | Restored home directory for userid bly from most recent full backup (completed 7:10, 20th April) and last incremental backup before problem (completed 4:50, 22nd April) |
2009-04-23 | 10:33 | G.Smith | Notification to gridpp_users |
2009-04-23 | 11:30 | J.F.Wheeler | Restore of /home/csf into directory /home/csf/RESTORED-230409 from full backup (completed 7:10, 20th April) and incremental backup (completed 4:50, 22nd April) |
2009-04-23 | 11:45 | J.F.Wheeler | Compared output of commands "ls /home/csf" and "ls /home/csf/RESTORED-230409" to determine which home directories had been completely removed. Restored directories for bfactory and jpk |
2009-04-23 | 15:40 | J.F.Wheeler | For all userids with direct access to the Tier1 (that is, those whose shell was not "/sbin/nologin"), ran command "ls -AR" on the original and restored home directories and saved the output into separate files (2 per userid). Used the "diff" command to compare the files for each userid and investigated the differences. Userids with significant differences were: adye, minosmc, nwest, olaiya, gtf. Removed all restored directories except these and informed the users individually. |
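The comparison steps in the 11:45 and 15:40 entries (listing both trees and diffing the listings) can be sketched in Python. This is a minimal illustration, not the procedure actually run during the incident, which used "ls" and "diff" directly. The function names `list_tree`, `missing_from_live` and `removed_home_dirs` are hypothetical and introduced here only for the example.

```python
import os


def list_tree(root):
    """Return all paths under root, relative to root.

    Includes hidden entries, roughly like "ls -AR". A root that does
    not exist yields an empty set.
    """
    entries = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            entries.add(os.path.relpath(os.path.join(dirpath, name), root))
    return entries


def missing_from_live(live_home, restored_home):
    """Paths present in the restored copy but absent from the live directory."""
    return sorted(list_tree(restored_home) - list_tree(live_home))


def removed_home_dirs(live_root, restored_root):
    """Top-level directories in the restore that no longer exist in the live tree."""
    return sorted(
        name
        for name in os.listdir(restored_root)
        if os.path.isdir(os.path.join(restored_root, name))
        and not os.path.exists(os.path.join(live_root, name))
    )


# Illustrative call, using the paths from the log entries above:
# removed_home_dirs("/home/csf", "/home/csf/RESTORED-230409")
```

A non-empty result from `missing_from_live` for a userid would correspond to the "significant differences" that were followed up by hand. Files that changed legitimately after the last incremental backup are not detected by this sketch, since it compares names only.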
Future mitigation:
The loss is believed to be a consequence of the installation procedure used for worker nodes (and other systems) in the Tier1.
Details here.....
A restore was carried out to a parallel directory structure and compared with the live directory tree. It was concluded that files had been removed for only a small number of users. This problem affected only those users who still have direct (non-grid) access to the RAL Tier1, which is restricted to a limited set of users.
Related issues:
None.
Timeline
Event | Date | Time | Comment |
---|---|---|---|
Actually Started | 2009-04-22 | Afternoon. | |
Fault first detected | 2009-04-22 | 17:09 | Ticket from user. |
First Advisory Issued | 2009-04-23 | 10:33 | E-mail to gridpp_users |
First Intervention | 2009-04-23 | 11:30 | Restore file system from backup into parallel area |
Fault Fixed | 2009-04-24 | ||
Announced as Fixed | 2009-04-24 | 17:30 | Specific mail sent to those users affected. Notification to gridpp_users. |
Downtime(s) Logged in GOCDB | None | | |
Other Advisories Issued | 2009-04-23 & 24 | 17:22 | E-mails to GridPP_users at 2009-04-23 10:33; 17:40; 2009-04-24 17:30. |