RAL Tier1 Summary of Post Mortems
Revision as of 16:40, 22 July 2016
Summary of Post Mortems from the RAL Tier1.
This summary lists the Post Mortems generated by the RAL Tier1, gives the current state of each Post Mortem, and lists any outstanding mitigating actions that have arisen. The numbers after the action refer to the Tier1 internal (Footprints) incident tracking system.
Open Post Mortems
20100801 Disk Server Data Loss Atlas
Link: RAL Tier1 Incident 20100801 Disk Server Data Loss Atlas
Status: Open
- Modify procedures to include re-activating the RAID array as an additional method of trying to recover data. (227)
20101201 Power Outage
Link: RAL Tier1 Incident 20101201 Power Outage
Status: Open
- Provide UPS power for Castor disk servers
- FSCK check on ext3 file systems (224)
20101216 Broken Tape Data Loss Alice
Link: RAL Tier1 Incident 20101216 Broken Tape Data Loss Alice
Status: Open
- Obtain costings and timings for data recovery from broken or damaged tapes. (229)
20110202 Tape Data Loss LHCb
Link: RAL Tier1 Incident 20110202 Tape Data Loss LHCb
Status: Open
- Instigate regular check of tape library logs for this type of event. (229)
- Instigate regular check of Castor logs for tape access for this type of problem. (229)
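Regular log checks like the two actions above could be scripted. A minimal sketch, assuming an illustrative set of error phrases (the real tape-library and Castor log wording would need to be confirmed against actual logs):

```python
import re

# Illustrative error phrases only; the actual tape-library and Castor
# log messages for this class of fault would need to be confirmed.
ERROR_RE = re.compile(r"(media error|tape alert|write failure)", re.IGNORECASE)

def find_tape_errors(log_lines):
    """Return the log lines that match the known tape-error patterns."""
    return [line for line in log_lines if ERROR_RE.search(line)]
```

A cron job could run such a filter over the previous day's logs and raise a ticket whenever it returns any matches.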
20110202 Disk Server GDSS502 Data Loss T2K
Link: RAL Tier1 Incident 20110330 Disk Server GDSS502 Data Loss T2K
Status: Open
- Modify procedures such that any disk server showing a "failed stripes" error is quickly drained.
- Modify the Nagios tests used for disk servers with these RAID controllers to test for failed stripes.
- It is planned that disk servers will be powered by the UPS. This should be implemented.
- Modify the Disk Server deployment process to include a specific check that the battery backup for the RAID cache is enabled.
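The Nagios test proposed above could follow the standard Nagios plugin convention: exit 0 (OK) when the controller reports no failed stripes and exit 2 (CRITICAL) otherwise. The status text and the "failed stripes" phrase below are assumptions for illustration; the real controller CLI and its output wording would need to be confirmed.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_failed_stripes(status_text):
    """Map (assumed) RAID controller status text to a Nagios exit code."""
    if "failed stripes" in status_text.lower():
        print("CRITICAL: RAID controller reports failed stripes")
        return CRITICAL
    print("OK: no failed stripes reported")
    return OK

# A real plugin would obtain status_text from the controller's CLI tool
# and finish with sys.exit(check_failed_stripes(status_text)).
```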
20111031 Castor Atlas Outage Caused by Bad Execution Plans
Link: RAL Tier1 Incident 20111031 Castor ATLAS Outage
Status: Open
- Check monitoring of the database servers to flag load issues.
20111202 VO Software Server Turned Off in Error
Link: RAL Tier1 Incident 20111202 VO Software Server
Status: Open
- Implement a test for incorrectly managed systems where possible, for example by checking that errata are regularly applied.
20120316 Network Packet Storm
Link: RAL Tier1 Incident 20120316 Network Packet Storm
Status: Open
- Investigate storm suppression and the use of Spanning Tree on the Tier1 network.
- Change procedures so that test/development equipment is kept at the edges of the Tier1 network, and look at adding Spanning Tree to the test switches.
20120613 Oracle11 Update Failure
Link: RAL Tier1 Incident 20120613 Oracle11 Update Failure
Status: Open
- Investigate and implement an alternative method of connecting to the system, so that reconnection is possible in the event of a network break.
- Check, and enable where possible, logging for upgrade processes.
- Make a system available for the regular validation of database backups
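Regular backup validation could take several forms; Oracle's RMAN, for instance, provides RESTORE DATABASE VALIDATE to read backup pieces without actually restoring them. As a minimal, tool-agnostic sketch, a scheduled job could verify that backup files still match checksums recorded at backup time (the helper names here are illustrative, not an existing tool):

```python
import hashlib

def sha256sum(path):
    """Compute the SHA-256 checksum of a file, reading in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def validate_backup(path, expected_checksum):
    """Return True if the backup file still matches its recorded checksum."""
    return sha256sum(path) == expected_checksum
```

A checksum match confirms the backup file is intact on disk, but only a test restore (or RMAN-level validation) confirms it is actually usable.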
20121107 Site Wide Power Failure
Link: RAL Tier1 Incident 20121107 Site Wide Power Failure
Status: Open
20121120 UPS Over Voltage
Link: RAL Tier1 Incident 20121120 UPS Over Voltage
Status: Open
- Review systems to identify areas with too few staff having critical expertise.
- Review call-out system for the case where UPS power is lost (either momentarily or for a longer time).
20131120 Disk Server Failure File Loss
Link: RAL Tier1 Incident 20130219 Disk Server Failure File Loss
Status: Open
- Validate that the tests on disk failures for this batch of servers are correctly configured - both to detect disk failures and to call out as appropriate.
- Check the appropriateness and timeliness of the documentation on disk server states and spares. Make the awareness of this information part of the induction for new members of the Fabric Team.
- Validate the spares for this batch of disk servers. Ensure there are both spare disks and a spare working server.
20130626 Failure of RAL CVMFS Stratum1 Triggered Batch Farm Problems
Link: RAL Tier1 Incident 20130626 Failure of RAL CVMFS Stratum1 Triggered Batch Farm Problems
Status: Open
- Clarify working procedures when some staff involved in an incident are working remotely.
20130628 Atlas Castor Outage
Link: RAL Tier1 Incident 20130628 Atlas Castor Outage
Status: Open
- Establish a policy for the application of Castor hotfixes.
- Review Castor Team complement/skills.
20140118 FTS3 server migration and Outage
Link: RAL Tier1 Incident 20140218 FTS3 server migration and Outage
Status: Open
20150408 Network intervention preceding Castor upgrade
Link: RAL Tier1 Incident 20150408 network intervention preceding Castor upgrade
Status: Draft
Closed Post Mortems
20091004 Castor database disk failure
Link: RAL Tier1 Incident 20091004 Castor database disk failure
Status: Closed
20091009 Castor data loss
Link: RAL Tier1 Incident 20091009 Castor data loss
Status: Closed
20091130 RAID5 double disk failure
Link: RAL Tier1 Incident 20091130 RAID5 double disk failure
Status: Closed
20100129 Extended outage migrating Castor databases
Link: RAL Tier1 Incident 20100129 Extended outage migrating Castor databases
Status: Closed
20100212 Tape problems led to data loss
Link: RAL Tier1 Incident 20100212 Tape problems led to data loss
Status: Closed
20100515 Disk Server Outage
Link: RAL Tier1 Incident 20100515 Disk Server Outage
Status: Closed
20100630 Disk Server Data Loss CMS
Link: RAL Tier1 Incident 20100630 Disk Server Data Loss CMS
Status: Closed
20100720 Database Intervention
Link: RAL Tier1 Incident 20100720 Database intervention
Status: Closed
20100916 Second Failure of Disk Server - CMS data loss.
Link: RAL Tier1 Incident 20100916 Second Failure of Disk Server-CMS data loss
Status: Closed
20101026 LHCb SRM returns Bad TURL and Subsequently Suffers Outage
Link: RAL Tier1 Incident 20101026 LHCb SRM Bad TURL and Outage
Status: Closed
20101108 Atlas Disk Server GDSS398 Data Loss
Link: RAL Tier1 Incident 20101108 Atlas Disk Server GDSS398 Data Loss
Status: Closed
20101121 Atlas Disk Server GDSS391 Data Loss
Link: RAL_Tier1_Incident_20101121_Atlas_Disk_Server_GDSS391_Data_Loss
Status: Closed
20101225 CMS Disk Server GDSS283 Data Loss
Link: RAL Tier1 Incident 20101225 CMS Disk Server GDSS283 Data Loss
Status: Closed
20101207 Network Outage
Link: RAL Tier1 Incident 20101207 Network Outage
Status: Closed
20101231 PDU Problem
Link: RAL Tier1 Incident 20101231 PDU Problem
Status: Closed
20110106 CMS Disk Server GDSS496 Data Loss
Link: RAL Tier1 Incident 20110106 CMS Disk Server GDSS496 Data Loss
Status: Closed
20110510 LFC Outage After DB Update
Link: RAL Tier1 Incident 20110510 LFC Outage After DB Update
Status: Closed
20110927 Atlas Castor Outage DB Inconsistent
Link: RAL Tier1 Incident 20110927 Atlas Castor Outage DB Inconsistent
Status: Closed
20111022 Castor Outage with RAC Nodes Crashing
Link: RAL Tier1 Incident 20111022 Castor Outage RAC Nodes Crashing
Status: Closed
20111215 Network Break Followed by DNS Problems then Atlas SRM Database Problem
Link: RAL Tier1 Incident 20111215 Network BReak Atlas SRM DB
Status: Closed