RAL Tier1 Summary of Post Mortems


Summary of Post Mortems from the RAL Tier1.

This summary lists the Post Mortems generated by the RAL Tier1, gives the current state of each Post Mortem, and lists any outstanding mitigating actions that have arisen. The numbers in brackets after an action refer to the corresponding entry in the Tier1's internal (Footprints) incident tracking system.

Open Post Mortems

20100801 Disk Server Data Loss Atlas

Link: RAL Tier1 Incident 20100801 Disk Server Data Loss Atlas

Status: Open

  • Modify procedures to include re-activating the RAID array as an additional method of trying to recover data. (227)

20101201 Power Outage

Link: RAL Tier1 Incident 20101201 Power Outage

Status: Open

  • Provide UPS power for Castor disk servers.
  • Run FSCK checks on ext3 file systems. (224)

20101216 Broken Tape Data Loss Alice

Link: RAL Tier1 Incident 20101216 Broken Tape Data Loss Alice

Status: Open

  • Obtain costings and timings for data recovery from broken or damaged tapes. (229)

20110202 Tape Data Loss LHCb

Link: RAL Tier1 Incident 20110202 Tape Data Loss LHCb

Status: Open

  • Instigate regular check of tape library logs for this type of event. (229)
  • Instigate regular check of Castor logs for tape access for this type of problem. (229)

20110330 Disk Server GDSS502 Data Loss T2K

Link: RAL Tier1 Incident 20110330 Disk Server GDSS502 Data Loss T2K

Status: Open

  • Modify procedures such that any disk server showing a "failed stripes" error is quickly drained.
  • Modify the Nagios tests used for disk servers with these RAID controllers to test for failed stripes (a sketch of such a check is given after this list).
  • It is planned that disk servers will be powered by the UPS. This should be implemented.
  • Modify the Disk Server deployment process to include a specific check that the battery backup for the RAID cache is enabled.
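
A minimal sketch of what such a Nagios test could look like is given below, written in Python. The controller query command used here (raid_status) and its output format are placeholders rather than the actual Tier1 tooling, so this illustrates the idea rather than the check that was deployed:

#!/usr/bin/env python
# Hypothetical Nagios plugin: return CRITICAL if the RAID controller
# reports failed stripes. 'raid_status' is a placeholder for the
# site's real controller utility and its output format.
import subprocess
import sys

OK, CRITICAL, UNKNOWN = 0, 2, 3

def main():
    try:
        output = subprocess.check_output(
            ["raid_status"], stderr=subprocess.STDOUT).decode()
    except (OSError, subprocess.CalledProcessError) as err:
        print("UNKNOWN: could not query RAID controller: %s" % err)
        return UNKNOWN
    if "failed stripes" in output.lower():
        print("CRITICAL: RAID controller reports failed stripes")
        return CRITICAL
    print("OK: no failed stripes reported")
    return OK

if __name__ == "__main__":
    sys.exit(main())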

20111031 Castor Atlas Outage Caused by Bad Execution Plans

Link: RAL Tier1 Incident 20111031 Castor ATLAS Outage

Status: Open

  • Check monitoring of the database servers to flag load issues.

20111202 VO Software Server Turned Off in Error

Link: RAL Tier1 Incident 20111202 VO Software Server

Status: Open

  • Implement a test for incorrectly managed systems, for example by checking that errata are regularly applied (see the sketch below).
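
One possible form for such a test is sketched below, assuming the hosts use yum and that counting outstanding security errata is a reasonable proxy for whether a system is being kept patched. The command options, parsing and zero-tolerance threshold are illustrative, not a description of the Tier1's actual monitoring:

#!/usr/bin/env python
# Illustrative check: flag a host if it has outstanding security
# errata. Uses 'yum updateinfo list security'; the parsing and the
# any-erratum-is-a-warning policy are assumptions for this sketch.
import subprocess
import sys

def pending_security_errata():
    out = subprocess.check_output(
        ["yum", "-q", "updateinfo", "list", "security"]).decode()
    # Treat each non-empty output line as one outstanding erratum.
    return [line for line in out.splitlines() if line.strip()]

if __name__ == "__main__":
    errata = pending_security_errata()
    if errata:
        print("WARNING: %d security errata not applied" % len(errata))
        sys.exit(1)
    print("OK: no outstanding security errata")
    sys.exit(0)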

20120316 Network Packet Storm

Link: RAL Tier1 Incident 20120316 Network Packet Storm

Status: Open

  • Investigate storm suppression and the use of Spanning Tree on the Tier1 network.
  • Change procedures so that test/development work takes place at the edges of the Tier1 network, and look at adding Spanning Tree to the test switches.

20120613 Oracle11 Update Failure

Link: RAL Tier1 Incident 20120613 Oracle11 Update Failure

Status: Open

  • Investigate and implement an alternative method of connecting to the system that allows reconnection in the event of a network break.
  • Check, and enable where possible, logging for upgrade processes.
  • Make a system available for regular validation of database backups (one possible shape is sketched after this list).
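
One possible shape for such a validation system is sketched below, assuming the backups are Oracle RMAN backups and that RMAN's RESTORE DATABASE VALIDATE command (which reads the backup pieces without actually restoring them) is an acceptable regular check. The connect string, error detection and scheduling are illustrative only:

#!/usr/bin/env python
# Illustrative wrapper for regular (e.g. cron-driven) validation of
# database backups using RMAN's RESTORE DATABASE VALIDATE. The
# 'target /' connect string and the simple ORA- error scan are
# placeholders for the real database configuration and reporting.
import subprocess
import sys

RMAN_SCRIPT = b"RUN { RESTORE DATABASE VALIDATE; }"

def validate_backup():
    proc = subprocess.run(
        ["rman", "target", "/"],
        input=RMAN_SCRIPT,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT)
    log = proc.stdout.decode(errors="replace")
    ok = proc.returncode == 0 and "ORA-" not in log
    return ok, log

if __name__ == "__main__":
    ok, log = validate_backup()
    print(log)
    sys.exit(0 if ok else 1)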

20121107 Site Wide Power Failure

Link: RAL Tier1 Incident 20121107 Site Wide Power Failure

Status: Open

20121120 UPS Over Voltage

Link: RAL Tier1 Incident 20121120 UPS Over Voltage

Status: Open

  • Review systems to identify areas with too few staff having critical expertise.
  • Review call-out system for the case where UPS power is lost (either momentarily or for a longer time).

20130219 Disk Server Failure File Loss

Link: RAL Tier1 Incident 20130219 Disk Server Failure File Loss

Status: Open

  • Validate that the tests for disk failures on this batch of servers are correctly configured, both to detect disk failures and to call out as appropriate.
  • Check the appropriateness and timeliness of the documentation on disk server states and spares. Make awareness of this information part of the induction for new members of the Fabric Team.
  • Validate the spares for this batch of disk servers. Ensure there are both spare disks and a spare working server.

20130626 Failure of RAL CVMFS Stratum1 Triggered Batch Farm Problems

Link: RAL Tier1 Incident 20130626 Failure of RAL CVMFS Stratum1 Triggered Batch Farm Problems

Status: Open

  • Clarify working procedures when some staff involved in an incident are working remotely.

20130628 Atlas Castor Outage

Link: RAL Tier1 Incident 20130628 Atlas Castor Outage

Status: Open

  • Establish a policy for the application of Castor hotfixes.
  • Review Castor Team complement/skills.

20140218 FTS3 server migration and Outage

Link: RAL Tier1 Incident 20140218 FTS3 server migration and Outage

20150408 Network intervention preceding Castor upgrade

Link: RAL Tier1 Incident 20150408 network intervention preceding Castor upgrade

Status: Draft

20170818 Data Loss on Echo following crashing OSD

Link: RAL Tier1 Incident 20170818 first Echo data loss

Status: Draft

Closed Post Mortems

20091004 Castor database disk failure

Link: RAL Tier1 Incident 20091004 Castor database disk failure

Status: Closed

20091009 Castor data loss

Link: RAL Tier1 Incident 20091009 Castor data loss

Status: Closed

20091130 RAID5 double disk failure

Link: RAL Tier1 Incident 20091130 RAID5 double disk failure

Status: Closed

20100129 Extended outage migrating Castor databases

Link: RAL Tier1 Incident 20100129 Extended outage migrating Castor databases

Status: Closed

20100212 Tape problems led to data loss

Link: RAL Tier1 Incident 20100212 Tape problems led to data loss

Status: Closed

20100515 Disk Server Outage

Link: RAL Tier1 Incident 20100515 Disk Server Outage

Status: Closed

20100630 Disk Server Data Loss CMS

Link: RAL Tier1 Incident 20100630 Disk Server Data Loss CMS

Status: Closed

20100720 Database Intervention

Link: RAL Tier1 Incident 20100720 Database intervention

Status: Closed

20100916 Second Failure of Disk Server - CMS Data Loss

Link: RAL Tier1 Incident 20100916 Second Failure of Disk Server-CMS data loss

Status: Closed

20101026 LHCb SRM returns Bad TURL and Subsequently Suffers Outage

Link: RAL Tier1 Incident 20101026 LHCb SRM Bad TURL and Outage

Status: Closed

20101108 Atlas Disk Server GDSS398 Data Loss

Link: RAL Tier1 Incident 20101108 Atlas Disk Server GDSS398 Data Loss

Status: Closed

20101121 Atlas Disk Server GDSS391 Data Loss

Link: RAL Tier1 Incident 20101121 Atlas Disk Server GDSS391 Data Loss

Status: Closed

20101225 CMS Disk Server GDSS283 Data Loss

Link: RAL Tier1 Incident 20101225 CMS Disk Server GDSS283 Data Loss

Status: Closed

20101207 Network Outage

Link: RAL Tier1 Incident 20101207 Network Outage

Status: Closed

20101231 PDU Problem

Link: RAL Tier1 Incident 20101231 PDU Problem

Status: Closed

20110106 CMS Disk Server GDSS496 Data Loss

Link: RAL Tier1 Incident 20110106 CMS Disk Server GDSS496 Data Loss

Status: Closed

20110510 LFC Outage After DB Update

Link: RAL Tier1 Incident 20110510 LFC Outage After DB Update

Status: Closed

20110927 Atlas Castor Outage DB Inconsistent

Link: RAL Tier1 Incident 20110927 Atlas Castor Outage DB Inconsistent

Status: Closed

20111022 Castor Outage with RAC Nodes Crashing

Link: RAL Tier1 Incident 20111022 Castor Outage RAC Nodes Crashing

Status: Closed

20111215 Network Break Followed by DNS Problems then Atlas SRM Database Problem

Link: RAL Tier1 Incident 20111215 Network BReak Atlas SRM DB

Status: Closed