Difference between revisions of "RAL Tier1 Incident 20160600 Tape Library Software Crashes"

From GridPP Wiki

Revision as of 10:39, 22 June 2016

RAL-LCG2 Incident 20160600 Tape Library Software Crashes

Description:

The control software for the tape libraries (ACSLS) frequently crashes or hangs.

Impact

Describe the type of impact. Include which services and VOs were affected, how long they were impacted, and the dates. If there was data loss, ensure this is clearly flagged.


Timeline of the Incident

When What
17th February 2016 Upgrade all tape servers to 2.1.15-20
25th April 2016 Upgrade tape servers lcgcts{12,13} to 2.1.16-0 (7% of drives)
26th April 2016 Upgrade tape servers lcgcts{10,11,20,21,27} to 2.1.16-0 (31% of drives; lcgcts{20,21} have two drives attached)
3rd May Elevator in T1 tape library goes offline
9th May Engineer on site. No luck getting elevator working. Second elevator in T1 library dies during BMS outage.
10th May Engineer swaps third of the cards that had been delivered. No luck getting both elevators working
11th May Problem escalating with no power to LSMs in T1 library.
17th May 2016 Upgrade all remaining tape servers to 2.1.16-0
19th May 2016 All Facilities tape servers at 2.1.16-0
20th May 2016 Restarter cron added to all tape servers to cope with Castor mount "feature" and tape library problems.
25th May ACSLS swapped over to cftsh05 due to problems with Buxton.
7th June ACSLS swapped back to Buxton.
17th June Buxton1 (ACSLS 8.4) put into production.
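
The timeline does not say how the restarter cron added on 20th May works. As a minimal sketch only, assuming it probes a tape daemon a few times and restarts it when every probe fails, the decision logic might look like the following. The function name `restart_if_unhealthy`, its parameters, and the retry policy are all hypothetical and are not taken from the RAL scripts.

```python
import time
from typing import Callable


def restart_if_unhealthy(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    attempts: int = 3,
    delay_seconds: float = 0.0,
) -> bool:
    """Probe up to `attempts` times; restart only if every probe fails.

    All names here are illustrative assumptions, not the actual RAL restarter.
    Returns True if a restart was triggered, False otherwise.
    """
    for attempt in range(attempts):
        if is_healthy():
            # One successful probe is enough: leave the service alone.
            return False
        if attempt < attempts - 1:
            # Wait before the next probe so a transient hang can clear.
            time.sleep(delay_seconds)
    # Every probe failed, so trigger the restart action exactly once.
    restart()
    return True
```

A cron job would call this with site-specific callables: a probe that checks the tape daemon, and a restart action. Retrying before restarting avoids bouncing a daemon that is only briefly slow while the library is recovering.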

Incident details

Put a reasonably detailed description of the incident here.


Analysis

This section to include a breakdown of what happened. Include any related issues.


Follow Up

This is what we used to call future mitigation. Include specific points to be done. It is not necessary to use the table below, but may be easier to do so.


Issue Response Done
Issue 1 Mitigation for issue 1. Done yes/no
Issue 2 Mitigation for issue 2. Done yes/no

Related issues

List any related issue and provide links if possible. If there are none then remove this section.


Reported by: Your Name at date/time

Summary Table

Start Date Date e.g. 20 July 2010
Impact Select one of: >80%, >50%, >20%, <20%
Duration of Outage Hours e.g. 3hours
Status select one from Draft, Open, Understood, Closed
Root Cause Select one from Unknown, Software Bug, Hardware, Configuration Error, Human Error, Network, User Load
Data Loss Yes/No