RAL-LCG2 Incident 20150408 network intervention preceding Castor upgrade

Description

Change control for Castor upgrade: https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=148453

Castor upgrade procedure: https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/CastorUpgradeTo211415

RT ticket tracking upgrade: https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=149684

The Castor team were aware of the network intervention but did not include it in their plan as it was believed to be minor.


The entire Tier 1 and all services were declared in downtime until the following day.

Timeline of the Incident

When What
8/04/2015 10:00 Downtime starts in the GOCDB. Very shortly afterwards the network dropped (the offices could not contact the machine room). Castor team (Rob A) spoke to Fabric. Martin Bly was surprised that the network had gone down, but he also believed Castor was already down.
8/04/2015 10:15 - 10:40 Network is restored and Castor is brought down cleanly. DB team is asked to start work. Rob A commits Quattor change for software upgrade (as planned).
8/04/2015 11:45 Castor change is abandoned and rollback starts. Note: the DB team never started the upgrade.

Incident details


Follow Up

Related issues

