RAL Tier1 Incident 20081124 Service outage following network reconfiguration


Site: RAL-LCG2

Incident Date: 2008-11-23 & 24

Severity: Field not defined yet

Service: FTS, LFC, RBs, SRM, CE, Batch farm (PBS master), Nagios

Impacted: All local VOs

Incident Summary: A network stack reconfiguration late on Sunday evening (for reasons unknown) broke connectivity to a number of nodes. These included the Nagios back end, so no call-out was made. A significant number of services were lost.

Type of Impact: Down

Incident duration: About 11 hours.

Report date: 2008-11-24

Reported by: Gareth Smith (based on information supplied by Martin Bly)

Related URLs: http://www.gridpp.rl.ac.uk/blog/2008/11/24/problem/

Incident details:


Date Time Who/What Entry
2008-11-23 23:00 (approx) A reconfiguration event on stack-2, due to a perceived fault, pushed unit 1 (the Nortel 5530) out of stack-2, and unit 3 became a temporary base for the remaining two units (both Nortel 5510s). Stack-2 was effectively cut off from the rest of the Tier1. Services affected included: RBs, MySQL databases, LFC and FTS Oracle databases, SL3 UIs, Nagios (including call-outs), and all but two of the SRMv2 hosts.
2008-11-24 08:25 (approx) M.Bly Unable to: ping units on stack-2; access stack-2 using the Nortel Java tool; ping stack-2 itself.

Connected a laptop to stack-2 unit 1 via the serial console port and logged in. Noted no data in the logs. Issued 'stack reset' to stack-2 at approximately 08:33. Stack-2 recovered at ~08:35.

2008-11-24 after 09:00 M.Bly It appears that none of the stacks were logging to the Tier1 central logger service, although one or two were logging to one of the Networking group servers.

Turned on remote logging (such as it is) on all the stack units and directed them to log to one of the system loggers (logger1).

2008-11-24 09:00 to 10:00 Admin On Duty (M.Hodges) Checked that services were back.

There is no log information giving any indication of why the stack behaved as it did.

The Nagios system was affected because it could not contact its MySQL database on lcgsql0363 (stack-2 unit 2).
The RBs all live on stack-2 unit 2. Both the LFC and FTS Oracle databases (lcgdb00, lcgdb98) are hosted on stack-2 unit 2. Several other service machines were also cut off. Stack-2 unit 3 hosts only batch workers.
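
Much of the diagnosis above came down to simple reachability tests run by hand. The sketch below (a minimal, hypothetical Python check, suitable for cron on a host outside stack-2) shows the sort of out-of-band test that would have flagged the isolation even with Nagios wedged. The host names are those mentioned in this report; the ports (MySQL 3306, Oracle listener 1521) are assumed defaults and may differ on site.

  #!/usr/bin/env python
  # Hypothetical out-of-band reachability check, independent of Nagios.
  # Host names are taken from this report; the ports are assumed defaults.
  import socket
  import sys

  CHECKS = [
      ("lcgsql0363", 3306),  # Nagios MySQL back end (stack-2 unit 2)
      ("lcgdb00", 1521),     # LFC Oracle database (assumed listener port)
      ("lcgdb98", 1521),     # FTS Oracle database (assumed listener port)
  ]

  def reachable(host, port, timeout=5):
      """Return True if a TCP connection to host:port succeeds."""
      try:
          sock = socket.create_connection((host, port), timeout)
          sock.close()
          return True
      except (socket.error, socket.timeout):
          return False

  failures = [(h, p) for h, p in CHECKS if not reachable(h, p)]
  for host, port in failures:
      print("UNREACHABLE: %s:%d" % (host, port))
  # A non-zero exit status lets cron or a pager gateway raise a call-out.
  sys.exit(1 if failures else 0)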


Future mitigation:

Issue Response
Lack of logging from the network boxes. Logging now enabled to the central logger (a verification sketch is given below).
Verification of firmware versions in switches and cable connections. All switches in this stack run the same firmware version; they would not be able to stack together if they did not. Flapping connections would have shown up in the logs.
Improve detection of failure of the monitoring system. The Nagios system itself is monitored. In this case Nagios lost access to its back-end database and was stuck, rather than having crashed (a freshness-check sketch is given below). (Note: the daily check of the pager at about 4pm would have detected this.) The Nagios monitoring system should be moved so that both the monitoring processes and the back-end database are hosted on the same system. A further step for future consideration is to move both to more resilient hardware. This system should be in a rack with dual uplinks (see next point). A future upgrade to Nagios version 3 should also be considered, as it would improve the resilience of Nagios itself (more multi-threaded code). However, such an upgrade would need careful planning and timing to ensure that all Nagios configuration changes necessitated by the new version are sufficiently tested.
Should we dual the uplinks? We can dual the uplinks, but as each stack currently relies on a single unit to provide its 10Gbit uplinks, this would only be effective if each stack were given an additional unit (Nortel 5530). Around nine extra 5530s would be needed to cover all existing stacks, at significant cost. However, this should be done for the stack that contains the monitoring.
Would we have handled the failure successfully if we had managed to call out? The on-call would not have been able to diagnose this remotely; it would have looked like a power failure to the affected racks, requiring on-site attendance. However, an escalation of the problem should then have followed and would be expected to lead to a resolution, although on a longer timescale.
Could staff have sorted it out in the middle of the night, and do we have the right spares etc.? If the on-call had attended site and diagnosed that the switches were at fault, the simple solution would have been to power them off and on again; the stack should recover from this particular fault. However, other types of fault are more complex and require more expert intervention on the switch.

However, this incident has revealed that we do not have spare 5530s or 5510s within the team. We can obtain switches from the Networking Group, but not so easily out of hours. Spares of these units should be obtained. Likewise, an up-to-date network diagram needs to be made available to those on call.
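
On the remote-logging point above: a quick end-to-end check that the central logger is actually receiving syslog traffic could look like the following Python sketch, run from any Tier1 host. "logger1" is the hostname used in this report; UDP port 514 and the local0 facility are assumptions about the site configuration.

  #!/usr/bin/env python
  # Minimal sketch: send a test syslog message towards the central logger so
  # that end-to-end delivery of remote logs can be confirmed by inspection.
  # Port 514/UDP and the local0 facility are assumptions.
  import logging
  import logging.handlers

  handler = logging.handlers.SysLogHandler(
      address=("logger1", 514),
      facility=logging.handlers.SysLogHandler.LOG_LOCAL0,
  )
  log = logging.getLogger("remote-logging-test")
  log.addHandler(handler)
  log.setLevel(logging.INFO)

  # The message should then appear in logger1's syslog files; if it does not,
  # messages from the switches are unlikely to be arriving either.
  log.info("test message for Tier1 network stack remote-logging check")

This only exercises the logger1 side; the switches themselves still need their remote logging target and log level checked separately.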
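
On detecting failure of the monitoring system itself: one common technique is to verify, from a separate host, that the Nagios status file is still being rewritten, since a wedged daemon (as in this incident) keeps running but stops updating it. A hypothetical sketch follows; the status file path and the 15-minute threshold are assumptions and would need adjusting to the local installation.

  #!/usr/bin/env python
  # Sketch of a 'monitor the monitor' check: Nagios rewrites its status file
  # at a regular interval, so a stale file indicates a stuck daemon as well
  # as a crashed one. The path and threshold below are assumptions.
  import os
  import sys
  import time

  STATUS_FILE = "/var/nagios/status.dat"   # assumed location
  MAX_AGE = 15 * 60                        # seconds before Nagios is deemed stuck

  try:
      age = time.time() - os.stat(STATUS_FILE).st_mtime
  except OSError:
      age = None

  if age is None or age > MAX_AGE:
      # Run from cron on a host outside the monitored stack and wire the
      # non-zero exit status to the pager.
      print("CRITICAL: Nagios status file missing or stale (age=%s)" % age)
      sys.exit(2)
  print("OK: Nagios status file updated %.0f seconds ago" % age)
  sys.exit(0)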

Related issues: None.

Timeline

Date Time Comment
Actually Started 2008-11-23 23:00 (approx) Initial network stack reconfiguration
Fault first detected 2008-11-24 08:25 (approx) Staff arrived at work.
First Advisory Issued 2008-11-24 08:25 to 10:00 Broadcast as Unscheduled Outages announced via GOC DB.
First Intervention 2008-11-24 08:33 (approx) to 10:00 Reset of stack.
Fault Fixed 2008-11-24 10:00 All services verified as up.
Announced as Fixed 2008-11-24 10:00 Broadcast at end of Unscheduled Outage via GOC DB.
Downtime(s) Logged in GOCDB 2008-11-24 08:25 to 10:00 Unscheduled outage. Whole site.
E-mail to csf-l@jiscmail.ac.uk 2008-11-24 08:58
EGEE Broadcast 2008-11-24 11:10 (approx) To VO managers.
E-mail to atlas-uk-comp-operations@cern.ch 2008-11-24 11:14
E-mail to gridpp-users@jiscmail.ac.uk 2008-11-24 11:37 (delayed initially by a typo in the address)