RAL Tier1 Incident 20150408 network intervention preceding Castor upgrade
RAL-LCG2 Incident 20150408 network intervention preceding Castor upgrade
On 8th April two apparently straightforward network changes were made at the start of a planned upgrade to the Castor system. A network problem was triggered that took most of the day to resolve (and was not finally cleared until the following morning). The Castor update had to be backed out and there were some problems in doing this.
The entire Tier 1 and all services were declared in downtime until the following day. Facilities services hosted on the Tier1 network were also affected for most of the day (8th April).
Timeline of the Incident
|8/04/2015 8:10||Condor set not to start jobs for non-LHC VOs. Reminder email sent to Gridpp users mailing list.|
|8/04/2015 10:00||Downtime starts in the GOCDB. Very shortly afterwards network connectivity between the offices and the machine room dropped. Castor team (RA) spoke to Fabric. MB was surprised that the network had gone down but he also believed Castor was already down.|
|8/04/2015 10:31||GOC DB entry added "Networking problem at RAL, we are investigating"|
|8/04/2015 10:15 - 10:40||Network is restored and Castor is brought down cleanly. DB team asked to start work. RA commits Quattor change for software upgrade (as planned).|
|08/04/2015 10:34 - 10:58||UKLR links down|
|8/04/2015 11:05||Very many alarms due to network problems.|
|08/04/2015 11:28 - 11:56||UKLR links down|
|8/04/2015 11:45||Castor change is abandoned and rollback starts. Note: DB team never started upgrade, the time was spent on DB backups.|
|8/04/2015 13:40||JK puts entire site into downtime until 18:00.|
|08/04/2015 13:53-14:50||UKLR links down|
|8/04/2015 14:00||MB, TF and JA spend several hours attempting to restore functionality and stability to routers and core switches.|
|08/04/2015 14:07||swt-z9000-1 rebooted by MB.|
|8/04/2015 15:00(ish)||Work done to see if the problem was at the x670 switches. At the time there was no network access. The assumption was that if we swapped onto the rtr-x670-1 switch and the network came back to life (even if only for a few minutes), this would indicate that the problem was at the switch end. A swap to -1 was made and did not appear to make any difference (on later reflection it could not have, as that switch was still in its inactive state). The MASTER function was flipped back to rtr-x670-2. This appears to have made things work again.|
|8/04/2015 15:56||AoD sends message to Facilities OPS list informing them of problems.|
|8/04/2015 17:00 - 17:20||Review meeting with MB, JA, JK and RA. Network has been reverted to original state. We agreed to bring Castor back up.|
|8/04/2015 17:30||Castor is brought up, however basic Castor tests do not run cleanly even with no load on service. People have gone home so Castor is left 'up' but site is declared in downtime until 14:00 the following day.|
|8/04/2015 18:07||Next GOC DB entry "Network problems continue at RAL"|
|8/04/2015 18:10||AoD sends message to Facilities-OPS list updating them on problems.|
|8/04/2015 17:30 - 21:30||AD speaks to RA and JK to understand extent of problems. We check if there is any disaster recovery documentation on twiki (there is but it is from 2009). Email out plan for following day.|
|8/04/2015 18:57||Next GOC DB entry - for overnight "Network problems continue at RAL"|
|9/04/2015 09:15||CASTOR tests all reporting green, errors seen last night no longer visible.|
|9/04/2015 09:30||Meeting in production area. There was concern about high CPU load on 2 switches and unbalanced network links. Decided to run a few more Castor tests, run ping tests and then reboot the Z9000 #2.|
|9/04/2015 09:58||Update to Tier1 Dashboard "There are problems on the Tier1 network. We are working on it and expect a further update later today."|
|9/04/2015 10:00||ping tests OK, MB reboots the second Z9000. After a settling down period and further tests showing green, it was decided to end the downtime at 11:00. VOs informed.|
|10/04/2015 17:00||Final problems with Disk servers resolved. (Castor Team - following notification of LHCb problem from RN).|
|13/04/2015 12:00||In response to a ticket from t2k, the non-LHC VOs were re-enabled on the batch farm.|
The opportunity was being taken during a planned outage of Castor to make a change to the underlying network while Castor was down. This network change was planned to take place at the start of the Castor intervention (i.e. when Castor was down but before its changes were made). The Castor team were aware of the network intervention but did not include it in their written plan as it was believed to be minor.
The pre-existing plan was to change the connection to the network switch connecting the Castor headnodes from a single link to a resilient link (to a pair of 4810 switches). This change was scheduled to be done ahead of the Castor upgrade (but after Castor was shut down). This simple change was not the subject of a separate change control, but was considered a 'standard change'. However, this change required some preparatory work (laying in cables etc.) which had not been completed by the end of the previous day. It was therefore agreed on the previous day that this change would be postponed.
However, staff arrived early on the morning of 8th April and the necessary preparations for the network change were completed before the planned Castor outage. Following a discussion on the morning of 8th April, involving the person running the Castor change, the Admin on Duty and Fabric Team staff, it was agreed the change would now go ahead.
It was also agreed on that morning to make a change to the network connectivity of the High Availability CVMFS stratum 1 servers. This involved moving their connections from one switch to another. This change was independent of the previous change, and did not directly involve the Castor system. However, it did involve moving cables from the switch used in the non-resilient connection to the Castor headnodes. This change was agreed between the Fabric Team and the CVMFS service owner. This change was to take place following the first network change.
The first network change started early, before Castor services were stopped. This initial change took place at around (or even slightly before) the announced start time for the whole intervention. The first steps in the network change were to disconnect the existing link cable to the switch hosting the Castor headnodes and reconnect that switch to the network mesh. Around this time network problems were noticed by staff working in offices. These were seen as a loss of access to the Castor headnodes. It is not now known whether network problems were more widespread at this point. It is most likely that this loss of access was caused by the network intervention.
Shortly thereafter the next step of the network change was undertaken. However, an error (re-plugging the wrong end of the cable) caused a network loop. This immediately caused a packet storm. In response to this the UKLight router cut the link to the Tier1 network (09:23 GMT from switch logs). After around 30 seconds the staff making the change realized there was a problem, although they were not aware of its full implications, and the cable was again disconnected. The cabling was subsequently put back to its original configuration. A member of staff arrived in the machine room to inform the person working on the network change that there were network problems.
From this point there were significant problems with the Tier1 network, with packet loss evident. High load was also seen on the two 4810 switches (involved in the Castor network change), and the network change for the Castor headnodes was reverted. The network problems were contained within the Tier1 network. They did not propagate to the RAL site network. At around 11:45 it was decided to back out of the planned Castor update.
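The way a single mis-plugged cable produces a loop can be illustrated with a small check over a cabling plan. The following is only a sketch under stated assumptions: the switch names are hypothetical, the network is modelled as a plain undirected graph of layer-2 links, and no account is taken of spanning tree or link aggregation, which in a real network make some physical cycles intentional and safe.

```python
def find(parent, node):
    # Follow parent pointers to the representative of the node's set,
    # shortening the path along the way.
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def links_creating_loops(links):
    """Return the links that close a cycle in an undirected link graph."""
    parent = {}
    looping = []
    for a, b in links:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a == root_b:
            # Both ends are already connected, so this link closes a loop.
            looping.append((a, b))
        else:
            parent[root_a] = root_b
    return looping


# Hypothetical switch names, for illustration only.
existing = [("sw-mesh", "sw-4810-a"), ("sw-mesh", "sw-4810-b"), ("sw-mesh", "sw-head")]
# A cable plugged in at the wrong end, joining two switches that already have a path.
mistaken = existing + [("sw-head", "sw-4810-a")]

print(links_creating_loops(existing))
print(links_creating_loops(mistaken))
```

A check of this kind only works on the plan as written down; it would not have caught a cable physically plugged into the wrong port, which is why verifying each step before starting the next matters.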
The problems persisted for some hours. During the afternoon, although there was still packet loss, the network was more stable. Efforts were focused on fixing the 'bypass' link to the OPN (to CERN) - as it was noted that this link had died. The paired link from the Tier1 to the UKLight router wouldn't come up. Time was spent during the afternoon trying to fix that problem as it was not clear to the person working on the problem that the rest of the Tier1 network was in trouble.
It became clear that one of the links to the UKLight router had failed some days earlier (on 2nd April) and this was confusing the investigation of the problems on that link. Furthermore the exact pattern of network breaks to services hosted on the Tier1 network during the day is not known, although there appear to have been two extended outages affecting access to the Hermes database. It is not understood why this was the case.
During the afternoon effort re-focused on the main Tier1 network - and it was realized that the two z9000 switches were having problems. One of these (swt-z9000-1) was rebooted (the Observium log gives 14:07 - possibly GMT, though clock skew is a possibility). This cleared up the network sufficiently to test Castor.
The following morning traffic patterns through the second z9000 were still anomalous. This unit was then restarted (at 09:59) and that seemed to clear the problems completely.
In the morning the upgrade of Castor had been started. As a result of the network problems it was decided that the Castor update should be abandoned and the steps taken thus far (rolling out some software updates) reverted. (No changes had been made to the Castor databases.) During the afternoon problems were encountered with the Castor downgrade. The Quattor rollback reverted the RPMs but for some reason some symlinks were left in place.
There was some confusion about leaving Castor down overnight between the 8th/9th April. Rather than switching Castor off it was left running but declared down in GOC DB.
The network problems seen were almost certainly triggered by mis-connecting a cable causing a network loop. The resulting packet storm then put some network devices (notably the pair of z9000 switches) into a bad state. For much of the working day of 8th April it was not clear what the underlying problem was. During this time the Tier1 network was only partly functioning.
The analysis shows a lack of co-ordination between the network changes and the start of the Castor intervention. The network changes were not included in the written change plan and there were very late changes to what was to be done, both of which made the resulting problems more likely to occur. One of the network changes was backed out the previous day and then re-instated that morning. The second was only agreed on the morning. The first network change was begun before Castor had been stopped and the Castor team were not aware it had begun. The second network change was undertaken before the first had been checked and verified as successfully completed. Although the network changes were in themselves straightforward, these conditions made errors more likely, and made it more likely that those errors would produce a more complex problem.
The problem started during the morning but it was not until the end of the afternoon that a review meeting took place and made an assessment of the situation. The lack of senior staff present that day, combined with a generally lower number of staff present, partly underlies this. It is possible that, had the situation been appropriately reviewed earlier, the team working on the problem could have focused on the key problem and solved it more quickly. This lack of co-ordination can also be seen in the confusion about the agreed state of Castor overnight as well as the omission of the restart of the batch jobs for the non-LHC VOs. The existing, but previously unnoticed, failure of one of the multi-links to the UKLight router was a cause of confusion and should have been detected earlier.
The reversion of the Castor upgrade was problematic as detailed above. Although a backout procedure for the upgrade had been previously agreed it had not been tested. Notably the action that was taken as being straightforward (to revert the change in Quattor) did not completely reverse all the changes made.
|Reversion Plan for Castor Update had not been tested and did not run smoothly.||Ensure the Change Control Process examines whether reversion plans are tested appropriately - including software roll-back.||No|
|Lack of coordination of the overall plan led to actions being taken out of sequence and not being tested before the next step proceeded.||Update guidelines for interventions to ensure that there is always a clearly nominated coordinator for the overall change and that there are sufficient staff available to deal with both the planned change and the likely consequences in the event of a problem.||No|
|Lack of appropriate and timely review of the ongoing problems leading to effort not being directed on the main problem and confusion as to what was happening.||Ensure there is a check that there are enough staff available to cover not only the change itself but also in the event the change does not go smoothly. Set clearly defined escalation times for significant problems/outages to be reviewed as a guide to the Change Coordinator and Admin On Duty. Update documentation to this effect and ensure staff are aware.||No|
|Network changes were not part of the change plan and were not reviewed beforehand.||The process for assessing network changes to be reviewed. This to consider a tightening of those changes considered 'standard' and not needing a change control review.||No|
|Earlier failure of one of a pair of network links caused confusion.||Monitor for failures within multi-link network connections.||No|
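The last action above (monitoring for failures within multi-link connections) could be approached along the following lines. This is a minimal sketch, not the monitoring actually deployed: the bundle and port names are hypothetical, and the member states are assumed to have already been collected (for example from SNMP or switch logs) into a simple dictionary. The point is that a bundle with one failed member still carries traffic, so it must be reported as degraded rather than treated as healthy.

```python
def bundle_status(members):
    """Classify one multi-link bundle from its member link states."""
    if not members:
        return "unknown"  # no member data collected for this bundle
    up = sum(1 for state in members.values() if state == "up")
    if up == len(members):
        return "ok"
    # Some members still up means traffic flows, but resilience is lost.
    return "degraded" if up > 0 else "down"


def bundles_needing_attention(bundles):
    """Return the status of every bundle that is not fully healthy."""
    report = {}
    for name, members in bundles.items():
        status = bundle_status(members)
        if status != "ok":
            report[name] = status
    return report


# Hypothetical bundle and port names, for illustration only.
observed = {
    "uplink-a": {"port-1": "up", "port-2": "down"},
    "uplink-b": {"port-1": "up", "port-2": "up"},
}

print(bundles_needing_attention(observed))
```

Raising an alarm on the "degraded" state would have flagged a member-link failure of the kind that went unnoticed from 2nd April, before it could complicate the investigation of a later incident.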
Reported by: Alastair Dewhurst (9th April 2015) / Gareth Smith (1st May 2015)
|Start Date||8th April 2015|
|Duration of Outage||24 hours|
|Root Cause||Human Error|
The following links are for internal use.
Change control for Castor upgrade: https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=148453
Castor upgrade procedure: https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/CastorUpgradeTo211415
RT ticket tracking upgrade: https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=149684