RAL Tier1 Incident 20150408 network intervention preceding Castor upgrade
RAL-LCG2 Incident 20150408 network intervention preceding Castor upgrade
Change control for Castor upgrade: https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=148453
Castor upgrade procedure: https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/CastorUpgradeTo211415
RT ticket tracking upgrade: https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=149684
The Castor team were aware of the network intervention but did not include it in their plan as it was believed to be minor.
Description
Impact
The entire Tier 1 and all services were declared in downtime until the following day.
Timeline of the Incident
When | What |
---|---|
8/04/2015 8:10 | Condor set not to start jobs for non-LHC VOs. Reminder email sent to gridpp users mailing list. |
8/04/2015 10:00 | Downtime starts in the GOCDB. Very shortly afterwards network connectivity between the offices and the machine room dropped. Castor team (RA) spoke to Fabric. MB was surprised that the network had gone down but he also believed Castor was already down. |
8/04/2015 10:31 | GOC DB entry added "Networking problem at RAL, we are investigating" |
8/04/2015 10:15 - 10:40 | Network is restored and Castor is brought down cleanly. DB team asked to start work. RA commits quattor change for software upgrade (as planned). |
8/04/2015 11:05 | (Thousands of) alarms due to network problems. |
8/04/2015 11:45 | Castor change is abandoned and rollback starts. Note: DB team never started upgrade, the time was spent on DB backups. |
8/04/2015 13:40 | JK puts entire site into downtime until 18:00. |
8/04/2015 14:00 | MB, TF and JA spend several hours attempting to restore functionality and stability to routers and core switches. |
8/04/2015 17:00 - 17:20 | Review meeting with MB, JA, JK and RA. Network has been reverted to original state. We agreed to bring Castor back up. |
8/04/2015 17:30 | Castor is brought up, however basic Castor tests do not run cleanly even with no load on service. People have gone home so Castor is left 'up' but site is declared in downtime until 14:00 the following day. |
8/04/2015 18:07 | Next GOC DB entry "Network problems continue at RAL" |
8/04/2015 17:30 - 21:30 | AD speaks to RA and JK to understand extent of problems. We check if there is any disaster recovery documentation on twiki (there is but it is from 2009). Email out plan for following day. |
8/04/2015 18:57 | Next GOC DB entry - for overnight "Network problems continue at RAL" |
9/04/2015 09:15 | CASTOR tests all reporting green, errors seen last night no longer visible. |
9/04/2015 09:30 | Meeting in production area. There was concern about high CPU load on 2 switches and unbalanced network links. Decided to run a few more Castor tests, run ping tests and then reboot the Z9000 #2. |
9/04/2015 09:58 | Update to Tier1 Dashboard "There are problems on the Tier1 network. We are working on it and expect a further update later today." |
9/04/2015 10:00 | ping tests OK, MB reboots the second Z9000. After a settling down period and further tests showing green, it was decided to end the downtime at 11:00. VOs informed. |
10/04/2015 17:00 | Final problems with Disk servers resolved. (Castor Team - following notification of LHCb problem from RN). |
13/04/2015 12:00 | In response to a ticket from t2k, the non-LHC VOs were re-enabled on the batch farm. |
Incident details
The opportunity was being taken during a planned outage of Castor to make a change to the underlying network while Castor was down. This network change was planned to take place at the start of the Castor intervention (i.e. when Castor was down but before the Castor changes themselves were made).
The pre-existing plan was to change the connection to the network switch connecting the Castor headnodes from a single link to a resilient link (to a pair of 4810 switches). This change was scheduled to be done ahead of the Castor upgrade (but after Castor was shut down). This simple change was not the subject of a separate change control, but was considered a 'standard change'. However, this change required some preparatory work (laying in cables etc.) which had not been completed by the end of the previous day. It was therefore agreed on the previous day that this change would be postponed.
However, staff arrived early on the morning of 8th April and the necessary preparations for the network change were completed before the planned Castor outage. Following a discussion on the morning of the 8th April, involving the person running the Castor change, the Admin on Duty and Fabric Team staff, it was agreed the change would now go ahead.
It was also agreed on that morning to make a change to the network connectivity of the High Availability CVMFS stratum 1 servers. This involved moving their connections from one switch to another. This change was independent of the previous change, and did not directly involve the Castor system. However, it did involve moving cables from the switch used in the non-resilient connection to the Castor Headnodes. This change was agreed between the Fabric Team and the CVMFS service owner. This change was to take place following the first network change.
The first network change started early, before Castor services were stopped. This initial change took place at around (or even slightly before) the announced start time for the whole intervention. The first steps in the network change were to disconnect the existing link cable to the switch hosting Castor headnodes and reconnect that switch to the network mesh. Around this time network problems were found by staff working in offices. These were seen as a loss of access to the Castor Headnodes. It is not now known if network problems were more widespread at this point. It is most likely this loss of access was caused by the network intervention.
Within around 5 minutes(??) the next step of the network change was undertaken. However, an error (re-plugging the wrong end of the cable) caused a network loop. This immediately caused a packet storm. In response to this the UKLight router cut the link to the Tier1 network. (Time to be confirmed). After around 30 seconds the staff making the change realised there was a problem and the cable was again disconnected. At the time of the post mortem discussion it was not clear whether the cabling was then correctly changed to the new configuration for the second change or just put back as it was. A member of staff arrived in the machine room to inform the person working on the network change that there were network problems.
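The way a single mis-plugged cable closes a loop can be illustrated with a small check over a cabling list. This is only a sketch under stated assumptions: the device names (`core`, `access-1`, `access-2`, `new-sw`) are hypothetical and do not correspond to the Tier1 equipment, each cable is treated as a plain layer 2 link, and deliberate redundancy such as link aggregation or spanning tree blocking is ignored. It uses a union-find structure to flag any cable whose two ends are already connected by another path.

```python
def find_loops(links):
    """Return the links that close a loop in an undirected cabling graph."""
    parent = {}

    def find(node):
        # Each device starts as the root of its own group.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    redundant = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            # Both ends were already reachable from each other: this cable closes a loop.
            redundant.append((a, b))
        else:
            parent[root_a] = root_b
    return redundant


# Hypothetical cabling: the new cable should run from access-2 to new-sw.
intended = [("core", "access-1"), ("core", "access-2"), ("access-2", "new-sw")]
# Same cable with the wrong end plugged in, landing on access-1 instead.
miswired = [("core", "access-1"), ("core", "access-2"), ("access-2", "access-1")]

print(find_loops(intended))  # []
print(find_loops(miswired))  # [('access-2', 'access-1')]
```

A check of this kind could in principle be run against a planned cabling list before a change, although it says nothing about what is physically plugged in.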
From this point there were significant problems with the Tier1 network, with packet loss observed. High load was also seen on the two 4810 switches (involved in the Castor network change), and the network change for the Castor headnodes was reverted. The network problems were contained within the Tier1 network. They did not propagate to the RAL site network.
The problems persisted for some hours. During the afternoon, although there was still packet loss, the network was more stable. Efforts were focused on fixing the 'bypass' link to the OPN (to CERN) - as it was noted that this link had died. The paired link from the Tier1 to the UKLight router wouldn't come up. Time was spent during the afternoon trying to fix that problem as it was not clear to the person working on the problem that the rest of the Tier1 network was in trouble.
It became clear that one of the links to the UKLight router had failed some days earlier (on the 2nd April) and this was confusing the investigation of the problems on that link. Furthermore the exact pattern of network breaks to services hosted on the Tier1 network during the day is not known, although there appear to have been two extended outages affecting access to the Hermes database. It is not understood why this was the case.
During the afternoon effort re-focused on to the main Tier1 network - and it was realised that the two Z9000 switches were having problems. Around 4pm the first was rebooted (time TBC). This cleared up the network sufficiently to test Castor.
The following morning traffic patterns through the second Z9000 were still anomalous. This unit was then restarted and that seemed to clear the problems completely.
In the morning the upgrade of Castor had been started. As a result of the network problems it was decided that the Castor update should be abandoned and the steps taken thus far (rolling out some software updates) reverted. (No changes had been made to the Castor databases). During the afternoon problems were encountered with the Castor downgrade. The Quattor rollback reverted the RPMs but for some reason some symlinks were left in place.
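One way to spot leftovers of this kind is to record the symlinks under the affected directories before the change and compare that record with the state after the rollback. The following is a minimal sketch, not the procedure used at the time: the function names `symlink_snapshot` and `symlink_drift` are made up for illustration, and which directories to scan on a Castor node is left as an assumption for the operator.

```python
import os


def symlink_snapshot(root):
    """Map each symlink path under root to its target, without following links."""
    snapshot = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Symlinks to directories are listed in dirnames, all others in filenames.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                snapshot[path] = os.readlink(path)
    return snapshot


def symlink_drift(before, after):
    """Return (added, removed, changed) symlinks between two snapshots."""
    added = {p: t for p, t in after.items() if p not in before}
    removed = {p: t for p, t in before.items() if p not in after}
    changed = {
        p: (before[p], after[p])
        for p in before.keys() & after.keys()
        if before[p] != after[p]
    }
    return added, removed, changed
```

Taking `symlink_snapshot` before the upgrade and again after the rollback, then passing both to `symlink_drift`, would list any links that the rollback added, removed or re-pointed. An empty result in all three is what a clean reversion would be expected to show.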
There was some confusion about leaving Castor down overnight between the 8th/9th April. Rather than switching Castor off it was left running but declared down in GOC DB.
Analysis
The network problems seen were almost certainly triggered by mis-connecting a cable causing a network loop. The resulting packet storm then put some network devices (notably the pair of z9000 switches) into a bad state. For much of the working day of 8th April it was not clear what the underlying problem was. During this time the Tier1 network was only partly functioning.
The analysis shows a lack of co-ordination between the network changes and the start of the Castor intervention. That this occurred was made more likely by late changes to the plan. One of the network changes was backed out of the previous day and then re-instated that morning. The second was only agreed on the morning. The first network change was begun before Castor had been stopped and the Castor team were not aware it had begun. The second network change was undertaken before the first had been checked and fully verified to be completed successfully. Although the network changes were in themselves straightforward, these conditions made errors more likely, and made it more likely that those errors would produce a more complex problem.
The problem started during the morning but it was not until the end of the afternoon that a review meeting took place and made an assessment of the situation. The lack of senior staff present that day, combined with a generally lower number of staff present, partly underlies this. It is possible that had the situation been appropriately reviewed earlier then the team working on the problem could have focused on the key problem and solved it more quickly. This lack of co-ordination can also be seen in the confusion about the agreed state of Castor overnight as well as the omission of the restart of the batch jobs for the non-LHC VOs.
The reversion of the Castor upgrade was problematic as detailed above. Although a backout procedure for the upgrade had been previously agreed it had not been tested. Notably the action that was taken as being straightforward (to revert the change in Quattor) did not completely reverse all the changes made.
Action: Being short of staff affected things more than you might expect.
- Lack of review at (say) lunchtime on first day
- Less technical backup for those firefighting....
- Lack of notes/timeline information for the PM.
Follow Up
Issue | Response | Done |
---|---|---|
Reversion Plan for Castor Update had not been tested and did not run smoothly. | Ensure the Change Control Process examines whether reversion plans are tested appropriately - including software roll-back. | No |
There were few additional staff (including senior staff) available when the change went wrong. This became apparent in the lack of co-ordination of the response. | Ensure the Change Control Process checks that there are enough staff available to cover not only the change itself but also the event that the change does not go smoothly. | No |
There was a lack of co-ordination as the change started. | It must be clear who will co-ordinate each change. | No |
Reported by: Your Name at date/time
Summary Table
Start Date | 18th February 2014 |
Impact | Select one of: >80%, >50%, >20%, <20% |
Duration of Outage | 2 X 3hour outages, 6hours in total |
Status | Draft |
Root Cause | Select one from Unknown, Software Bug, Hardware, Configuration Error, Human Error, Network, User Load |
Data Loss | Yes/No |