RAL Tier1 Incident 20090810 Air Conditioning failure


Air Conditioning Failures

Site: RAL-LCG2

Incident Date: 2009-08-10 & 2009-08-11/12

Severity: Tier 1 Disaster Management Process Level 3.

Service: Storage (Castor) and Batch

Impacted: All local VOs

Incident Summary: Air conditioning failure on Monday 10th August, from which systems were recovered by early that evening. A second failure during the night of 11/12 August resulted in a prolonged outage of Castor and batch services.

Type of Impact: Down

Incident duration: First incident 5 hours. Second incident 6 days.

Report date: 2009-08-18

Reported by: Gareth Smith

Status of this Report: Actions in progress. Main points of report reviewed internally.

Related URLs: Tier1 Blog entries at:

Post Mortem of Disk Server Loss at:

Incident Overview:

During the first incident on Monday 10th August the temperatures started rising at around 13:00, although Tier1 staff were not notified until just before 14:00. A decision was made initially to shed batch load. However, as this was ineffective in containing the temperatures, all batch systems were switched off, Castor was stopped and the disk servers were powered down. Various operational issues meant that the air conditioning was not restarted until 14:45. The Tier1 restart (power on) commenced at 15:45. Full service was restored at 19:00, although some disk servers were unavailable (none of which were in Disk1Tape0 service classes). The systems ran "At Risk" overnight.

The second failure occurred at 11pm on the night of 11/12 August (based on when temperatures started to rise). Tier1 staff were called out for a subsidiary problem at about 23:40. The on-call person investigated, realised that temperatures were rising, and contacted other Tier1 staff and Operations staff. An initial attempt to remotely power down Tier1 equipment was not successful. Staff attended on site and powered equipment down in the 'HPD' (High Power Density) part of the machine room. The following day the Tier1 disaster plan was invoked.

During Thursday and Friday the disk servers within the Tier1 were checked out for problems arising from the incident.

On Monday 17th the decision was taken to restart services. The outage on the Atlas, CMS and LHCb Castor instances was ended at 16:00 local time. Problems with the startup meant that the remaining Castor instance (GEN) was not available until the following morning. Batch services were also restarted the following morning, with all services available by 10am. One disk server (belonging to Atlas) failed with loss of data. Details of this disk server failure are in a separate incident report.


Investigations into the cause of the air conditioning failure were protracted. Unfamiliarity with the new installation, together with the need to respect contractual constraints while the building is still under warranty, slowed the investigation. Experiment representatives were asked for their requirements and time pressures at the regular Tier1-Experiments Liaison meeting held on the 12th August. The Tier1 Disaster Management process was invoked, with meetings held on the 12th, 13th and 18th August. At the meeting on Thursday (13th August) it was decided that, given the state of understanding at that point, the most likely scenario was to restart services on Monday (17th). On that Monday a review meeting within the Tier1 team agreed to proceed with the restart of services.

The cause of the first failure on Monday 10th August was a reload of the Building Management System (BMS) that controls the air conditioning system. The reload caused the pumps to stop. The chillers detected the resulting low pressure and also shut down. Once the system was off, a deadlock resulted: the pumps were stopped because of a valve closed by the chillers, and the chillers were stopped because of the lack of pressure from the pumps. A manual intervention was necessary to restart the system.

The second incident, on the night of the 11/12th, was triggered by the water pressure controls. A sensor reported an overpressure, leading to the BMS (Building Management System) shutting down the system. Since then the upper limit on the pressure sensor has been increased. Furthermore, the control system has been changed such that this sensor no longer automatically cuts the system, but instead triggers a callout. We are informed that this does not affect the safety of the system. However, an independent measurement showed that the water pressure was high at the time of the incident, and the cause of that is not understood.

Future mitigation:

Issue: Reliability of the Air Conditioning System.
Response: Each of the two air conditioning failures had a different cause. In the first case the restart of the BMS triggered the shutdown; this cause is understood and is mitigated by appropriate operational procedures. In the second outage, caused by an overpressure, a modification has been made so that this condition no longer automatically shuts down the system. A further review of the air conditioning system, looking for single points of failure etc., is planned.

Issue: Insufficient monitoring of temperatures and associated callout in the new building.
Response: The lack of sufficient monitoring has been addressed by the central operations team that manages the computer room. The Tier1 has also set up its own independent monitoring of temperatures, linked to its existing callout infrastructure.

Issue: Insufficiently quick response to air conditioning problems in the new machine room.
Response: The Central Operations and site maintenance teams are reviewing procedures. Some changes are already in place.

Issue: Lack of ability to manually initiate a rapid shutdown and power-down of equipment in the new machine room.
Response: A procedure to power down the Tier1 was under development at the time of the second failure, but was not fully in place. A procedure to carry out a soft (software) shutdown of the disk servers and batch workers (systems located in the High Power Density or "HPD" room) has been produced and documented. Similar procedures for equipment in the other rooms (the Lower Power Density ("LPD") and UPS rooms) remain to be set up.

Issue: Lack of automated response to temperature problems in the new machine room.
Response: The rapid rise in temperature seen in the 'HPD' room, around 1C every three minutes, means that an automated response is required. An initial version of this has now been deployed: it makes a soft intervention by stopping batch work, followed (if the temperature continues to rise) by a hard power-down of the equipment. A sketch of this escalation logic is given after this list. Further improvements remain to be agreed and implemented to make the procedure more robust and to reduce its chance of triggering on false alarms.

Issue: Length of time required to restart the Tier1 despite having checked out the hardware beforehand.
Response: The details of the issues encountered in the restart of Castor on 2009-08-17 are documented in the timeline below. Several issues were encountered, some of which could only be found at the time of a full restart. The updating of the CA list, rather than the CRLs, caused a significant delay; this mistake was more likely in this case because the CA list was also due for an update. Following this, the Castor team believed the CRLs were up to date and therefore looked elsewhere for the cause of the problems. Making the request verbally was the correct approach in this case, as the aim was to rapidly restore the service; however, the procedure should be modified so that a verbal request is followed up with a ticket. A check that the CRLs are valid, and an update of them if necessary, following a reboot of the disk servers and other nodes should be implemented; a sketch of such a check is also given below.
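
The escalation procedure described above can be illustrated with a short sketch. This is an illustration only, not the Tier1's deployed tool: the thresholds, the sensor file and the shutdown commands are hypothetical placeholders.

 #!/usr/bin/env python
 # Illustrative sketch of the soft-then-hard escalation described above.
 # Thresholds, sensor file and shutdown commands are hypothetical placeholders.
 import subprocess
 import time

 SOFT_LIMIT_C = 27.0   # assumed threshold: stop batch work above this
 HARD_LIMIT_C = 32.0   # assumed threshold: hard power-down of the HPD room
 POLL_SECONDS = 60     # the HPD room rose by roughly 1C every three minutes

 def read_room_temperature():
     # Placeholder: read the HPD room temperature (Celsius) from a file
     # written by the site's monitoring; the path is purely illustrative.
     with open("/var/run/hpd-room-temperature") as f:
         return float(f.read().strip())

 def stop_batch_work():
     # Soft intervention: ask the batch system to stop starting new jobs.
     # The command is a stand-in for the site's own drain procedure.
     subprocess.call(["/usr/local/sbin/drain-batch-queues"])

 def power_down_hpd_room():
     # Hard intervention: power off the equipment in the HPD room.
     # Again, the command is a stand-in for the site's own procedure.
     subprocess.call(["/usr/local/sbin/poweroff-hpd-room"])

 def main():
     soft_done = False
     while True:
         temperature = read_room_temperature()
         if temperature >= HARD_LIMIT_C:
             power_down_hpd_room()
             break
         if temperature >= SOFT_LIMIT_C and not soft_done:
             stop_batch_work()
             soft_done = True
         time.sleep(POLL_SECONDS)

 if __name__ == "__main__":
     main()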
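
Similarly, the post-reboot CRL check recommended in the last item could look something like the following. The certificates directory, the CRL file pattern and the name of the fetch script are assumptions based on a typical grid host layout, not the Tier1's actual configuration.

 #!/usr/bin/env python
 # Illustrative sketch: check CRL validity after a reboot and refresh if stale.
 # Paths and the fetch command are assumptions, not the Tier1's actual setup.
 import glob
 import subprocess
 from datetime import datetime

 CERT_DIR = "/etc/grid-security/certificates"   # assumed CA/CRL directory
 FETCH_CRL_CMD = ["/usr/sbin/fetch-crl"]        # assumed CRL fetch script

 def crl_next_update(path):
     # Use openssl to read the nextUpdate field of a CRL file,
     # e.g. "nextUpdate=Aug 20 10:00:00 2009 GMT".
     out = subprocess.check_output(
         ["openssl", "crl", "-noout", "-nextupdate", "-in", path])
     value = out.decode().strip().split("=", 1)[1]
     return datetime.strptime(value, "%b %d %H:%M:%S %Y %Z")

 def main():
     stale = []
     for crl in glob.glob(CERT_DIR + "/*.r0"):
         try:
             if crl_next_update(crl) < datetime.utcnow():
                 stale.append(crl)
         except (subprocess.CalledProcessError, ValueError, IndexError):
             stale.append(crl)   # treat unreadable CRLs as stale
     if stale:
         print("Stale or unreadable CRLs found, refreshing:", stale)
         subprocess.call(FETCH_CRL_CMD)
     else:
         print("All CRLs are valid.")

 if __name__ == "__main__":
     main()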

Related issues: None.

Timeline

Event Date Time Comment
Actually Started 2009-08-10 13:00 (approx) Temperature rising in computer room.
Fault first detected (by Tier1 team) 2009-08-10 14:00 Informed of Air Conditioning problem.
First Advisory Issued 2009-08-10 14:15 Broadcast sent; unscheduled outages announced via the GOC DB.
First Intervention 2009-08-10 14:15 (approx) Shutting down batch work. (Rapidly followed by shutting down and powering down disk systems.)
Fault Fixed 2009-08-17 N/A An idea of the fault was found on 2009-08-10. However, it was only on 2009-08-17 that there was sufficient confidence in the understanding of the problem to resume operations.
Announced as Fixed 2009-08-18 10:00 All services up following second A/C failure.
Downtime(s) Logged in GOCDB 2009-08-10 14:15 (??) to 19:00 Unscheduled outage. All Castor and batch (CEs).
E-mail to gridpp-users@jiscmail.ac.uk 2009-08-10 To be completed Multiple mails sent.
E-mail to atlas-uk-comp-operations@cern.ch 2009-08-10 To be completed Multiple mails sent.
EGEE Broadcast 2009-08-10 15:45 Plus further broadcasts.

Incident details Timeline - 1st A/C Failure:

Date Time Who/What Entry
2009-08-10 14:00 Martin Bly Informed of Air Conditioning problem.
2009-08-10 14:15 - 14:30 Tier1 staff Put CEs and SRMs in Downtime in the GOC DB until 18:00 (local time). E-mails to gridpp-users and Atlas UK Comp Operations. Daily WLCG meeting informed. Batch work stopped, Castor stopped. Turned off Castor certification systems. Powered off batch machines, disk servers and Castor systems.
2009-08-10 14:45 David Corney Both chillers are now working downstairs. Hopefully room will start to cool again. We wait for a more stable situation. However, we have already powered down all the batch workers and Castor (including disk servers).
2009-08-10 15:09 Hiten Patel E-mail from Hiten: “The chilled water system shutdown for reasons that we think we understand. Temperature has come down to NORMAL so we can now resume load.” Note this was sent to the “Atlas Platform Managers” list, which does not include the AoD. I (Gareth) received this privately; the AoD did not receive either this notification or the one sent at 13:58.
2009-08-10 15:45 Tier1 staff Fabric team restarting Castor disk servers. AoD sent an EGEE broadcast summarising the situation.
2009-08-10 16:00 Tier1 staff Disk servers powered back on. Fabric team doing a trawl to find which servers have problems. By 16:20, 27 disk servers were not responding (causes unknown as yet).
2009-08-10 17:00 Chris Reports that Castor is looking good (although a good number of servers are still unavailable). He is not sure about the status of tape migration yet.
2009-08-10 17:45 AoD Extended the outage in the GOC DB until 20:00. An update was sent (EGEE broadcast and e-mail to the GridPP-users and Atlas UK Comp Operations lists). Chris Kruk reports there is now only one D1T0 server (gdss124) not available. We are awaiting the results of the ongoing work on that server before re-enabling Castor.
2009-08-10 17:45 Tier1 Staff Castor has been restarted. SAM tests OK for Castor (SRMv2 tests). Derek enabling FTS. (5 minutes later, transfers can be seen going through OK.)
2009-08-10 19:00 Tier1 Staff Batch system started at 18:55. Outage ended in GOC DB at 19:00 (set to At Risk until tomorrow midday). E-mail to users by AoD.

Incident details Timeline - 2nd A/C Failure:

Date Time Who/What Entry
2009-08-11 23:40 James Thorne (Tier1 On-call) Called out for a disk server down. On investigating, realised temperatures were rising. Informed others (Computer Room operations team; Tier1 Manager (Andrew Sansum)). Attempted a remote shutdown but this did not work.
2009-08-12 00:30 James Thorne (Tier1 On-call) Arrives on site, shortly followed by the Tier1 Fabric Manager (Martin Bly). Start power-down of systems locally. Soft shutdown of disk and Castor systems, hard shutdown of batch CPUs.
2009-08-12 04:00 Tier1 Staff (Martin Bly, James Thorne) Leave site. No sign of a fix for the air conditioning.
2009-08-12 11:30 David Corney, Andrew Sansum, Gareth Smith Incident Review Meeting - defined us to be at level 3 in the Tier1 Disaster plan. Concluded that we cannot restart the service until we have enhanced monitoring and shutdown capability (1-2 days) and assurance that the machine room is able to offer a stable service (no ETA).
2009-08-13 11:30 Tier1 Disaster management team including representatives from outside the Tier1 (GridPP, experiments, STFC EScience department). Incident Review Meeting at level 3 in the Tier1 Disaster plan.
2009-08-14 11:30 RAL EScience Staff involved Review of the status of the air conditioning systems with input from the RAL engineer and operations manager.
2009-08-17 11:00 RAL Tier1 Staff Formally agree to the startup (which has already been initiated) and look at plans for enhanced manual monitoring of systems for the next couple of evenings. An automated power-down based on temperature has now been implemented.
2009-08-17 Lunchtime RAL Tier1 Castor team All CASTOR instances passed internal tests involving rfio quite early (before lunchtime). At this point we decided to open the SRMs. However, once the SRMs were opened we conducted SAM tests which initially gave mixed results: CMS and LHCb seemed fine; ATLAS produced timeouts; GEN produced gSOAP errors. The problem on ATLAS was quickly identified as a problem with the srmDaemons, all of which had failed to establish database connections. The reason why is unclear, but the logs showed the problem immediately.
2009-08-17 Early afternoon RAL Tier1 Castor team During the next few hours we saw random failures on OPS tests. These indicated expired CRLs. The Fabric team were asked to update the CRLs on all disk servers. CMS also reported a build-up of the migration queue.
2009-08-17 Mid afternoon Tier1 Team At this point there was a communications failure, and the CAs were updated on all disk servers rather than the CRLs.
2009-08-17 16:00 RAL Tier1 Staff Castor instances for Atlas, CMS and LHCb up and taken out of 'Outage' in the GOC DB. Some problems remain on the GEN instance and batch work was not restarted.
2009-08-17 17:00 (approx) Tier1 Castor Team Tim asked another member of the Castor team to restart the rtcpclientd daemon, since this was the cause of the increasing migration candidates. However, this was not performed due to other problems. Meanwhile, work continued on the GEN SRMs. No explanation was found for the gSOAP errors, and they seemed to occur almost immediately after a restart. Note that Rm and Ls requests were working, but Puts and Gets failed.
2009-08-17 18:00 (approx) Tier1 Castor Team Found that the GEN srmDaemons had not established database connections. These were restarted and the OPS tests got further (CRL failures). At about 18:15 the on-call person was asked to run the fetchCRL script on all disk servers. Following this, the OPS tests for all instances passed successfully.
2009-08-18 08:00 (approx) Tier1 Castor Team rtcpclientd was restarted on all Castor instances and migrations started working fine.
2009-08-18 09:45 RAL Tier1 Staff Castor team report all Castor instances OK. Batch systems restarted 09:35. Outage for CEs cleared in GOC DB at 09:56.
2009-08-18 11:00 Tier1 Disaster management team including representatives from outside the Tier1 (GridPP, experiments, STFC EScience department). Incident Review Meeting downgraded the seriousness to level 2 in the Tier1 Disaster plan.
2009-09-03 14:30 Members of Tier1 Team Post Mortem review of the incident. Post Mortem report (this page) updated and ongoing actions noted.