Tier1 Operations Report 2010-02-10

RAL Tier1 Operations Report for 10th February 2010.

Review of Issues during the week 3rd to 10th February 2010.

Following the restoration of services last Tuesday (2nd February), services have been largely stable.

  • The network intervention on Tuesday 9th February overran by an hour. The problems were with the OPN router, which has two CPU cards whereas most of the other routers have only one; having two CPU cards changes the upgrade procedure. In order to carry out the upgrade, one of the CPU cards was removed from the OPN router. This card can be upgraded offline but will need another outage to re-install it. Batch work was paused during the scheduled intervention period, but we did not manage to extend the pause to cover the extra hour.
  • Friday 5th February: A node in the Neptune Oracle RAC rebooted. Neptune is set up to delay cluster reconfiguration for 10 minutes after such an event, and while this delay is in progress the database does not serve requests; the reconfiguration therefore caused a 10-minute failure of all Castor services.
  • There were a couple of short breaks during database issues (overnight on 3rd/4th February, and around 10:30 on 4th February). Both were resolved by the database team.
  • On Thursday 4th February there was an update to the Castor Information Provider (CIP).

Current operational status and issues.

  • As reported last week, the Castor system is running with less resilience than hoped. However, the EMC disk arrays have better performance than the arrays temporarily in use up to then, although the EMC arrays are currently not on UPS power. Work is ongoing to investigate the cause of the lack of resilience, both by auditing the settings and by building a replica test system.
  • At 16:23 on Friday 5th February one of the transformers tripped, causing a loss of power to equipment on one circuit within the computer building. This included air-conditioning units in the LPD machine room, but Tier1 equipment did not trip off (other equipment in the machine room did go off). To minimize further system downtime and maintain the temperature in the LPD machine room, all doors were opened. At about 17:05 the transformer was switched to its backup feed and power was restored to the failed circuits. We await further investigation and a fix of the problem with the transformer or associated circuitry.
  • On Tuesday 9th February gdss294 (cmsFarmRead machine - d0t1) had a kernel panic and has been taken out of production.
  • On 31st December a recurrence of the Castor/Oracle 'BigID' problem was seen. This is still under investigation.
  • There is a problem with Castor disk-to-disk copies for LHCb from the LHCbUser Service Class. This is still under investigation.

Advance warning:

The following has been scheduled in the GOC DB:

  • The water pumps for the building air-conditioning are being replaced. The new pumps will be brought online on Thursday 11th February. An "At Risk" has been declared for the Tier1 for 24 hours from 9am Thursday.
  • At Risk for Castor for 2 hours from 12:00 on Thursday 11 February to replace one of the Oracle RAC nodes which has a known hardware fault.

The following items remain to be scheduled:

  • A final reconfiguration of the site BDII, connected with the improved resilience of the CIP (Castor Information Provider), remains to be carried out.
  • Investigations into the lack of resilience of the Castor Oracle infrastructure may produce a requirement for an intervention. In addition, the following changes to this system have not yet been carried out:
  • At Risk to increase the RAM in the remaining Oracle RAC nodes.
  • Adding the second CPU back into the UKLIGHT router.

Entries in GOC DB starting between 3rd and 10th February 2010.

Three UNSCHEDULED entries

  • The Castor Information Provider update (an At Risk) had been delayed as a result of last week's problems and was then done at short notice.
  • An At Risk was declared from Friday evening until Monday, following the failure of a transformer at the end of Friday afternoon.
  • The network intervention on 9th February overran.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole Site | UNSCHEDULED | OUTAGE | 09/02/2010 09:07 | 09/02/2010 09:45 | 38 minutes | Outage during work on network: upgrades to the router and resilience changes for the OPN link, followed by upgrades to the RAL site router. Internally the RAL Tier1 will continue functioning. Update: the router work has overrun and we are extending this downtime.
Whole Site | SCHEDULED | AT_RISK | 09/02/2010 08:45 | 09/02/2010 10:00 | 1 hour and 15 minutes | At Risk following planned network outage.
Whole Site | SCHEDULED | OUTAGE | 09/02/2010 07:15 | 09/02/2010 08:45 | 1 hour and 30 minutes | Outage during work on network: upgrades to the router and resilience changes for the OPN link, followed by upgrades to the RAL site router. Internally the RAL Tier1 will continue functioning.
FTS | SCHEDULED | OUTAGE | 09/02/2010 06:15 | 09/02/2010 07:15 | 1 hour | Drain of FTS ahead of scheduled network intervention.
Whole Site | UNSCHEDULED | AT_RISK | 05/02/2010 17:26 | 08/02/2010 13:00 | 2 days, 19 hours and 34 minutes | We have had a partial power failure at RAL which has affected the air conditioning to one of our machine rooms. At the moment power is restored by being re-routed. The site should be considered 'at risk' until the cause of this power failure can be understood and power returned to the normal supply.
All Castor | UNSCHEDULED | AT_RISK | 04/02/2010 13:00 | 04/02/2010 14:00 | 1 hour | At Risk for Castor services while the Castor Information Provider (CIP) is updated.
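
As a quick cross-check, the quoted durations follow directly from the start and end timestamps in the table above. A minimal Python sketch for the longest entry (the unscheduled At Risk following the transformer failure), assuming only the dd/mm/yyyy hh:mm format used in the table:

  from datetime import datetime

  # Timestamps taken from the GOC DB table above (format dd/mm/yyyy hh:mm).
  fmt = "%d/%m/%Y %H:%M"
  start = datetime.strptime("05/02/2010 17:26", fmt)
  end = datetime.strptime("08/02/2010 13:00", fmt)

  # timedelta stores whole days plus leftover seconds; split the seconds
  # into hours and minutes to match the wording used in the table.
  delta = end - start
  hours, rem = divmod(delta.seconds, 3600)
  minutes = rem // 60
  print(f"{delta.days} days, {hours} hours and {minutes} minutes")
  # -> 2 days, 19 hours and 34 minutes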