Tier1 Operations Report 2010-02-10

RAL Tier1 Operations Report for 10th February 2010.

Review of Issues during the week 3rd to 10th February 2010.

Following the restoration of services last Tuesday (2nd February), services have been largely stable.

  • The network intervention on Tuesday 9th February overran by an hour. The problems were with the OPN router, which has two CPU cards whereas most of the other routers have only one; having two CPU cards changes the upgrade procedure. In order to carry out the upgrade, one of the CPU cards was removed from the OPN router. This card can be upgraded offline but will need another outage to re-install it. Batch work was paused during the scheduled intervention period, but we did not manage to extend the pause to cover the extra hour.
  • Friday 5th February: A node in the Neptune Oracle RAC rebooted. Neptune is set up to delay cluster reconfiguration for 10 minutes after such an event, and while this delay is in progress the database does not serve requests; the reconfiguration therefore caused a 10-minute failure of all Castor services.
  • There were a couple of short breaks during database issues (overnight on 3rd/4th February, and around 10:30 on 4th February). Both were resolved by the database team.
  • On Thursday 4th February there was an update to the Castor Information Provider (CIP).

Current operational status and issues.

  • As reported last week, the Castor system is running with less resilience than hoped. However, the EMC disk arrays have better performance than the arrays temporarily in use up to then, although the EMC arrays are currently not on UPS power. Work is ongoing to investigate the cause of the lack of resilience, both by auditing the settings and by building a replica test system.
  • At 16:23 on Friday 5th February one of the transformers tripped, causing a loss of power to equipment on one circuit within the computer building. This included air-conditioning units in the LPD machine room, but Tier1 equipment did not trip off (other equipment in the machine room did go off). To minimize further system downtime and maintain the temperature in the LPD machine room, all doors were opened. At about 17:05 the transformer was switched to its backup feed and power was restored to the failed circuits. We await further investigation and a fix of the problem with the transformer or associated circuitry.
  • On Tuesday 9th February gdss294 (cmsFarmRead machine - d0t1) had a kernel panic and has been taken out of production.
  • On 31st December a recurrence of the Castor/Oracle 'BigID' problem was seen. This is still under investigation.
  • There is a problem with Castor disk-to-disk copies for LHCb from the LHCbUser Service Class. This is still under investigation.

Advance warning:

The following has been scheduled in the GOC DB:

  • The water pumps for the building air-conditioning are being replaced. The new pumps will be brought online on Thursday 11th February. An "At Risk" has been declared for the Tier1 for 24 hours from 9am Thursday.
  • At Risk for Castor for 2 hours from 12:00 on Thursday 11 February to replace one of the Oracle RAC nodes which has a known hardware fault.

The following items remain to be scheduled:

  • A final reconfiguration of the site BDII, connected with the improved resilience of the CIP (Castor Information Provider), remains to be carried out.
  • Investigations into the lack of resilience of the Castor Oracle infrastructure may produce a requirement for an intervention. In addition, the following changes to this system have not yet been carried out:
  • At Risk to increase the RAM in the remaining Oracle RAC nodes.
  • Adding the second CPU back into the UKLIGHT router.

Entries in GOC DB starting between 3rd and 10th February 2010.

Three UNSCHEDULED entries

  • The Castor Information Provider update (an At Risk) had been delayed as a result of last week's problems and was then done at short notice.
  • An At Risk was declared from Friday evening until Monday, following the failure of a transformer at the end of Friday afternoon.
  • The network intervention on 9th February overran.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole Site | UNSCHEDULED | OUTAGE | 09/02/2010 09:07 | 09/02/2010 09:45 | 38 minutes | Outage during work on network: upgrades to the router and resilience changes for the OPN link, followed by upgrades to the RAL site router. Internally the RAL Tier1 will continue functioning. Update: the router work has overrun and we are extending this downtime.
Whole Site | SCHEDULED | AT_RISK | 09/02/2010 08:45 | 09/02/2010 10:00 | 1 hour and 15 minutes | At Risk following planned network outage.
Whole Site | SCHEDULED | OUTAGE | 09/02/2010 07:15 | 09/02/2010 08:45 | 1 hour and 30 minutes | Outage during work on network: upgrades to the router and resilience changes for the OPN link, followed by upgrades to the RAL site router. Internally the RAL Tier1 will continue functioning.
FTS | SCHEDULED | OUTAGE | 09/02/2010 06:15 | 09/02/2010 07:15 | 1 hour | Drain of FTS ahead of scheduled network intervention.
Whole Site | UNSCHEDULED | AT_RISK | 05/02/2010 17:26 | 08/02/2010 13:00 | 2 days, 19 hours and 34 minutes | We have had a partial power failure at RAL which has affected the air conditioning to one of our machine rooms. At the moment power is restored by being re-routed. The site should be considered 'at risk' until the cause of this power failure can be understood and power returned to the normal supply.
All Castor | UNSCHEDULED | AT_RISK | 04/02/2010 13:00 | 04/02/2010 14:00 | 1 hour | At Risk for Castor services while the Castor Information Provider (CIP) is updated.
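
As a quick cross-check, the quoted durations follow directly from the start and end timestamps in the table above. A minimal Python sketch for the longest entry (the unscheduled At Risk following the transformer failure), assuming only the dd/mm/yyyy hh:mm format used in the table:

  from datetime import datetime

  # Timestamps taken from the GOC DB table above (format dd/mm/yyyy hh:mm).
  fmt = "%d/%m/%Y %H:%M"
  start = datetime.strptime("05/02/2010 17:26", fmt)
  end = datetime.strptime("08/02/2010 13:00", fmt)

  # timedelta stores whole days plus leftover seconds; split the seconds
  # into hours and minutes to match the wording used in the table.
  delta = end - start
  hours, rem = divmod(delta.seconds, 3600)
  minutes = rem // 60
  print(f"{delta.days} days, {hours} hours and {minutes} minutes")
  # -> 2 days, 19 hours and 34 minutes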