Tier1 Operations Report 2010-01-20

RAL Tier1 Operations Report for 20th January 2010.

Review of Issues during week 13th to 20th January 2010.

  • A problem with the database behind the FTS service caused an outage (see GOC DB list below) on Friday 15th. There was also a minor recurrence on Monday (18th).
  • Disk server gdss380 (LHCbMDst) failed in the evening of Wednesday 13th. Two faulty drives were found and replaced, but the server was not back in service until Monday 18th.
  • On Monday (18th) gdss160 (LHCbDst) was taken out of production due to problems rebuilding after a disk failure. It was returned to production the following morning.

Current operational status and issues.

  • gdss148 (babar) is having its RAID controller card replaced following a failure to rebuild.
  • gdss66 (CMSFarmRead) has been out of production for about a week following an FSPROBE error. It has had its memory replaced and is undergoing final checks before going back into production.
  • On Tuesday (19th) a key FTS node was accidentally rebooted. By chance this revealed a problem with the mirrored disks on that system: there is a bad partition on each of the disks. The system is still running OK and the problem will be resolved during next week's outage.
  • FSPROBE errors were reported on gdss79 (LHCbDst) and gdss70 (LHCbMDst - D1T1). LHCb have responded stating that the discrepancies between the checksums they provided and our calculated ones can be explained (see the illustrative checksum sketch after this list). Following this gdss79 was returned to service. However, gdss70 has shown further FSPROBE errors and the hardware is still under investigation.
  • Long-standing database disk array problem: following the successful test of the UPS bypass on 5th January we have scheduled the migration of the databases back to their original disk arrays. This will initially be to non-UPS power; once the UPS problems are resolved the disk arrays will be moved back to UPS power. (See the advanced warning section below.)
  • On 31st December a recurrence of the Castor/Oracle 'BigID' problem was seen. This is still under investigation.
  • There is a problem with Castor disk-to-disk copies for LHCb from the LHCbUser service class. This is still under investigation, with low priority.
  • A configuration issue on the CREAM CE (lcgce01) caused problems for CMS Monte Carlo production jobs. This will be fixed during next week's outage.
  • Issues with the WMSs reported over the holiday period are awaiting the installation of a patch. This will also be applied during next week's outage.
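The checksum comparison referred to in the FSPROBE item above is simply a check of a locally calculated file checksum against the value supplied by the experiment. The snippet below is a minimal illustrative sketch only, assuming an Adler32 checksum (widely used for grid file transfers); the file path and catalogue value are made up and do not come from the Tier1 tooling.

    import zlib

    def adler32_of_file(path, chunk_size=1 << 20):
        """Compute the Adler32 checksum of a file, reading it in 1 MB chunks."""
        checksum = 1  # Adler32 is seeded with 1
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                checksum = zlib.adler32(chunk, checksum)
        return format(checksum & 0xFFFFFFFF, "08x")

    # Hypothetical comparison: both values below are illustrative, not real data.
    expected = "0a1b2c3d"                              # checksum supplied by the experiment
    calculated = adler32_of_file("/path/to/replica")   # local replica on the disk server
    if calculated != expected:
        print("Checksum mismatch: expected %s, calculated %s" % (expected, calculated))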

Advanced warning:

The overall approach is to complete the interventions by the end of January and then aim for stable running during February, up to LHC start-up.

  • At Risk on Castor while memory is added to nodes in the Oracle RAC back-end. This is being done node by node and services will fail over to other RAC nodes as each is upgraded.
  • Thursday 21st January. "At Risk" for Castor Information Provider (CIP) Upgrade.
  • Monday 25th January. Migration of 3D services back to original disk arrays. Also At Risk during the installation of a longer power cable into the R89 UPS.
  • Wednesday/Thursday 27th/28th January, preceded by a farm drain. This intervention was delayed by just over a week from the date discussed at last week's meeting. As before, this is a two-day outage to carry out a significant amount of work, including:
    • Migrating Oracle databases for Castor, LFC & FTS back to their original disk arrays.
    • Running FSCK on all disk servers and updating their kernels.
    • Updating the batch engine to 64-bit (requires a farm drain) and applying kernel updates to worker nodes.
    • Various other updates to the CEs and WMSs, plus some network reconfiguration.
  • Tuesday 9th February. Between 07:00 and 10:00 there will be a network intervention that will, for a half-hour window within this time, break external connectivity to the Tier1. A break on the OPN link to CERN is also expected.

Entries in GOC DB starting between 13th and 20th January 2010.

  • The scheduled At Risk for the UPS maintenance on 14th January had to be extended.
  • The only unscheduled outage was for the FTS problem.


Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Castor Atlas | SCHEDULED | AT_RISK | 20/01/2010 09:00 | 20/01/2010 10:00 | 1 hour | At Risk during update of Castor SRM to version 2.1.8-17.
Castor GEN, CMS, LHCB | SCHEDULED | AT_RISK | 19/01/2010 09:00 | 20/01/2010 12:00 | 1 day, 3 hours | At Risk during update of Castor SRM to version 2.1.8-17.
All Castor | SCHEDULED | AT_RISK | 18/01/2010 09:00 | 22/01/2010 16:00 | 4 days, 7 hours | At Risk on Castor while memory is added to nodes in the Oracle RAC back-end. This will be done node by node and services will fail over to other RAC nodes as each is upgraded.
fts | UNSCHEDULED | OUTAGE | 15/01/2010 12:34 | 15/01/2010 16:00 | 3 hours and 26 minutes | There is a problem on the FTS that is under investigation.
lcgwms03 | SCHEDULED | AT_RISK | 14/01/2010 14:30 | 14/01/2010 16:30 | 2 hours | glite-WMS update (fix for ICE bug).
Whole Site | UNSCHEDULED | AT_RISK | 14/01/2010 12:27 | 14/01/2010 15:00 | 2 hours and 33 minutes | Maintenance work on the UPS is taking a lot longer than expected. This is an extension to the At Risk period for this work.
Whole Site | SCHEDULED | AT_RISK | 14/01/2010 10:00 | 14/01/2010 12:00 | 2 hours | At Risk during maintenance on UPS.
lhcb-lfc.gridpp, lugh | SCHEDULED | AT_RISK | 13/01/2010 11:00 | 13/01/2010 15:00 | 4 hours | An engineer is coming to investigate memory errors on one (of a pair) of the Oracle RAC nodes behind this service.
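
For reference, the Duration column above is just the difference between the Start and End timestamps. A minimal sketch of that arithmetic, using the DD/MM/YYYY HH:MM format as it appears in the table (the two example rows are copied from the table above):

    from datetime import datetime

    GOCDB_FORMAT = "%d/%m/%Y %H:%M"

    def duration(start, end):
        """Return the elapsed time between two GOC DB style timestamps."""
        return datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)

    print(duration("15/01/2010 12:34", "15/01/2010 16:00"))  # 3:26:00  (FTS outage)
    print(duration("18/01/2010 09:00", "22/01/2010 16:00"))  # 4 days, 7:00:00  (Castor At Risk)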