Tier1 Operations Report 2010-02-03

RAL Tier1 Operations Report for 3rd February 2010.

Review of Issues during the week 27th January to 3rd February 2010.

During the last week there has been a significant outage to services. A large amount of work was planned to take place on Wednesday/Thursday 27/28 January. One crucial part of this, the migration of the Castor Oracle databases back to the EMC disk arrays, encountered significant problems. The Oracle RAC / Storage Area Network / disk array system showed significant instabilities, despite being essentially the same configuration as was in use before last October. In order to achieve stability the multipath features of the SAN have been disabled, and in some cases the disk arrays are not expected to fail over automatically. As a result Castor services were not restored until Tuesday 2nd February. Batch services were restarted shortly after Castor was available.

Work done during this outage included:

  • Migration of all Oracle databases (Castor, LFC, FTS, 3D) back to EMC disk arrays. The migrations for LFC & FTS went well. That for 3D took longer than expected but otherwise worked OK. As referred to above the Castor database migration was problematic.
  • Upgrades to run 64-bit version of batch scheduler.
  • 'FSCK' checks across all disk servers (including non-Castor disk servers).
  • Kernel updates across many systems including disk servers and worker nodes.
  • Updates to fix problems in the WMSs and the CREAM CE.

Current operational status and issues.

  • All systems are up (Wednesday morning, 3rd February). Work is ongoing to try and understand the causes of the instabilities that led to the delays in getting Castor services back last week. The Castor system is running with less resilience than hoped. However, the EMC disk arrays have better performance than the arrays temporarily in use up to then. These arrays are currently not on UPS power. A small amount of the planned work has not been completed. This includes increasing the memory in all the Castor Oracle RAC nodes (four out of ten nodes still to do), the Castor Information Provider (CIP) updates, and replacing a database node which occasionally displays hardware issues.
  • About a dozen corrupt files were found on the GDSS66 (cmsFarmRead) disk server. These date from 18th January 2010 and correlate with an 'FSPROBE' error that led to memory on the system being replaced. The corruption was detected because the process that writes the files to tape verifies checksums where it can (which is the case if the file was originally written using RFIO); see the sketch after this list.
  • On 31st December a recurrence of the Castor/Oracle 'BigID' problem was seen. This is still under investigation.
  • There is a problem with Castor disk-to-disk copies for LHCb from the LHCbUser Service Class. This is still under investigation at low priority.
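
As an illustration of the checksum verification mentioned above, the following is a minimal sketch in Python. It assumes adler32 checksums for files written via RFIO; the file path and the recorded value are hypothetical examples, not details of the actual incident or of the production tape-migration process.

  # Minimal illustrative sketch (not the production tool): verify a file's
  # adler32 checksum against the value recorded when it was written.
  # The path and recorded checksum below are hypothetical examples.
  import zlib

  def adler32_of(path, chunk_size=1024 * 1024):
      """Compute the adler32 checksum of a file, reading it in chunks."""
      checksum = 1  # adler32 is defined to start from 1
      with open(path, "rb") as f:
          while True:
              chunk = f.read(chunk_size)
              if not chunk:
                  break
              checksum = zlib.adler32(chunk, checksum)
      return checksum & 0xFFFFFFFF  # normalise to an unsigned 32-bit value

  # Hypothetical disk-server file and the checksum recorded at write time.
  path = "/exportstage/example.root"
  recorded = 0x1A2B3C4D

  if adler32_of(path) != recorded:
      print("Checksum mismatch - %s appears corrupt" % path)

In practice the reference checksum would be looked up from the storage system's metadata rather than hard-coded.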

Advanced warning:

  • T.B.C. At Risk on Castor for modifications to Castor Information Provider (CIP). This work was delayed owing to the problems last week.
  • Tuesday 9th February. Network interventions. Outage declared between 07:15 and 08:45 with At Risk from then until 10:00. These include updates to the 'UKLIGHT' router which affects the OPN link, changes that JANET need to make to improve resilience (including some resulting from the addition of the fail-over OPN link to CERN), and updates to the RAL Site Access Router.

Entries in GOC DB starting between 27th January and 3rd February 2010.

Summary of UNSCHEDULED outages

  • One Alice VO box rebuilt after it failed to restart following planned updates.
  • Long extension to outage for Castor (and therefore batch) following planned intervention. There were three separate extensions. One marked as 'scheduled'.
  • Extension to At Risk for 3D. (Note: this was incorrectly flagged as an 'outage' rather than an 'At Risk' when the migration took longer to complete.)
Service Scheduled? Outage/At Risk Start End Duration Reason
3D (OGMA, LUGH,lhcb-lfc) UNSCHEDULED OUTAGE 02/02/2010 12:00 02/02/2010 17:18 5 hours and 18 minutes The 3D services are up but the migration for OGMA (Atlas 3D) is taking longer than expected. For LUGH & LHCb-LFC a resilience test led to a requirement to re-synchronise one of the disk partitions. At Risk extended for 24 hours to cover this work.
3D (OGMA, LUGH,lhcb-lfc) SCHEDULED AT_RISK 01/02/2010 17:00 02/02/2010 12:00 19 hours At Risk following migration of 3D Oracle databases back to original disk arrays - including final testing of production configuration.
3D (OGMA, LUGH,lhcb-lfc) SCHEDULED OUTAGE 01/02/2010 16:00 01/02/2010 17:00 1 hour Outage to complete migration of 3D Oracle databases back to original disk arrays.
All Castor & CEs. SCHEDULED OUTAGE 01/02/2010 13:50 02/02/2010 17:18 1 day, 3 hours and 28 minutes Due to further problems, we have to extend the Castor downtime until Wednesday 14:00. We apologize for any inconvenience this will cause.
All Castor and Site-BDII SCHEDULED AT_RISK 01/02/2010 10:30 01/02/2010 11:30 1 hour At Risk during upgrade to the Castor Information Provider (CIP) that publishes storage information. This is to both update and improve resilience.
lcglb01 SCHEDULED AT_RISK 01/02/2010 10:00 01/02/2010 12:00 2 hours RAID SW failure; disk to be replaced
3D (OGMA, LUGH,lhcb-lfc) SCHEDULED AT_RISK 01/02/2010 08:30 01/02/2010 16:00 7 hours and 30 minutes First part of migration of 3D Oracle databases back to original disk arrays. (Followed by short outage.)
All Castor & CEs. UNSCHEDULED OUTAGE 29/01/2010 13:27 01/02/2010 14:00 3 days, 33 minutes A problem has been encountered in the work behind Castor (migrating databases to different hardware). As a consequence of this we are now extending the outage on Castor and batch. We hope Atlas and LHCb will be back at 14:00 GMT on 01/02/2010. The other Castor services may take longer.
All Castor & CEs. UNSCHEDULED OUTAGE 28/01/2010 17:00 29/01/2010 14:00 21 hours A problem has been encountered in the work behind Castor (migrating databases to different hardware). As a consequence of this we are now extending the outage on Castor and batch.
lfc-atlas.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk, lhcb-lfc.gridpp.rl.ac.uk, SCHEDULED AT_RISK 28/01/2010 10:00 28/01/2010 12:00 2 hours At Risk during application of kernel patches to front-end machines.
lcgvo-s3-04.gridpp.rl.ac.uk, UNSCHEDULED OUTAGE 27/01/2010 14:22 27/01/2010 16:00 1 hour and 38 minutes This machine failed to properly reboot following updates, so it needs a re-build
lcgic01.gridpp.rl.ac.uk, lcglb01.gridpp.rl.ac.uk, lcglb02.gridpp.rl.ac.uk, lcgmon01.gridpp.rl.ac.uk, lcgvo-alice.gridpp.rl.ac.uk, lcgvo-s3-04.gridpp.rl.ac.uk, lcgvo0425.gridpp.rl.ac.uk, lcgwms01.gridpp.rl.ac.uk, lcgwms02.gridpp.rl.ac.uk, lcgwms03.gridpp.rl.ac.uk, SCHEDULED AT_RISK 27/01/2010 09:00 27/01/2010 17:00 8 hours During this day (while Castor and Batch services are also down) kernel updates will be applied to these machines.
All Castor SCHEDULED OUTAGE 27/01/2010 08:00 28/01/2010 17:00 1 day, 9 hours Castor services down during migration of databases to another disk array, plus checking of disk servers and kernel updates.
LFC, LFC-Atlas SCHEDULED OUTAGE 27/01/2010 08:00 27/01/2010 18:12 10 hours and 12 minutes LFC unavailable while the database is migrated to another disk array.
FTS, FTM SCHEDULED OUTAGE 27/01/2010 07:00 27/01/2010 18:13 11 hours and 13 minutes Outage of FTS while its back end database is migrated to a different disk array.
All CEs SCHEDULED OUTAGE 24/01/2010 20:00 28/01/2010 17:00 3 days, 21 hours Batch system drained ahead of intervention on Castor and LFC. The batch engine will be upgraded and kernel updates applied to worker nodes.