Tier1 Operations Report 2010-01-27


RAL Tier1 Operations Report for 27th January 2010.

Review of Issues during week 20th to 27th January 2010.

  • gdss70 (LHCbMDst - D1T1) was restored to production. This closes an issue in which we had been verifying checksums on two LHCb disk servers (gdss70, gdss79) that showed FSPROBE errors; no checksum discrepancies were found. (An illustrative sketch of this kind of check is given after this list.)
  • gdss148 (Babar) had its RAID controller card replaced after the array failed to rebuild. It was returned to service on 20th Jan.
  • gdss66 (CMSFarmRead) had been out of production for about a week following an FSPROBE error. Its memory has been replaced and it was returned to service on 21st Jan.
  • A decision was taken to intervene on the FTS node that runs the FTS 'agents'. As reported last week, this node had a bad partition on each of its disks. There was an outage of the FTS on Thursday 21st Jan to move the agents onto alternative hardware.
  • On Friday 22nd Jan the Atlas ScratchDisk became full. Three more disk servers were added to the Service Class during the day.
  • On Monday 25th January there was a planned intervention on the UPS - which passed off without incident.
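
As an illustration of the checksum verification mentioned in the first item above, the following is a minimal sketch. It assumes a plain manifest of "<md5>  <path>" lines; the script name, manifest format and the choice of MD5 are assumptions for illustration only, not the actual Tier1 tooling.

    import hashlib
    import sys

    def file_md5(path, chunk_size=1024 * 1024):
        # Read the file in chunks so large disk-server files need not fit in memory.
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def verify(manifest_path):
        # Each manifest line is expected to be "<md5>  <path>"; report any mismatch.
        mismatches = 0
        with open(manifest_path) as manifest:
            for line in manifest:
                if not line.strip():
                    continue
                expected, path = line.split(None, 1)
                path = path.strip()
                actual = file_md5(path)
                if actual != expected:
                    mismatches += 1
                    print("MISMATCH %s expected=%s actual=%s" % (path, expected, actual))
        print("%d checksum mismatch(es) found" % mismatches)

    if __name__ == "__main__":
        verify(sys.argv[1])  # e.g. python verify_checksums.py checksums.md5 (hypothetical names)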

Current operational status and issues.

  • From Sunday evening, 24th January, we started draining the batch queues ahead of this week's major intervention, which is now underway.
  • The rolling update to add extra RAM to the Oracle RAC nodes behind Castor has made progress. Six of the ten nodes have been done so far.
  • Long standing Database Disk array problem: The intervention on the UPS on Monday 25th January has somewhat reduced the electrical noise on the current provided by the UPS. We await further analysis of the outcome. As planned, the various databases are being migrated back to the disk arrays, which will initially be powered from non-UPS power.
  • On 31st December a recurrence of the Castor/Oracle 'BigID' problem was seen. This is still under investigation.
  • There is a problem with Castor disk-to-disk copies for LHCb from the LHCbUser Service Class. This is still under investigation at low priority.

Advance warning:

The overall approach is to try to get interventions done by the end of January and then run stably during February up to LHC start-up. However, some interventions have slipped outside this window.

  • Monday 1st February:
    • Migration of 3D services back to original disk arrays.
    • At Risk on Castor for modifications to Castor Information Provider (CIP).
  • During week starting 1st Feb: At Risk on Castor while memory is added to remaining nodes in the Oracle RAC back-end.
  • Tuesday 9th February: Between 07:00 and 10:00 there will be a network intervention that will, for a half-hour window within this period, break external connectivity to the Tier1. A break in the OPN link to CERN is also expected.

Entries in GOC DB starting between 20th and 27th January 2010.

  • The only unscheduled outage was to resolve a problem with faulty disks on the system hosting the FTS agents.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgic01, lcglb01, lcglb02, lcgmon01, lcgvo-alice, lcgvo-s3-04, lcgvo0425, lcgwms01, lcgwms02, lcgwms03 | SCHEDULED | AT_RISK | 27/01/2010 09:00 | 27/01/2010 17:00 | 8 hours | During this day (while Castor and Batch services are also down) kernel updates will be applied to these machines.
All Castor | SCHEDULED | OUTAGE | 27/01/2010 08:00 | 28/01/2010 17:00 | 1 day, 9 hours | Castor services down during migration of databases to another disk array, plus checking of disk servers and kernel updates.
LFC, LFC-Atlas | SCHEDULED | OUTAGE | 27/01/2010 08:00 | 27/01/2010 19:00 | 11 hours | LFC unavailable while the database is migrated to another disk array.
FTS, FTM | SCHEDULED | OUTAGE | 27/01/2010 07:00 | 27/01/2010 19:00 | 12 hours | Outage of FTS while its back-end database is migrated to a different disk array.
Whole site | SCHEDULED | AT_RISK | 25/01/2010 08:00 | 25/01/2010 19:00 | 11 hours | At Risk during work to reduce electrical noise from the UPS.
All CEs (batch) | SCHEDULED | OUTAGE | 24/01/2010 20:00 | 28/01/2010 17:00 | 3 days, 21 hours | Batch system drained ahead of the intervention on Castor and LFC. The batch engine will be upgraded and kernel updates applied to worker nodes.
All Castor | SCHEDULED | AT_RISK | 21/01/2010 14:30 | 21/01/2010 15:30 | 1 hour | At Risk to Castor during updates to the Castor Information Provider (CIP), including changes to improve its resilience.
FTS, FTM | UNSCHEDULED | OUTAGE | 21/01/2010 07:00 | 21/01/2010 09:00 | 2 hours | Intervention to resolve a problem with disk drives in the node containing the FTS agents. The first hour will be a drain of transfers, followed by the hardware fix.
Castor Atlas instance | SCHEDULED | AT_RISK | 20/01/2010 09:00 | 20/01/2010 10:00 | 1 hour | At Risk during update of Castor SRM to version 2.1.8-17.
Castor CMS, LHCb & GEN instances | SCHEDULED | AT_RISK | 19/01/2010 09:00 | 20/01/2010 12:00 | 1 day, 3 hours | At Risk during update of Castor SRM to version 2.1.8-17.
All Castor | SCHEDULED | AT_RISK | 18/01/2010 09:00 | 22/01/2010 16:00 | 4 days, 7 hours | At Risk on Castor while memory is added to nodes in the Oracle RAC back-end. This will be done node by node and services will fail over to other RAC nodes as each is upgraded.