Tier1 Operations Report 2010-01-27
From GridPP Wiki
RAL Tier1 Operations Report for 27th January 2010.
Review of Issues during week 20th to 27th January 2010.
- gdss70 (LHCbMDst - D1T1) was restored to production. This closes off an issue in which we had been verifying checksums on two LHCb disk servers (gdss70, gdss79) that showed FSPROBE errors. No checksum discrepancies were found.
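The checksum verification carried out on these disk servers amounts to recomputing each file's checksum on disk and comparing it against the value recorded in the catalogue. A minimal sketch of that comparison, assuming adler32 checksums (the algorithm Castor commonly records; the function names and paths here are illustrative, not the actual verification script):

```python
import zlib

def file_adler32(path, chunk_size=1 << 20):
    """Compute the adler32 checksum of a file, reading in 1 MB chunks."""
    checksum = 1  # adler32 starts from 1, not 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            checksum = zlib.adler32(chunk, checksum)
    return checksum & 0xFFFFFFFF

def matches_catalogue(path, expected_hex):
    """True if the on-disk checksum agrees with the catalogue value."""
    return file_adler32(path) == int(expected_hex, 16)
```

A run over every file on the server with no `matches_catalogue` failures corresponds to the "no checksum discrepancies were found" outcome reported above.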
- gdss148 (Babar) had its RAID card controller replaced following a failure to rebuild. Returned to service on 20th Jan.
- gdss66 (CMSFarmRead) had been out of production for about a week following an FSPROBE error. It has had its memory replaced and was returned to service on 21st Jan.
- A decision was taken to intervene on the FTS node that runs the FTS 'agents'. As reported last week this had a bad partition on each of the disks. There was an outage to the FTS on Thursday 21st Jan to move the agents to run on alternative hardware.
- On Friday 22nd Jan the Atlas ScratchDisk had become full. Three more disk servers were added to the Service Class during the day.
- On Monday 25th January there was a planned intervention on the UPS - which passed off without incident.
Current operational status and issues.
- From Sunday evening, 24th January, we started draining the batch queues ahead of this week's major intervention which is now underway.
- The rolling update to add extra RAM to the Oracle RAC nodes behind Castor has made progress. Six out of ten nodes have been done so far.
- Long-standing Database Disk array problem: The intervention on the UPS on Monday 25th January has somewhat reduced the electrical noise on the current supplied by the UPS. We await further analysis of the outcome. As planned, the various databases are being migrated back to the disk arrays, which will initially run on non-UPS power.
- On 31st December a recurrence of the Castor/Oracle 'BigID' problem was seen. This is still under investigation.
- There is a problem with Castor disk-to-disk copies for LHCb from the LHCbUser Service Class. This is still under investigation, at low priority.
Advanced warning:
The overall approach is to try to get interventions done by the end of January and aim for stable running during February up to LHC start-up. However, some interventions have slipped outside this time window.
- Monday 1st February:
- Migration of 3D services back to original disk arrays.
- At Risk on Castor for modifications to Castor Information Provider (CIP).
- During week starting 1st Feb: At Risk on Castor while memory is added to remaining nodes in the Oracle RAC back-end.
- Tuesday 9th February: Between 07:00 and 10:00 there will be a network intervention that will, for a half-hour window within this time, break external connectivity to the Tier1. We are also expecting a break on the OPN link to CERN.
Entries in GOC DB starting between 20th and 27th January 2010.
- The only unscheduled outage was to resolve a problem with faulty disks on the system hosting the FTS agents.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
lcgic01, lcglb01, lcglb02, lcgmon01, lcgvo-alice, lcgvo-s3-04, lcgvo0425, lcgwms01, lcgwms02, lcgwms03 | SCHEDULED | AT_RISK | 27/01/2010 09:00 | 27/01/2010 17:00 | 8 hours | During this day (while Castor and Batch services are also down) kernel updates will be applied to these machines. |
All Castor | SCHEDULED | OUTAGE | 27/01/2010 08:00 | 28/01/2010 17:00 | 1 day, 9 hours | Castor services down during migration of databases to another disk array plus checking disk servers and kernel updates. |
LFC, LFC-Atlas | SCHEDULED | OUTAGE | 27/01/2010 08:00 | 27/01/2010 19:00 | 11 hours | LFC unavailable while the database is migrated to another disk array. |
FTS, FTM | SCHEDULED | OUTAGE | 27/01/2010 07:00 | 27/01/2010 19:00 | 12 hours | Outage of FTS while its back end database is migrated to a different disk array. |
Whole site | SCHEDULED | AT_RISK | 25/01/2010 08:00 | 25/01/2010 19:00 | 11 hours | At Risk during work to reduce electrical noise from UPS. |
All CEs (batch) | SCHEDULED | OUTAGE | 24/01/2010 20:00 | 28/01/2010 17:00 | 3 days, 21 hours | Batch system drained ahead of the intervention on Castor and LFC. The batch engine will be upgraded and kernel updates applied to worker nodes. |
All Castor | SCHEDULED | AT_RISK | 21/01/2010 14:30 | 21/01/2010 15:30 | 1 hour | At Risk to Castor during updates to the Castor Information Provider (CIP) including changes to improve its resilience. |
FTS, FTM | UNSCHEDULED | OUTAGE | 21/01/2010 07:00 | 21/01/2010 09:00 | 2 hours | Intervention to resolve a problem with disk drives in the node containing the FTS agents. The first hour will be a drain of the transfers, followed by the hardware fix. |
Castor Atlas instance | SCHEDULED | AT_RISK | 20/01/2010 09:00 | 20/01/2010 10:00 | 1 hour | At Risk during update of Castor SRM to version 2.1.8-17. |
Castor CMS, LHCb & GEN instances | SCHEDULED | AT_RISK | 19/01/2010 09:00 | 20/01/2010 12:00 | 1 day, 3 hours | At Risk during update of Castor SRM to version 2.1.8-17. |
All Castor | SCHEDULED | AT_RISK | 18/01/2010 09:00 | 22/01/2010 16:00 | 4 days, 7 hours | At Risk on Castor while memory is added to nodes in the Oracle RAC back-end. This will be done node-by-node and services will fail-over to other RAC nodes as each is upgraded. |