Tier1 Operations Report 2010-01-27


RAL Tier1 Operations Report for 27th January 2010.

Review of Issues during week 20th to 27th January 2010.

  • gdss70 (LHCbMDst - D1T1) was restored to production. This closes an issue in which we had been verifying checksums on two LHCb disk servers (gdss70, gdss79) that showed FSPROBE errors; no checksum discrepancies were found. (An illustrative sketch of this kind of check is given after this list.)
  • gdss148 (Babar) had its RAID controller card replaced after the array failed to rebuild. It was returned to service on 20th Jan.
  • gdss66 (CMSFarmRead) had been out of production for about a week following an FSPROBE error. Its memory has been replaced and it was returned to service on 21st Jan.
  • A decision was taken to intervene on the FTS node that runs the FTS 'agents'. As reported last week, this node had a bad partition on each of its disks. There was an outage of the FTS on Thursday 21st Jan to move the agents onto alternative hardware.
  • On Friday 22nd Jan the Atlas ScratchDisk became full. Three more disk servers were added to the Service Class during the day.
  • On Monday 25th January there was a planned intervention on the UPS - which passed off without incident.
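
As an illustration of the checksum verification mentioned in the first item above, the following is a minimal sketch. It assumes a plain manifest of "<md5>  <path>" lines; the script name, manifest format and the choice of MD5 are assumptions for illustration only, not the actual Tier1 tooling.

    import hashlib
    import sys

    def file_md5(path, chunk_size=1024 * 1024):
        # Read the file in chunks so large disk-server files need not fit in memory.
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def verify(manifest_path):
        # Each manifest line is expected to be "<md5>  <path>"; report any mismatch.
        mismatches = 0
        with open(manifest_path) as manifest:
            for line in manifest:
                if not line.strip():
                    continue
                expected, path = line.split(None, 1)
                path = path.strip()
                actual = file_md5(path)
                if actual != expected:
                    mismatches += 1
                    print("MISMATCH %s expected=%s actual=%s" % (path, expected, actual))
        print("%d checksum mismatch(es) found" % mismatches)

    if __name__ == "__main__":
        verify(sys.argv[1])  # e.g. python verify_checksums.py checksums.md5 (hypothetical names)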

Current operational status and issues.

  • From Sunday evening, 24th January, we started draining the batch queues ahead of this week's major intervention, which is now underway.
  • The rolling update to add extra RAM to the Oracle RAC nodes behind Castor has made progress. Six of the ten nodes have been done so far.
  • Long standing Database Disk array problem: The intervention on the UPS on Monday 25th January has somewhat reduced the electrical noise on the current provided by the UPS. We await further analysis of the outcome. As planned, the various databases are being migrated back to the disk arrays, which will initially be powered from non-UPS power.
  • On 31st December a recurrence of the Castor/Oracle 'BigID' problem was seen. This is still under investigation.
  • There is a problem with Castor disk-to-disk copies for LHCb from the LHCbUser Service Class. This is still under investigation at low priority.

Advance warning:

The overall approach is to try to get interventions done by the end of January and then run stably during February up to LHC start-up. However, some interventions have slipped outside this window.

  • Monday 1st February:
    • Migration of 3D services back to original disk arrays.
    • At Risk on Castor for modifications to Castor Information Provider (CIP).
  • During week starting 1st Feb: At Risk on Castor while memory is added to remaining nodes in the Oracle RAC back-end.
  • Tuesday 9th February: Between 07:00 and 10:00 there will be a network intervention that will, for a half-hour window within this period, break external connectivity to the Tier1. A break in the OPN link to CERN is also expected.

Entries in GOC DB starting between 20th and 27th January 2010.

  • The only unscheduled outage was to resolve a problem with faulty disks on the system hosting the FTS agents.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgic01, lcglb01, lcglb02, lcgmon01, lcgvo-alice, lcgvo-s3-04, lcgvo0425, lcgwms01, lcgwms02, lcgwms03 | SCHEDULED | AT_RISK | 27/01/2010 09:00 | 27/01/2010 17:00 | 8 hours | During this day (while Castor and Batch services are also down) kernel updates will be applied to these machines.
All Castor | SCHEDULED | OUTAGE | 27/01/2010 08:00 | 28/01/2010 17:00 | 1 day, 9 hours | Castor services down during migration of databases to another disk array, plus checking of disk servers and kernel updates.
LFC, LFC-Atlas | SCHEDULED | OUTAGE | 27/01/2010 08:00 | 27/01/2010 19:00 | 11 hours | LFC unavailable while the database is migrated to another disk array.
FTS, FTM | SCHEDULED | OUTAGE | 27/01/2010 07:00 | 27/01/2010 19:00 | 12 hours | Outage of FTS while its back-end database is migrated to a different disk array.
Whole site | SCHEDULED | AT_RISK | 25/01/2010 08:00 | 25/01/2010 19:00 | 11 hours | At Risk during work to reduce electrical noise from the UPS.
All CEs (batch) | SCHEDULED | OUTAGE | 24/01/2010 20:00 | 28/01/2010 17:00 | 3 days, 21 hours | Batch system drained ahead of the intervention on Castor and LFC. The batch engine will be upgraded and kernel updates applied to worker nodes.
All Castor | SCHEDULED | AT_RISK | 21/01/2010 14:30 | 21/01/2010 15:30 | 1 hour | At Risk to Castor during updates to the Castor Information Provider (CIP), including changes to improve its resilience.
FTS, FTM | UNSCHEDULED | OUTAGE | 21/01/2010 07:00 | 21/01/2010 09:00 | 2 hours | Intervention to resolve a problem with disk drives in the node containing the FTS agents. The first hour will be a drain of transfers, followed by the hardware fix.
Castor Atlas instance | SCHEDULED | AT_RISK | 20/01/2010 09:00 | 20/01/2010 10:00 | 1 hour | At Risk during update of Castor SRM to version 2.1.8-17.
Castor CMS, LHCb & GEN instances | SCHEDULED | AT_RISK | 19/01/2010 09:00 | 20/01/2010 12:00 | 1 day, 3 hours | At Risk during update of Castor SRM to version 2.1.8-17.
All Castor | SCHEDULED | AT_RISK | 18/01/2010 09:00 | 22/01/2010 16:00 | 4 days, 7 hours | At Risk on Castor while memory is added to nodes in the Oracle RAC back-end. This will be done node by node and services will fail over to other RAC nodes as each is upgraded.