Tier1 Operations Report 2009-10-07

RAL Tier1 Operations Report for 7th October 2009.

This is a review of issues since the last meeting on 30th September.

Current operational status and issues.

  • Major problems with the hardware underneath both the Castor Oracle databases and the databases used to support the LFC, FTS and 3D services. All of these services are currently unavailable. This is a drastic extension of a problem that started on 10th September but had been contained to just the mirrored set of disks under the Castor databases. On Sunday 4th October the primary disk subsystem for the Castor databases failed; as the mirrored set was not in place owing to the ongoing failures, this halted the Castor services completely. Then, around midday on Tuesday 6th October, both the disk subsystems (primary and mirror) hosting the LFC, FTS and 3D databases failed within a few minutes of each other. We are currently restoring most of these databases to alternative hardware so as to bring services back while we investigate the problems with the newer disk systems. (A simple endpoint reachability probe is sketched after this list.)
  • Problem seen for Atlas where reads from a particular service class are triggering Disk-to-Disk copies before delivering the data. This has resulted in a significantly increased number of timeouts.
  • Problem for LHCb with contention for access to files in a particular service class.
  • One of the nodes in the Oracle RAC that supports Atlas and LHCb has a problem and keeps rebooting. This node forms part of the voting system within the RAC and it will be necessary to have a downtime to re-configure a different node in its place.
  • The patched version of the SRM (2.8-1) has been installed for srm-atlas. Installation on the other SRMs is awaited (delayed by the database hardware problems).
  • Swine ’Flu. As previously reported: we continue to track this and ensure preparations are in place should a significant portion of our staff be away or have to work offsite.
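
With several front-end services down at once, a quick way to track recovery is to probe each public endpoint for TCP reachability. The sketch below is illustrative only: apart from srm-atlas and lugh, which are named in this report, the hostnames and ports are assumptions rather than the actual RAL service configuration.

    import socket

    # Hypothetical endpoint list: hostnames and ports here are illustrative
    # assumptions, not the recorded RAL configuration.
    ENDPOINTS = {
        "srm-atlas.gridpp.rl.ac.uk": 8443,  # SRM front end (assumed port)
        "lfc.gridpp.rl.ac.uk": 5010,        # LFC (assumed host and port)
        "lcgfts.gridpp.rl.ac.uk": 8443,     # FTS web service (assumed)
    }

    def is_reachable(host, port, timeout=5.0):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    for host, port in ENDPOINTS.items():
        state = "UP" if is_reachable(host, port) else "DOWN"
        print(f"{host}:{port} {state}")

A probe like this only confirms that something is listening; it says nothing about whether the Oracle back end behind the service is healthy.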

Review of issues during the week 30th September to 7th October.

  • Some problems seen after the Castor CIP upgrade. These were finally resolved shortly before the meeting on the 30th. Noted here for completeness.
  • Thursday 1st October: Problem for LHCb with contention for access to files in a particular service class. Resolved by Castor team.
  • Friday 2nd October: BaBar disk server gdss165 suffered two consecutive disk failures. A disk failed in this server on 1st October; the rebuild did not start automatically, but after a manual intervention the array rebuilt OK. There was another disk failure in the same system on the morning of the 2nd. The array was again rebuilt, but the server was taken out of production while this took place. The system was put back into production by the end of Friday afternoon.
  • There was a batch problem over the weekend with a log file on the PBS server exceeding its maximum size. This was resolved following a callout. A more definitive fix has been to upgrade to a 64-bit version of the PBS server; this has now been done, taking advantage of the downtime caused by the other problems. (A sketch of a log-size watchdog that could pre-empt this kind of failure follows this list.)
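
The likely failure mode here is a daemon hitting the 2 GiB file-size ceiling of a 32-bit build, which would explain why moving to a 64-bit PBS server is the definitive fix; that reading is an inference, not stated in the callout record. A minimal watchdog sketch along those lines is below. The log path is a hypothetical example, not the actual location on the RAL PBS server.

    import gzip
    import os
    import shutil

    LOG_PATH = "/var/spool/pbs/server_logs/current"  # hypothetical path
    LIMIT = 2 * 1024 ** 3          # 2 GiB ceiling for a 32-bit build
    THRESHOLD = int(LIMIT * 0.9)   # act well before the hard limit

    def rotate_if_large(path, threshold=THRESHOLD):
        """Compress the log and truncate it in place once it nears the limit."""
        if os.path.getsize(path) < threshold:
            return False
        with open(path, "rb") as src, gzip.open(path + ".1.gz", "wb") as dst:
            shutil.copyfileobj(src, dst)
        # Truncate in place rather than unlinking, so a daemon appending to
        # the same open file descriptor carries on writing from offset zero.
        with open(path, "r+b") as handle:
            handle.truncate(0)
        return True

    if __name__ == "__main__":
        if rotate_if_large(LOG_PATH):
            print(f"rotated {LOG_PATH}")

Run from cron, this would keep the log below the limit; in practice the 64-bit upgrade removes the ceiling altogether.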

Advance warning:

There are no scheduled outages declared in the GOC DB. However, we will need to reschedule the installation of the updated SRM for the CMS, LHCb and GEN Castor instances.

Table showing entries in GOC DB starting between 30th September and 7th October.

Service Scheduled? Outage/At Risk Start End Duration Reason
All Castor and CEs UNSCHEDULED OUTAGE 07/10/2009 14:00 08/10/2009 12:00 22 hours Following the failure of the disk systems that host the Oracle databases behind Castor we are having to restore the databases from backup. We will run on alternative hardware temporarily while the current hardware problems are understood and the system re-certified.
FTS, FTM, LFCs, 3D (lugh, ogma). UNSCHEDULED OUTAGE 06/10/2009 18:00 07/10/2009 14:00 20 hours Following the failure of the disk systems that host the Oracle databases behind these services we are having to restore some of the databases from backup and will migrate to alternative hardware while the underlying problems are resolved.
FTS, FTM, LFCs, 3D (lugh, ogma). UNSCHEDULED OUTAGE 06/10/2009 14:03 06/10/2009 18:00 3 hours and 57 minutes Outage while the problem on the Oracle systems behind the LFC and FTS systems is being investigated.
All Castor and CEs UNSCHEDULED OUTAGE 06/10/2009 14:00 07/10/2009 14:00 24 hours Outage to investigate the ongoing problems with the hardware behind the Castor Oracle database.
FTS, FTM, LFCs. UNSCHEDULED OUTAGE 06/10/2009 12:25 06/10/2009 14:30 2 hours and 5 minutes We have just had a problem on the Oracle systems behind the LFC and FTS services. This is being looked at.
srm-atlas, srm-lhcb, CE06, CE08 SCHEDULED OUTAGE 06/10/2009 10:00 06/10/2009 12:00 2 hours Outage for reconfiguration of the Oracle RAC behind the Atlas and LHCb Castor instances. This is to remove a faulty node within the RAC.
FTS SCHEDULED AT_RISK 06/10/2009 09:00 06/10/2009 12:00 3 hours At Risk while channels to RAL for Atlas and LHCb are drained ahead of the Castor intervention. Other channels will be unaffected.
All Castor and CEs UNSCHEDULED OUTAGE 05/10/2009 14:00 06/10/2009 14:00 24 hours Ongoing problems with the hardware that underlies the Oracle databases behind Castor are being investigated. The cause of the multiple failures is not yet understood and we are announcing an extended downtime as these investigations continue.
Castor CMS, LHCb and GEN instances plus CEs. SCHEDULED AT_RISK 05/10/2009 10:00 05/10/2009 10:30 30 minutes At Risk to upgrade the Castor SRMs.
lugh.gridpp.rl.ac.uk SCHEDULED OUTAGE 05/10/2009 09:00 05/10/2009 12:00 3 hours LHCb 3D database (LUGH) migration to 64-bit system.
lcgce06, lcgce07, lcgce08 UNSCHEDULED OUTAGE 04/10/2009 15:47 05/10/2009 15:47 24 hours Castor Databases down
lcgce01 UNSCHEDULED OUTAGE 04/10/2009 15:47 05/10/2009 15:47 24 hours Castor Databases down
All Castor and CEs UNSCHEDULED OUTAGE 04/10/2009 14:16 05/10/2009 14:00 23 hours and 44 minutes Castor down due to unknown problems with Oracle DBs. This is under investigation right now.
srm-atlas, CE06, CE08 UNSCHEDULED AT_RISK 01/10/2009 13:30 01/10/2009 14:00 30 minutes At Risk while the Castor SRM (Atlas instance) is upgraded to the latest version.
LFCs, FTS, Atlas 3D (ogma) SCHEDULED AT_RISK 01/10/2009 13:30 01/10/2009 16:30 3 hours At Risk during updating of back end systems (Oracle ASM patches and OS kernel updates).
Whole Site UNSCHEDULED AT_RISK 01/10/2009 09:00 01/10/2009 12:00 3 hours At Risk while (and after) network link to OPN doubled to increase bandwidth.
All CEs UNSCHEDULED AT_RISK 29/09/2009 16:01 30/09/2009 14:00 21 hours and 59 minutes CEs At Risk as we are failing OPS VO SAM tests, but we believe the CEs are working OK for other VOs. These failures follow modifications to the Castor Information Provider, which publishes storage information, and are under investigation.
All Castor, CEs and FTS. UNSCHEDULED AT_RISK 29/09/2009 14:48 29/09/2009 16:00 1 hour and 12 minutes We are seeing some problems on OPS VO SAM tests following the upgrade of the Castor Information Provider that publishes information on castor. Declared an At Risk while this is investigated.
All Castor SCHEDULED AT_RISK 29/09/2009 12:00 29/09/2009 14:00 2 hours Castor at risk while CIP system is upgraded.
lcgwms02 SCHEDULED OUTAGE 22/09/2009 16:00 30/09/2009 10:03 7 days, 18 hours and 3 minutes Outage for making the disks on the host hotswappable. This includes time for draining beforehand and subsequent re-installation.
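
For completeness, the Duration column above follows directly from the Start and End timestamps. A minimal sketch of the calculation, checked against the first row of the table:

    from datetime import datetime

    FMT = "%d/%m/%Y %H:%M"  # timestamp format used in the table above

    def duration(start, end):
        """Return an outage length as (hours, minutes)."""
        delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
        hours, seconds = divmod(int(delta.total_seconds()), 3600)
        return hours, seconds // 60

    # First row: 07/10/2009 14:00 to 08/10/2009 12:00 -> 22 hours
    hours, minutes = duration("07/10/2009 14:00", "08/10/2009 12:00")
    print(f"{hours} hours" + (f" and {minutes} minutes" if minutes else ""))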