Tier1 Operations Report 2009-09-30


RAL Tier1 Operations Report for 30th September 2009.

This is a review of issues since the last meeting on 23rd September.

Current operational status and issues.

  • Some problems were seen after the Castor CIP upgrade. These appear to have been confined to failures of the OPS VO SAM tests and were finally resolved this morning (30th September).
  • We await the patched version of the SRM that will fully resolve problems seen by Atlas after the SRM 2.8-0 upgrade some ten days ago. In the meantime a workaround is in place.
  • Problems on the disk system underneath the Oracle RAC are being investigated. Failures similar to those of 10th September have recurred. The Oracle patch applied since the first incident has meant that the database services have continued running and there has been no impact on the service. However, we are currently running without a mirror copy of the Castor databases.
  • One of the nodes in the Oracle RAC that supports Atlas and LHCb has a problem and keeps rebooting. This node forms part of the voting system within the RAC, and it will be necessary to take a downtime to re-configure a different node in its place. This will most likely be scheduled for next Tuesday, 6th October.
  • Swine ‘Flu. As reported last week: we continue to track this and ensure preparations are in place should a significant portion of our staff be away or have to work offsite.

Review of issues during the week 23rd to 30th September.

  • Batch system problems. At the end of last week (Thursday), significant problems were being encountered with the batch system. The PBS server was regularly crashing and we were unable to add more worker nodes into the batch system, which was running with very little (almost no) SL5 capacity. On Friday morning (25th September) the PBS server was updated (a minor version release); this resolved the issue, which appears to have been caused by a version mismatch between clients and server.
  • Problems were encountered on the Atlas Castor system on Tuesday morning (29th) when writing into the Atlas MCDISK space token. The cause was traced to some disk servers being put into production with an incorrect LSF configuration.
  • There was a double disk failure on server gdss126 (part of cmsWanOut), resulting in the loss of the filesystem. This is a d0t1 service class and, as there were no files awaiting migration, no data was lost.

Advance warning:

The following outages have been declared in the GOC DB:

  • Thursday 1st October: At Risk for LFC, FTS and Atlas 3D for updating of back-end systems (Oracle ASM patches and OS kernel updates).

Other items:

  • We are discussing arrangements for the migration of the LHCb 3D system ('LUGH') to a 64-bit Oracle system.
  • Tuesday 6th October: possible outage to resolve the problem of the rebooting node in the Oracle RAC behind the Castor Atlas and LHCb instances.

Table showing entries in GOC DB starting between 23rd and 30th September.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All CEs | UNSCHEDULED | AT_RISK | 29/09/2009 16:01 | 30/09/2009 14:00 | 21 hours and 59 minutes | CEs At Risk as we are failing OPS VO SAM tests, but believe the CEs are working OK for other VOs. These failures follow modifications to the Castor Information Provider, which publishes storage information, and are under investigation.
All CEs, SRMs and FTS | UNSCHEDULED | AT_RISK | 29/09/2009 14:48 | 29/09/2009 16:00 | 1 hour and 12 minutes | We are seeing some problems on OPS VO SAM tests following the upgrade of the Castor Information Provider that publishes information on Castor. Declared an At Risk while this is investigated.
All Castor | SCHEDULED | AT_RISK | 29/09/2009 12:00 | 29/09/2009 14:00 | 2 hours | Castor at risk while the CIP system is upgraded.
All CEs (batch) | UNSCHEDULED | AT_RISK | 25/09/2009 08:30 | 25/09/2009 11:00 | 2 hours and 30 minutes | At Risk for the batch system. We are seeing some problems in the batch scheduler and are unable to increase batch capacity beyond a certain point. These problems will be investigated during this period.
WMS02 | SCHEDULED | OUTAGE | 22/09/2009 16:00 | 30/09/2009 10:03 | 7 days, 18 hours and 3 minutes | Outage to make the disks on the host hot-swappable. This includes time for draining beforehand and subsequent re-installation.
Whole Site | SCHEDULED | AT_RISK | 22/09/2009 08:00 | 22/09/2009 11:59 | 3 hours and 59 minutes | At Risk during and following a test of the UPS.
All Castor and batch | SCHEDULED | OUTAGE | 22/09/2009 07:45 | 22/09/2009 10:00 | 2 hours and 15 minutes | Outage for tests of the UPS in the new Computer Room. During the tests, transfers to/from Castor will be suspended.
FTS | SCHEDULED | AT_RISK | 22/09/2009 07:00 | 22/09/2009 10:00 | 3 hours | Transfers to the RAL Tier1 will be drained out during this time. This is in connection with the scheduled outage of Castor while tests are made on the UPS in the new computer room. (Other transfers will continue as normal.)