Tier1 Operations Report 2011-10-19


RAL Tier1 Operations Report for 19th October 2011

Review of Issues during the week 12th to 19th October 2011.

  • The problem previously reported with the 11kV feed into R89 has been fixed. As reported before, a transparent intervention had been made on this. Since then, tests have put load on the relevant section of the bus-bar and regular checks have been made to confirm that the partial discharge (which can be heard with specialist equipment) has not returned. A recent inspection confirmed that everything was OK and the problem is now deemed fixed.
  • This morning, Wednesday 19th Oct, there was a problem with one of the database nodes behind the LFC/FTS services. This caused a problem for the FTS service, which was unavailable from around 10:30 to midday. An outage was declared in the GOC DB for the FTS service. There was only a transient (few-minute) interruption to the LFC service, as can be seen from the log files.
  • WMS02 was out of production for most of the week for maintenance work to resolve a problem with its database growing too large.

Resolved Disk Server Issues

  • gdss353 (LHCbDst, D1T0) was out of production during the working day on Monday 17th Oct following a double disk failure.

Current operational status and issues.

  • Atlas reports slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetric performance). CMS seems to experience this as well (but between RAL and foreign T2s). The pattern of asymmetric flows appears complex and is being actively investigated. We are in contact with people in the US investigating a similar issue. The types of things being studied include disk server TCP/IP configurations; a sketch of the kind of settings involved is given after this list.
  • We continue to work with the perfSONAR network test system, including investigating some anomalies that have been seen.
  • A Post Mortem has been produced for the problems seen a few weeks ago with the database behind the Atlas Castor instance. This can be seen at:

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20110927_Atlas_Castor_Outage_DB_Inconsistent
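
As an illustration of the kind of check involved in the TCP/IP investigation mentioned above (a minimal sketch only, not the actual procedure used on the disk servers), the script below reads a handful of standard Linux TCP sysctls; the particular parameters listed are an assumption made for this example.

  #!/usr/bin/env python
  # Minimal sketch: report the kernel TCP settings most commonly examined
  # when investigating WAN transfer rates. The selection of parameters is
  # an assumption for illustration, not the list being studied at RAL.

  import os

  SYSCTLS = [
      "net/core/rmem_max",                 # maximum receive socket buffer (bytes)
      "net/core/wmem_max",                 # maximum send socket buffer (bytes)
      "net/ipv4/tcp_rmem",                 # min/default/max TCP receive buffer
      "net/ipv4/tcp_wmem",                 # min/default/max TCP send buffer
      "net/ipv4/tcp_window_scaling",       # TCP window scaling enabled?
      "net/ipv4/tcp_congestion_control",   # congestion control algorithm in use
  ]

  def read_sysctl(name):
      """Return the current value of a sysctl, or None if it is not present."""
      path = os.path.join("/proc/sys", name)
      try:
          with open(path) as f:
              return f.read().strip()
      except IOError:
          return None

  if __name__ == "__main__":
      for name in SYSCTLS:
          value = read_sysctl(name)
          print("%-35s %s" % (name.replace("/", "."), value or "not available"))

Comparing the output of such a script between servers that show good and poor transfer rates is one simple way to spot configuration differences.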

Ongoing Disk Server Issues

  • Thursday 13th Sep: FSPROBE reported a problem on gdss396 (CMSWanIn). This resulted in the loss of a single CMS file (since copied in from elsewhere). The server has crashed under test and this is being followed up.
  • gdss296 (CMSFarmRead) has been out of production since 20th August. This server has also crashed under test and this is being followed up.
  • gdss456 (AtlasDataDisk) failed with a read-only file system on Wednesday 28th September. Following draining, investigations are ongoing for this system.
  • On Thursday (29th Sep) FSPROBE reported a problem on gdss295 (CMSFarmRead). This server has been put into test and investigations are ongoing.

Notable Changes made this last week

  • A gLite update has been applied to CE09.

Forthcoming Work & Interventions

  • Tuesday 1st November (TBC). Microcode updates for the tape libraries. No tape access from 09:00 to 13:00. (Delayed from Tuesday 18th Oct.)
  • Tuesday 1st November: There will be an intervention on the network link into RAL that will last up to an hour. This should not (in theory) affect our link. We plan to declare a site "Warning" and will drain out the FTS and pause batch work.

Declared in the GOC DB

  • None

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced:

  • There are plans to move part of the cooling system onto the UPS supply; this may require a complete outage (including systems on the UPS).
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure. This will only proceed once the problem that caused the cancellation of the first stage of this work last week is understood and fixed.
  • Networking change required to extend range of addresses that route over the OPN.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Further updates to CEs (CE06 de-commissioning; remaining gLite updates on CE09 outstanding).
  • Replace hardware running Castor Head Nodes (aimed for end of year).

Entries in GOC DB starting between 12th and 19th October 2011.

There was one unscheduled outage in this last week. This was for the FTS following the problem with the database behind that service this morning.

Service        | Scheduled?  | Outage/At Risk | Start            | End              | Duration                        | Reason
lcgftm, lcgfts | UNSCHEDULED | OUTAGE         | 19/10/2011 10:30 | 19/10/2011 12:30 | 2 hours                         | Problem with database behind FTS service. Now under investigation.
lcgwms02       | SCHEDULED   | WARNING        | 13/10/2011 16:00 | 19/10/2011 12:45 | 5 days, 20 hours and 45 minutes | Drain and MySQL maintenance
lcgce09        | SCHEDULED   | WARNING        | 13/10/2011 14:00 | 18/10/2011 16:05 | 5 days, 2 hours and 5 minutes   | Draining and gLite update