Tier1 Operations Report 2009-11-11

RAL Tier1 Operations Report for 11th November 2009.

This is a review of issues since the last meeting on 4th November.

Current operational status and issues.

  • There is an ongoing issue with batch jobs ending up on the 'wrong' nodes. On Thursday (5th November) the batch (PBS) server was restarted to try and fix the problem, which appears to be load related: it seems to occur only when the batch system is fully loaded. The PBS server restart was itself problematic. Some weeks ago a 64-bit version of the server had been prepared for use, but it turned out that the server had never been restarted to pick it up and we were still running the 32-bit version. In order to restart the server it was necessary to gather together the appropriate files for the 32-bit version. (The transition between the 32-bit and 64-bit versions of the server is not straightforward, as the format of the files holding the state information differs between the two; see the sketch after this list for one way to check which build is actually running.)
  • The main ongoing issue is the investigation into the cause of the problems with the databases behind the Castor service. Work continues on improving the resilience and recovery options for the current temporary arrangement. In parallel, work has been going on to identify the fault, but this has not yet concluded. Two of the (problematic) disk arrays are being set up in the 'LPD' room in R89 and certification tests are being run on them, with a view to moving services back onto these systems if/when confidence is regained. A further disk array has been set up, and is being checked out, to act as a spare in case we suffer a failure of one of the arrays temporarily hosting the databases.
  • Updates (new kernels) following the recent security warning are underway.
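
The note below is a minimal sketch, not part of the report, of one way to confirm which build of the PBS server is actually running: it locates the pbs_server process and reads the ELF class byte of the binary it was started from. The process name pbs_server comes from the item above; the script itself, and the assumption of a Linux host with /proc available, are illustrative only.

    import os

    def elf_class(path):
        # The fifth byte of an ELF header is 1 for 32-bit and 2 for 64-bit objects.
        with open(path, 'rb') as f:
            header = f.read(5)
        if header[:4] != b'\x7fELF':
            return 'not an ELF binary'
        return {b'\x01': '32-bit', b'\x02': '64-bit'}.get(header[4:5], 'unknown')

    def pids_named(name):
        # Walk /proc and keep the numeric entries whose status file reports 'name'.
        pids = []
        for entry in os.listdir('/proc'):
            if not entry.isdigit():
                continue
            try:
                with open('/proc/%s/status' % entry) as f:
                    if f.readline().split()[1] == name:
                        pids.append(entry)
            except (IOError, OSError, IndexError):
                continue
        return pids

    if __name__ == '__main__':
        # Needs to run as root (or the process owner) to follow /proc/<pid>/exe,
        # which points at the binary the process was actually started from.
        for pid in pids_named('pbs_server'):
            exe = os.readlink('/proc/%s/exe' % pid)
            print('pid %s: %s is %s' % (pid, exe, elf_class(exe)))

Reading the ELF header of /proc/<pid>/exe (rather than the file installed on disk) is what distinguishes the build that is actually running from the one that would be picked up at the next restart.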

Review of Issues during week 4th to 11th November.

  • There have been problems with the tape system during the last week. Tim worked with the hardware engineer to resolve this. The problem, which appeared within Castor as a failure to migrate data to tape, was caused by garbage being written to the end of some tapes, which then led to problems when those tapes were used. The workaround was to mark the affected tapes (around 150 of them) as read-only. The fault has now been traced to a single tape drive. The data on the affected tapes has been checked and can still be read; there is no data loss that we are aware of.
  • On Tuesday 10th November there was a problem with migration to tape for CMS that was fixed by a restart of the Castor 'mighunter' process on that instance.
  • Last Wednesday (4th Nov. - ongoing at the time of last week's meeting) there were problems with inbound transfers from Tier2s. This was traced to a networking problem. A configuration change on the 'UK Light' router made on Tuesday morning (the day before) had an unexpected side effect. The change was backed out during Wednesday afternoon and transfers from Tier2s resumed.
  • Disk server gdss403, part of the Atlas MCDISK space token, had faulty memory and was out of production for some days. There was concern that this might have corrupted files on the server. Checksums were recalculated and compared against the information held by Atlas (a sketch of this kind of check follows this list); this showed a small number of corrupt files. Following these checks the server was returned to production.
  • In the early hours of Tuesday 10th November there was a failure of the Castor Information Provider (CIP). The virtual node it was running on stopped working and the service went down shortly after midnight. Following a restart we were again passing SAM tests at around 4am.
  • Scheduled 'At Risk' periods were announced for quarterly Oracle patches to be applied to the Castor and 3D databases. During the updates for the Castor service (on Tuesday 10th November) some problems were encountered reconnecting to the databases and there was a short service outage.
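
For illustration only, the sketch below shows the kind of checksum comparison described in the gdss403 item above: it recomputes Adler-32 checksums (a checksum commonly used for grid file integrity) for local files and compares them with values taken from a catalogue dump. The one "checksum path" pair per line dump format, and the script itself, are assumptions rather than the actual procedure used.

    import sys
    import zlib

    def adler32_of(path, blocksize=1024 * 1024):
        # Stream the file in blocks so large files do not need to fit in memory.
        value = 1  # Adler-32 is seeded with 1
        with open(path, 'rb') as f:
            block = f.read(blocksize)
            while block:
                value = zlib.adler32(block, value)
                block = f.read(blocksize)
        return value & 0xffffffff  # force an unsigned 32-bit result

    def compare(dump_file):
        # Each line of the (assumed) dump holds '<8-hex-digit checksum> <path>'.
        mismatches = 0
        with open(dump_file) as f:
            for line in f:
                expected, path = line.split(None, 1)
                path = path.strip()
                actual = '%08x' % adler32_of(path)
                if actual != expected.lower():
                    mismatches += 1
                    print('MISMATCH %s expected=%s got=%s' % (path, expected, actual))
        print('%d mismatching file(s)' % mismatches)

    if __name__ == '__main__':
        compare(sys.argv[1])

Run as, for example, "python check_adler32.py catalogue_dump.txt"; any file whose recomputed checksum disagrees with the catalogue value is reported as a mismatch.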

Advance warning:

  • None

Table showing entries in GOC DB starting between 4th and 11th November.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All CEs | SCHEDULED | AT_RISK | 11/11/2009 09:30 | 11/11/2009 12:00 | 2 hours and 30 minutes | At Risk during application of OS kernel updates.
lhcb-lfc, lugh, ogma | SCHEDULED | AT_RISK | 11/11/2009 09:00 | 11/11/2009 13:00 | 4 hours | At Risk during application of quarterly Oracle patches.
All Castor | SCHEDULED | AT_RISK | 10/11/2009 13:00 | 10/11/2009 17:00 | 4 hours | Extension to At Risk for application of quarterly Oracle patches in order to reboot nodes to update OS kernel.
lfc, lfc-atlas, fts, ftm | SCHEDULED | AT_RISK | 10/11/2009 09:00 | 10/11/2009 12:00 | 3 hours | Investigation into a problem on the Storage Area Network used to connect to the disk array hosting the LFC and FTS databases.
All Castor | SCHEDULED | AT_RISK | 10/11/2009 09:00 | 10/11/2009 13:00 | 4 hours | At Risk during application of quarterly Oracle patches.
All Castor & FTS | UNSCHEDULED | AT_RISK | 04/11/2009 13:29 | 04/11/2009 16:13 | 2 hours and 44 minutes | We are investigating a problem with inbound transfers from Tier2s at the moment. Outbound traffic, along with inbound traffic from T0 and T1s (except NDGF), is working OK.