Tier1 Operations Report 2009-10-28

RAL Tier1 Operations Report for 28th October 2009.

This is a review of issues since the last meeting on 21st October.

Current operational status and issues.

  • The main ongoing issue is the investigation into the cause of the problems encountered with the databases behind the Castor service. Overall, Castor itself has worked well this week and all services are up. The databases are hosted on alternative hardware which, while it does not offer the full level of resilience we would have liked, is known to be reliable; the full Castor service is available. Work is also ongoing to check, and where appropriate improve, the resilience of the disk systems now in use for the databases (see the sketch after this list), and a further spare system is being checked out in case one of these 'temporary' systems fails. The Tier1 'disaster management process' continues to track this issue.
  • Air Conditioning Problems: The air conditioning has worked well for many weeks; however, we continue to track the underlying issues that led to the outages a couple of months ago. Plans are in progress to modify the BMS (Building Management System) so that it is more robust (at present a restart of the BMS stops the air conditioning). There are also plans to increase the number of chillers.
  • Condensation water dripping into the tape robot: This also continues to be followed up. Gravity drains were installed in both condenser drains in the first floor atrium. Water detectors have been installed beneath both condensers and connected to alarms.
  • Swine 'Flu: As previously reported, we continue to track this and to ensure preparations are in place should a significant portion of our staff be away or have to work offsite. We note that case numbers are increasing nationally and that a vaccine is now available.
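The disk-resilience checks mentioned in the first bullet lend themselves to simple automated monitoring. The following is a minimal sketch only, and it assumes (without confirmation from this report) that the 'temporary' database disk systems use Linux software RAID exposed via /proc/mdstat; it simply flags any md array running with a missing member.

    # check_md.py -- illustrative only; assumes Linux software RAID (md) is in use
    import re

    MDSTAT = "/proc/mdstat"   # standard kernel status file for md arrays

    def degraded_arrays(path=MDSTAT):
        """Return the names of md arrays whose status line shows a missing member."""
        bad, current = [], None
        for line in open(path):
            m = re.match(r"^(md\d+)\s*:", line)
            if m:
                current = m.group(1)
            # A healthy status line ends like "[2/2] [UU]"; an underscore marks a failed member.
            if current and re.search(r"\[\d+/\d+\]\s+\[[U_]*_[U_]*\]", line):
                bad.append(current)
        return bad

    if __name__ == "__main__":
        failing = degraded_arrays()
        if failing:
            print("DEGRADED:", ", ".join(failing))
        else:
            print("All md arrays have their full complement of members")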

Review of Issues during week 21st to 28th October.

  • There was an unscheduled outage of Castor on Monday afternoon (26th October). Over the weekend there had been a failure of a Fibre Channel port on the Storage Area Network, which blocked access to a disk used for making backups. Once the hardware problem was fixed on Monday morning it was necessary to reboot some nodes to regain access to the disk. None of this affected the services being delivered; however, following the reboots the distribution of the various database processes across the nodes was not optimal, which led to poor performance and then to a severe degradation of service on Monday afternoon. The resolution was to re-assign the database processes to different nodes to re-balance the system (see the sketch after this list).
  • Disk server gdss143.gridpp.rl.ac.uk was unavailable from Friday 23rd to Monday 26th October due to a hardware fault. This machine is part of atlasSimStrip.
  • A new version of the Castor SRM (2.8-2) was rolled out to all Castor instances on Monday and Tuesday this week (26/27 October).
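The re-balancing described in the first bullet above amounts to moving database services back onto their preferred cluster nodes. The sketch below is an illustration of that idea only, assuming an Oracle RAC deployment managed with 10g/11g-style srvctl commands; the database name, service names and instance names are hypothetical, and this is not the actual procedure used by the database team.

    # rebalance_services.py -- illustrative sketch; all names below are hypothetical
    import subprocess

    DB = "castor_db"                       # hypothetical db_unique_name
    PREFERRED = {                          # service -> instance it should normally run on
        "castor_svc1": "castordb1",
        "castor_svc2": "castordb2",
    }

    def running_instances(db, service):
        """Ask clusterware where a service currently runs (assumes it is running somewhere)."""
        out = subprocess.run(
            ["srvctl", "status", "service", "-d", db, "-s", service],
            capture_output=True, text=True, check=True,
        ).stdout
        # Typical output: "Service castor_svc1 is running on instance(s) castordb2"
        return out.strip().rsplit(" ", 1)[-1].split(",")

    def rebalance(db, preferred):
        for service, target in preferred.items():
            current = running_instances(db, service)
            if target not in current:
                # Move the service back to its preferred instance.
                subprocess.run(
                    ["srvctl", "relocate", "service", "-d", db,
                     "-s", service, "-i", current[0], "-t", target],
                    check=True,
                )

    if __name__ == "__main__":
        rebalance(DB, PREFERRED)

In practice such a layout would also be protected by defining preferred and available instances for each service, since relocated services do not automatically fall back after a node reboot, which is consistent with the non-optimal distribution seen on Monday.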

Advance warning:

  • WMS03 outage from 30th October to 5th November, to enable hot-swappable disks. This period includes time to drain jobs from the service.

Table showing entries in GOC DB starting between 21st and 28th October.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-alice, srm-cms, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos | SCHEDULED | AT_RISK | 27/10/2009 10:00 | 27/10/2009 12:00 | 2 hours | At Risk during update to SRM 2.8-2
All Castor & CEs | UNSCHEDULED | OUTAGE | 26/10/2009 15:35 | 26/10/2009 16:45 | 1 hour 10 minutes | Investigating database access problems
srm-atlas, srm-lhcb | SCHEDULED | AT_RISK | 26/10/2009 10:00 | 26/10/2009 12:00 | 2 hours | At Risk during update to SRM version 2.8-2