Tier1 Operations Report 2009-11-25


RAL Tier1 Operations Report for 25th November 2009.

This is a review of issues since the last meeting on 18th November.

Review of Issues during week 18th to 25th November.

  • The problem of batch work occasionally ending up on the 'wrong' node has been monitored. An analysis at the end of last week found no jobs running on the 'wrong' hosts during the previous week. We regard this as solved, although we will continue to monitor.
  • The problem on both WMS01 & WMS02 last week (18th November) has been resolved. The cause was the maximum number of files allowed in a folder (effectively a limit on the number of jobs in the WMS) being reached. This was fixed by increasing the limit and applying a more aggressive clean-up; a monitoring sketch along these lines follows this list. This is a known problem in the WMS, although the patch is still in certification.
  • There was a short At Risk on the Atlas Castor instance on Friday 20th November. A problem with LSF filling up a partition had been found and trapped before it became critical; the At Risk was necessary to resolve it.
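
The WMS folder-limit problem above is essentially a case of a job directory accumulating more entries than the filesystem (or configured) maximum allows. As a rough illustration only, the sketch below counts entries in a sandbox directory and warns when a threshold is approached; the path, limit and warning fraction are assumptions, not the actual WMS configuration.

    #!/usr/bin/env python
    """Warn when a WMS job directory approaches its maximum file count.

    The path and limits below are illustrative assumptions, not the
    actual WMS configuration.
    """
    import os
    import sys

    SANDBOX_DIR = "/var/glite/SandboxDir"   # hypothetical sandbox location
    MAX_ENTRIES = 32000                     # hypothetical per-folder limit
    WARN_FRACTION = 0.8                     # warn at 80% of the limit

    def count_entries(path):
        """Return the number of directory entries (roughly one per job)."""
        try:
            return len(os.listdir(path))
        except OSError as err:
            sys.stderr.write("Cannot read %s: %s\n" % (path, err))
            return 0

    def main():
        n = count_entries(SANDBOX_DIR)
        print("%s contains %d entries (limit %d)" % (SANDBOX_DIR, n, MAX_ENTRIES))
        if n >= WARN_FRACTION * MAX_ENTRIES:
            # In production this would raise an alarm / trigger a clean-up.
            print("WARNING: approaching folder limit - schedule a clean-up")
            return 1
        return 0

    if __name__ == "__main__":
        sys.exit(main())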

Current operational status and issues.

  • Long standing Database Disk array problem: Following the discovery of noise on the current supplied by the UPS, we are awaiting a test of the UPS bypass. Such a test has been scheduled twice in the last week but had to be cancelled each time. We are currently looking to do this on the 7th December (to be confirmed). This test is important to confirm that the UPS (or rather the mismatch between it and the load) is the cause of the noise. In parallel, work is ongoing to prepare a solution.
  • There was a problem on the Atlas Castor instance during the evening of Monday 23rd and the morning of Tuesday 24th November, caused by high load on the Atlas SRM database leading to inefficiencies. This has been alleviated by a minor reconfiguration of the SRM and by moving the Oracle processes to another node in the RAC with more memory. The number of Atlas FTS channels from CERN to RAL was also reduced to prevent a recurrence and has not yet been restored to its previous value. The situation is being monitored, including checking the effect on any backlog of file transfers as LHC data arrives.
  • There is a problem with Castor Disk-to-Disk copies for LHCb from the LHCbUser Service Class. This is under investigation.
  • A mismatch between tape contents and Castor meta-data is being investigated. This dates from 2007 and has been found for CMS data; so far investigations have not found other evidence of the problem, and it is still believed to affect only a small number of files. A sketch of the kind of consistency check involved follows this list.
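
Investigating the tape/meta-data mismatch above amounts to comparing what is actually on tape with what the Castor name server records for the same files. The following is a minimal consistency-check sketch under assumed inputs: two plain-text dumps (one file name and size per line) whose names and format are illustrative only, not the actual Castor tooling.

    #!/usr/bin/env python
    """Compare a tape-content listing with a Castor name-server dump.

    Both inputs are assumed to be plain-text dumps with one entry per line:
        <file name> <size in bytes>
    The file names and dump format are illustrative assumptions.
    """
    import sys

    def load_listing(path):
        """Return a dict mapping file name -> size from a dump file."""
        entries = {}
        with open(path) as handle:
            for line in handle:
                parts = line.split()
                if len(parts) >= 2:
                    entries[parts[0]] = int(parts[1])
        return entries

    def main(tape_dump, nameserver_dump):
        on_tape = load_listing(tape_dump)
        in_castor = load_listing(nameserver_dump)

        # Files the name server knows about but missing from the tape listing.
        missing = sorted(set(in_castor) - set(on_tape))
        # Files present in both dumps whose recorded sizes disagree.
        mismatched = sorted(f for f in set(on_tape) & set(in_castor)
                            if on_tape[f] != in_castor[f])

        print("Missing from tape: %d" % len(missing))
        for name in missing:
            print("  MISSING    %s" % name)
        print("Size mismatches: %d" % len(mismatched))
        for name in mismatched:
            print("  MISMATCH   %s (tape %d, castor %d)"
                  % (name, on_tape[name], in_castor[name]))

    if __name__ == "__main__":
        if len(sys.argv) != 3:
            sys.stderr.write("usage: check_tape.py <tape_dump> <nameserver_dump>\n")
            sys.exit(1)
        main(sys.argv[1], sys.argv[2])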

Advanced warning:

  • Monday 7th December. At Risk for test of UPS bypass. (To be confirmed.)

Table showing entries in GOC DB starting between 18th and 25th November.

Service   | Scheduled?  | Outage/At Risk | Start            | End              | Duration              | Reason
srm-atlas | UNSCHEDULED | AT_RISK        | 20/11/2009 10:00 | 20/11/2009 10:30 | 30 minutes            | Short At Risk for minor re-configuration of LSF scheduler used within the Atlas Castor instance.
All site  | UNSCHEDULED | AT_RISK        | 19/11/2009 10:00 | 19/11/2009 11:00 | 1 hour                | At Risk for test bypass of Uninterruptible Power Supply extended owing to delay in making test.
All site  | SCHEDULED   | AT_RISK        | 19/11/2009 08:30 | 19/11/2009 10:00 | 1 hour and 30 minutes | At Risk during test bypass of Uninterruptible Power Supply.