Tier1 Operations Report 2009-11-25


RAL Tier1 Operations Report for 25th November 2009.

This is a review of issues since the last meeting on 18th November.

Review of Issues during week 18th to 25th November.

  • The problem of batch work occasionally ending up on the 'wrong' node has been monitored. An analysis at the end of last week found no jobs running on the 'wrong' hosts during the previous week. We regard this as solved, although we will continue to monitor.
  • The problem on both WMS01 & WMS02 last week (18th November) has been resolved. The cause was the maximum number of files allowed in a folder (effectively a limit on the number of jobs in the WMS) being reached. This was fixed by increasing the limit and applying a more aggressive clean-up; a monitoring sketch along these lines follows this list. This is a known problem in the WMS, although the patch is still in certification.
  • There was a short At Risk on the Atlas Castor instance on Friday 20th November. A problem with LSF filling up a partition had been found and trapped before it became critical; the At Risk was necessary to resolve it.
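
The WMS folder-limit problem above is essentially a case of a job directory accumulating more entries than the filesystem (or configured) maximum allows. As a rough illustration only, the sketch below counts entries in a sandbox directory and warns when a threshold is approached; the path, limit and warning fraction are assumptions, not the actual WMS configuration.

    #!/usr/bin/env python
    """Warn when a WMS job directory approaches its maximum file count.

    The path and limits below are illustrative assumptions, not the
    actual WMS configuration.
    """
    import os
    import sys

    SANDBOX_DIR = "/var/glite/SandboxDir"   # hypothetical sandbox location
    MAX_ENTRIES = 32000                     # hypothetical per-folder limit
    WARN_FRACTION = 0.8                     # warn at 80% of the limit

    def count_entries(path):
        """Return the number of directory entries (roughly one per job)."""
        try:
            return len(os.listdir(path))
        except OSError as err:
            sys.stderr.write("Cannot read %s: %s\n" % (path, err))
            return 0

    def main():
        n = count_entries(SANDBOX_DIR)
        print("%s contains %d entries (limit %d)" % (SANDBOX_DIR, n, MAX_ENTRIES))
        if n >= WARN_FRACTION * MAX_ENTRIES:
            # In production this would raise an alarm / trigger a clean-up.
            print("WARNING: approaching folder limit - schedule a clean-up")
            return 1
        return 0

    if __name__ == "__main__":
        sys.exit(main())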

Current operational status and issues.

  • Long standing Database Disk array problem: Following the discovery of noise on the current supplied by the UPS, we are awaiting a test of the UPS bypass. Such a test has been scheduled twice in the last week but had to be cancelled each time. We are currently looking to do this on the 7th December (to be confirmed). This test is important to confirm that the UPS (or rather the mismatch between it and the load) is the cause of the noise. In parallel, work is ongoing to prepare a solution.
  • There was a problem on the Atlas Castor instance during the evening of Monday 23rd and the morning of Tuesday 24th November, caused by high load on the Atlas SRM database leading to inefficiencies. This has been alleviated by a minor reconfiguration of the SRM and by moving the Oracle processes to another node in the RAC with more memory. The number of Atlas FTS channels from CERN to RAL was also reduced to prevent a recurrence and has not yet been restored to its previous value. The situation is being monitored, including checking the effect on any backlog of file transfers as LHC data arrives.
  • There is a problem with Castor Disk-to-Disk copies for LHCb from the LHCbUser Service Class. This is under investigation.
  • A mismatch between tape contents and Castor meta-data is being investigated. This dates from 2007 and has been found for CMS data; so far investigations have not found other evidence of the problem, and it is still believed to affect only a small number of files. A sketch of the kind of consistency check involved follows this list.
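
Investigating the tape/meta-data mismatch above amounts to comparing what is actually on tape with what the Castor name server records for the same files. The following is a minimal consistency-check sketch under assumed inputs: two plain-text dumps (one file name and size per line) whose names and format are illustrative only, not the actual Castor tooling.

    #!/usr/bin/env python
    """Compare a tape-content listing with a Castor name-server dump.

    Both inputs are assumed to be plain-text dumps with one entry per line:
        <file name> <size in bytes>
    The file names and dump format are illustrative assumptions.
    """
    import sys

    def load_listing(path):
        """Return a dict mapping file name -> size from a dump file."""
        entries = {}
        with open(path) as handle:
            for line in handle:
                parts = line.split()
                if len(parts) >= 2:
                    entries[parts[0]] = int(parts[1])
        return entries

    def main(tape_dump, nameserver_dump):
        on_tape = load_listing(tape_dump)
        in_castor = load_listing(nameserver_dump)

        # Files the name server knows about but missing from the tape listing.
        missing = sorted(set(in_castor) - set(on_tape))
        # Files present in both dumps whose recorded sizes disagree.
        mismatched = sorted(f for f in set(on_tape) & set(in_castor)
                            if on_tape[f] != in_castor[f])

        print("Missing from tape: %d" % len(missing))
        for name in missing:
            print("  MISSING    %s" % name)
        print("Size mismatches: %d" % len(mismatched))
        for name in mismatched:
            print("  MISMATCH   %s (tape %d, castor %d)"
                  % (name, on_tape[name], in_castor[name]))

    if __name__ == "__main__":
        if len(sys.argv) != 3:
            sys.stderr.write("usage: check_tape.py <tape_dump> <nameserver_dump>\n")
            sys.exit(1)
        main(sys.argv[1], sys.argv[2])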

Advanced warning:

  • Monday 7th December. At Risk for test of UPS bypass. (To be confirmed.)

Table showing entries in GOC DB starting between 18th and 25th November.

Service   | Scheduled?  | Outage/At Risk | Start            | End              | Duration              | Reason
srm-atlas | UNSCHEDULED | AT_RISK        | 20/11/2009 10:00 | 20/11/2009 10:30 | 30 minutes            | Short At Risk for minor re-configuration of LSF scheduler used within the Atlas Castor instance.
All site  | UNSCHEDULED | AT_RISK        | 19/11/2009 10:00 | 19/11/2009 11:00 | 1 hour                | At Risk for test bypass of Uninterruptible Power Supply extended owing to delay in making test.
All site  | SCHEDULED   | AT_RISK        | 19/11/2009 08:30 | 19/11/2009 10:00 | 1 hour and 30 minutes | At Risk during test bypass of Uninterruptible Power Supply.