Tier1 Operations Report 2010-01-06

RAL Tier1 Operations Report for 6th January 2010.

This is a review of issues since the last meeting on 23rd December 2009.

Review of Issues during the period 23rd December 2009 to 6th January 2010.

Overall, systems worked well over the holiday. Cover was provided by the usual 'on-call' mechanism coupled with some daily checks. There was one visit to site to replace failed disk drives.

  • On 26th December there was a problem that left Castor unavailable for a few hours. It was fixed by moving the Castor nameserver database onto another node in the RAC.
  • On 28th December we responded to an Atlas GGUS ticket. The batch system was killing Atlas jobs. The maximum permissible wallclock time was extended.
  • On 30th December a problem on the SRMs left Castor for Atlas unavailable for a few hours.
  • On 31st December / 1st January, a problem on the WMS led to them being unavailable to accept jobs at times. A temporary fix for this problem (a cleanup) had been in place while a formal bugfix is awaited. Since then the temporary fix has been tightened up (a more aggressive cleanup); we are still waiting for the final fix.

Current operational status and issues.

  • FSPROBE errors were reported on gdss79 (LHCbDst) on Thursday 17th December. The server was taken out of production for investigation. We have received a list of checksums from LHCb and are in the process of comparing these against the checksums of the files on disk (a sketch of such a comparison is given after this list).
  • Over the holiday (28th December) FSPROBE errors were reported on another LHCb disk server, gdss70 (LHCbMDst - D1T1). We have sent a list of the files on that server to LHCb and asked for checksums for these as well. We believe these FSPROBE errors are due to a faulty batch of three replacement disks: two have caused these problems (and have since been replaced); the third failed at initial use.
  • On 31st December a recurrence of the Castor/Oracle 'BigID' problem was seen. This is under investigation.
  • Long-standing Database Disk array problem: The test of the UPS bypass on the morning of Tuesday 5th January (yesterday) established that:
    • The operation of switching in and out the bypass does not disrupt systems in the UPS room.
    • The noise seen on the current in the UPS room disappeared, confirming the UPS as the cause.
  • The following items are unchanged since the last meeting:
    • There is a problem with Castor disk-to-disk copies for LHCb from the LHCbUser Service Class. This is still under investigation.
    • A mismatch between tape contents and Castor meta-data is being investigated. This dates from 2007 and has been found for CMS data; so far investigations have not found other evidence of this problem. It affects 11 tapes containing a total of 983 files.
    • A configuration issue on the CREAM CE (lcgce01) caused problems for Monte-Carlo production jobs for CMS. We are awaiting application of the fix.
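
The checksum comparison mentioned above amounts to checksumming each file on the affected partition and matching the result against the value supplied by the experiment. The sketch below is illustrative only (not the tooling actually used): it assumes adler32 checksums, and the data directory path, the list file name 'lhcb_checksums.txt' and its "<relative-path> <checksum-hex>" line format are hypothetical.

  #!/usr/bin/env python
  """Minimal sketch: compare adler32 checksums of files on a disk server
  against a list supplied by the VO. Paths and file format are assumptions."""
  import os
  import zlib

  DATA_DIR = "/exportstage/lhcbdst"      # hypothetical data partition on the disk server
  CHECKSUM_LIST = "lhcb_checksums.txt"   # hypothetical checksum list received from LHCb

  def adler32_of(path, blocksize=1024 * 1024):
      """Compute the adler32 checksum of a file, reading it block by block."""
      value = 1  # adler32 starting value
      with open(path, "rb") as f:
          while True:
              block = f.read(blocksize)
              if not block:
                  break
              value = zlib.adler32(block, value)
      return value & 0xFFFFFFFF  # force an unsigned 32-bit result

  def main():
      problems = []
      with open(CHECKSUM_LIST) as listing:
          for line in listing:
              name, expected = line.split()
              path = os.path.join(DATA_DIR, name)
              if not os.path.isfile(path):
                  problems.append((name, "file missing on disk"))
              elif adler32_of(path) != int(expected, 16):
                  problems.append((name, "checksum mismatch"))
      for name, reason in problems:
          print("%s: %s" % (name, reason))

  if __name__ == "__main__":
      main()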

Advance warning:

  • Currently nothing declared in the GOC DB. However, we are in the process of establishing further plans for January. Significant work to be scheduled includes:
    • Migrating the Oracle databases for Castor, LFC, FTS & 3D back to their original disk arrays. The tests yesterday essentially confirmed that the problem does not reside with the disk arrays themselves. The disk arrays will run on a 'clean' (non-UPS) power supply until the UPS issue is fully fixed. The migration of the LFC will require an interruption to the service; the others can be done during an 'At Risk'.
    • A farm drain to enable an update to the batch engine and disk server checks.

Table showing entries in GOC DB starting between 23rd December 2009 and 6th January 2010.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor, CEs, LFC (lfc-atlas & lfc.gridpp), FTS & 3D (lhcb-lfc, lugh & ogma) | SCHEDULED | OUTAGE | 05/01/2010 07:30 | 05/01/2010 12:28 | 4 hours and 58 minutes | Test of UPS bypass. During this test all database systems will be shut down. Castor, LFC, 3D and FTS services will be unavailable.
lcgfts.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 05/01/2010 06:30 | 05/01/2010 07:30 | 1 hour | Drain of FTS ahead of planned outage.
lcgce01.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 22/12/2009 17:00 | 23/12/2009 09:00 | 16 hours | Following the intervention on 22/12/2009, we have a problem on the Castor GEN instance that is being investigated.
ce.ngs.rl.ac.uk, lcgce02.gridpp.rl.ac.uk, srm-alice.gridpp.rl.ac.uk, srm-dteam.gridpp.rl.ac.uk, srm-hone.gridpp.rl.ac.uk, srm-ilc.gridpp.rl.ac.uk, srm-mice.gridpp.rl.ac.uk, srm-minos.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 22/12/2009 16:00 | 23/12/2009 09:00 | 17 hours | Following the intervention on 22/12/2009, we have a problem on the Castor GEN instance that is being investigated.
lcgvo0597.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 17/12/2009 10:00 | 31/01/2010 18:00 | 45 days, 8 hours | This SL4 VOBOX machine is being replaced by an SL5 VOBOX (lcgvo-alice). A long scheduled downtime has been created; the machine will then be removed from the GOC DB.