Tier1 Operations Report 2009-12-23

RAL Tier1 Operations Report for 23rd December 2009.

This is a review of issues since the last meeting on 16th December.

Review of Issues during week 16th to 23rd December.

  • A repack operation found a problem with a single ILC file, which has been declared lost. This file dates from December 2008.
  • Following the intervention to reboot many nodes (including all disk servers) all services except the Castor 'GEN' instance were brought up on time on Tuesday 22nd December. The GEN instance was fixed later in the evening and the outage in the GOC DB ended on the morning of Wednesday 23rd December.
  • The problem with the UKlight link to Lancaster reported at last week's meeting was resolved on Thursday 17th December. We did get another report of a problem on the link during Sunday morning (20th Dec.) from our monitoring. However, the on-call person was unable to find anything wrong and the alarm cleared late morning.
  • The migration of the 'LSF licensing triplet' for Castor to new hardware was carried out as planned on Thursday 17th December. Also, the Castor Information Provider (CIP) was updated on Monday 21st December so that it runs on newer, more resilient hardware and provides information for T2K.

Current operational status and issues.

  • FSPROBE errors were reported on gdss79 (LHCbDst) on Thursday 17th December. The server has been taken out of production for investigation, and LHCb have been asked for a list of checksums to compare against.
  • The following items are unchanged since last week:
    • Long standing Database Disk array problem. (No update since last week's report).
    • There is a problem within Castor Disk-to-Disk copies for LHCb from the LHCbUser Service Class. This is still under investigation.
    • A mismatch between tape contents and Castor meta-data is being investigated. This dates from 2007 and has been found for CMS data. So far investigations have not found other evidence of this problem. It affects 11 tapes holding a total of 983 files.
    • Configuration issue on CREAM CE (lcgce01) caused problems for Monte-Carlo production jobs for CMS. Awaiting application of the fix.
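
The checksum comparison planned for gdss79 could be sketched along the following lines. This is a minimal illustration only. It assumes the checksums are Adler-32 values given as hexadecimal strings and that the list arrives as a mapping from file path to expected checksum. The function names `adler32_hex` and `find_mismatches` are hypothetical and are not part of Castor or any LHCb tool.

```python
import zlib


def adler32_hex(path, chunk_size=1 << 20):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def find_mismatches(expected):
    """Compare files on disk against expected checksums.

    `expected` maps file path -> expected hex checksum (assumed format).
    Returns {path: (expected, actual)} for every file that differs.
    """
    mismatches = {}
    for path, want in expected.items():
        got = adler32_hex(path)
        if got != want.lower():
            mismatches[path] = (want, got)
    return mismatches
```

An empty result from `find_mismatches` would indicate that every listed file on the server matches the experiment's record.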

Advance warning:

  • Tuesday 5th January: Test of UPS bypass. Databases will be stopped, so this will include a stop of services (Castor, LFC, FTS).
  • Establishing further plans for January.

Table showing entries in GOC DB starting between 16th and 23rd December.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, ce-ngs, lcgce02 | UNSCHEDULED | OUTAGE | 22/12/2009 12:55 | 22/12/2009 16:00 | 3 hours and 5 minutes | Following this morning's intervention we have a problem on the Castor GEN instance that is being investigated.
All castor and CEs | SCHEDULED | OUTAGE | 22/12/2009 08:00 | 22/12/2009 12:53 | 4 hours and 53 minutes | Stop of Castor for reboots of nodes (including disk servers). This also requires a stop of the batch system.
FTS | SCHEDULED | OUTAGE | 22/12/2009 07:00 | 22/12/2009 13:00 | 6 hours | Stop of FTS during Castor outage. Includes a drain of the FTS transfers ahead of the intervention.
All Castor | SCHEDULED | AT_RISK | 21/12/2009 12:00 | 21/12/2009 13:00 | 1 hour | At Risk for Castor during upgrade of the hardware behind the Castor Information Provider (CIP) and to start publishing T2K data.
lcgvo0597 | SCHEDULED | OUTAGE | 17/12/2009 10:00 | 31/01/2010 18:00 | 45 days, 8 hours | This SL4 VOBOX machine is being replaced by a SL5 VOBOX (lcgvo-alice). Created a long scheduled d/time and then will be removed from GOCDB.
All Castor and CEs | UNSCHEDULED | AT_RISK | 17/12/2009 09:00 | 17/12/2009 10:00 | 1 hour | Castor at-risk for upgrading the LSF license hosts.
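
The Duration values in the table can be cross-checked against the Start and End times. The following is a minimal sketch, assuming the DD/MM/YYYY HH:MM timestamp format shown in the table. The function name `downtime_length` is illustrative and is not part of any GOC DB tool.

```python
from datetime import datetime

# Timestamp layout assumed from the table above, e.g. "22/12/2009 12:55".
TIMESTAMP_FORMAT = "%d/%m/%Y %H:%M"


def downtime_length(start, end):
    """Return the length of a downtime as (days, hours, minutes)."""
    delta = (datetime.strptime(end, TIMESTAMP_FORMAT)
             - datetime.strptime(start, TIMESTAMP_FORMAT))
    # timedelta stores whole days separately from the leftover seconds.
    hours, leftover_seconds = divmod(delta.seconds, 3600)
    return delta.days, hours, leftover_seconds // 60


# Unscheduled Castor GEN outage on 22nd December.
print(downtime_length("22/12/2009 12:55", "22/12/2009 16:00"))
# Long scheduled downtime for lcgvo0597, spanning the year boundary.
print(downtime_length("17/12/2009 10:00", "31/01/2010 18:00"))
```

Applied to the entries above, this reproduces the listed durations, including the 45 days and 8 hours for lcgvo0597.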