Tier1 Operations Report 2009-12-02


RAL Tier1 Operations Report for 2nd December 2009.

This is a review of issues since the last meeting on 25th November.

Review of Issues during week 25th November to 2nd December.

  • One of the two nodes in the Oracle RAC that provides the 3D and lhcb-lfc services has shown memory errors. There was an unscheduled "At Risk" on the 3D systems on Tuesday (1st December) to investigate this. However, the "At Risk" was not visible during the afternoon, when the GOC DB went into fail-over mode as a result of a problem. During the "At Risk", services continued as usual, but there would have been a service interruption had the remaining node failed during the intervention.
  • The repack exercise on ALICE tapes has revealed a problem with three files, which have been declared lost. (The files were each on a different tape.) There are no more tapes to repack for ALICE in the short term, and this is not believed to indicate a much wider problem. Repack is continuing for ATLAS, and no further problems have been found.
  • There was a power cut at CERN last night (the night of 1st/2nd December). We are shown as failing some of the CE SAM tests during that period.

Current operational status and issues.

  • Double disk failure on gdss138, part of the LHCb_Dst space token (D1T0), with resulting data loss. This happened early on Monday morning (at around 05:30 and 06:00). The disks have been replaced, but during tests a further disk error was found. This disk will also be replaced and the server given more extensive tests (lasting around a week) before being returned to production.
  • Long-standing database disk array problem: the test of the UPS bypass has been delayed and will now be scheduled for early January. We now plan to turn off critical services and databases so that we are better able to recover in a controlled manner should there be a problem during the test.
  • There is a problem with Castor disk-to-disk copies for LHCb from the LHCbUser Service Class. This is still under investigation.
  • A mismatch between tape contents and Castor meta-data is being investigated. This dates from 2007 and has been found for CMS data. So far investigations have not found other evidence of this problem. It is still believed to affect only a small number of files.

Advance warning:

  • None

Table showing entries in GOC DB starting between 25th November and 2nd December.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lhcb-lfc, 3D (lugh, ogma) | UNSCHEDULED | AT_RISK | 01/12/2009 12:09 | 01/12/2009 17:00 | 4 hours and 51 minutes | At-Risk to investigate the memory problems on Oracle RAC node behind these services.
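
The duration in the table above can be cross-checked from the start and end times. The following is a minimal sketch only: the helper name `downtime_duration` is hypothetical, and the timestamp layout (day/month/year, 24-hour clock) is assumed from how the entry appears in this report, not taken from any GOC DB interface.

```python
from datetime import datetime

# Assumed timestamp layout, as printed in the table: day/month/year, 24-hour clock.
FMT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> str:
    """Return the elapsed time between two timestamps as 'H hours and M minutes'.

    Hypothetical helper for illustration; it is not part of any GOC DB tooling.
    """
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    total_minutes = int(delta.total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours and {minutes} minutes"


# Start and end times of the unscheduled "At Risk" listed in the table.
print(downtime_duration("01/12/2009 12:09", "01/12/2009 17:00"))
```

For the entry listed, this reproduces the stated duration of 4 hours and 51 minutes.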