Tier1 Operations Report 2011-02-09


RAL Tier1 Operations Report for 9th February 2011

Review of Issues during the week from 2nd to 9th February 2011.

  • On Thursday (3rd Feb) GDSS66 (CMSFarmRead) was taken out of service as it was reporting SCSI errors. There were no migration candidates on the server. It was returned to service on Tuesday (8th).
  • During the evening of Monday/Tuesday 7/8 Feb there was a problem with the BDII service on one of the pair of site BDII nodes.
  • On Tuesday (8th Feb) GDSS84 (CMSFarmRead) was taken out of production to investigate memory errors. There are two un-migrated files on it.
  • On Tuesday (8th Feb) GDSS94 (CMSWanIn) was taken out of production as a RAID disk array rebuild was proceeding very slowly. It was returned to production this morning (9th Feb).
  • On Tuesday (8th Feb) T2K reported a problem accessing the LFC. This was a certificate issue: the notification of a certificate update had not been picked up.
  • Changes:
    • An update to the gridFTP component within Castor, applied initially to an LHCb service class and then (on Monday 7th) to all of Castor, has resolved the checksumming problem that was causing LHCb tape migrations to block after failed (i.e. incomplete) transfers of files into Castor. It also resolved a specific bug, seen by Atlas, affecting checksums that contained a particular pattern of zeros (see the checksum sketch after this list).
    • On 3rd February there was an At Risk on Castor while the system that manages configurations on Castor servers (Puppet) was successfully updated to a new version and a new central server.
    • On Tuesday 8th February modified network (WAN tuning) parameters were rolled out for the CMSWanIn and CMSWanOut service classes to improve wide area transfer rates (see the TCP tuning sketch after this list).
    • Today (Wednesday 9th Feb) there is an outage on the LFC, FTS and 3D services for an Oracle database update (10.2.0.5).
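
The report does not spell out the "pattern of zeros" checksum bug. As an illustration only, the sketch below assumes it was of the common class where a checksum whose hexadecimal form has leading zeros is stored without padding and then compared as a plain string, so two representations of the same value fail to match. The values and helper names are made up, and Python is used purely for illustration.

 import zlib

 def adler32_hex(data):
     """Adler32 of the data as a fixed-width, zero-padded 8-digit hex string."""
     return format(zlib.adler32(data) & 0xFFFFFFFF, "08x")

 def checksums_match(stored, computed):
     """Compare checksum strings numerically, so '00c3a1f2' and 'c3a1f2' agree."""
     return int(stored, 16) == int(computed, 16)

 # Two renderings of the same (made-up) checksum value: one zero-padded, one not.
 stored_by_client = "00c3a1f2"
 recorded_in_castor = "c3a1f2"

 print(adler32_hex(b"example payload"))                        # always 8 hex digits
 print(stored_by_client == recorded_in_castor)                 # False: naive string comparison
 print(checksums_match(stored_by_client, recorded_in_castor))  # True: compared as numbers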
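
The specific network parameters rolled out to the CMSWanIn and CMSWanOut servers are not listed in this report. Purely as a sketch, wide area transfer tuning on Linux disk servers usually involves the kernel's TCP buffer and congestion control sysctls; the snippet below only reads a few of the standard files so that settings can be recorded before and after such a change. The file list is an assumption, not the set actually modified here.

 import os

 # Standard Linux sysctl files commonly involved in wide area TCP tuning.
 # This list is illustrative; it is not the set of parameters changed at RAL.
 TCP_SYSCTLS = [
     "/proc/sys/net/core/rmem_max",
     "/proc/sys/net/core/wmem_max",
     "/proc/sys/net/ipv4/tcp_rmem",
     "/proc/sys/net/ipv4/tcp_wmem",
     "/proc/sys/net/ipv4/tcp_congestion_control",
 ]

 def dump_tcp_settings(paths=TCP_SYSCTLS):
     """Return the current value of each sysctl file that exists on this host."""
     settings = {}
     for path in paths:
         if os.path.exists(path):
             with open(path) as handle:
                 settings[path] = handle.read().strip()
     return settings

 if __name__ == "__main__":
     for path, value in dump_tcp_settings().items():
         print(path, "=", value)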

Current operational status and issues.

  • As reported last week, a tape fault was discovered that has resulted in the loss of 78 LHCb files. The fault was caused by a faulty tape drive that overwrote part of the data on the tape. A Post Mortem for this incident is in preparation at:
 https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20110202_Tape_Data_Loss_LHCb
  • GDSS496 (CMSFarmRead) remains out of production following a problem on Thursday 6th January. A Post Mortem has been prepared for this failure. See:
 https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20110106_CMS_Disk_Server_GDSS496_Data_Loss
  • A Post Mortem has also been prepared for GDSS283 (CMSFarmRead), which failed on 25th December with the loss of 30 CMS files. See:
 https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20101225_CMS_Disk_Server_GDSS283_Data_Loss

A replacement server has been prepared to go into CMSFarmRead and should go into service today (9th Feb).

  • We are aware of a problem with the Castor Job Manager that can occasionally hang; the service recovers by itself after around 30 minutes. An automated capture of more diagnostic information is in place ready for the next occurrence (see the watchdog sketch below).
  • We have previously reported a problem with disk servers becoming unresponsive. We believe we understand the cause and are testing a fix, which has been rolled out to all CMS disk servers; both the problem and the fix are still being monitored.
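
The report does not say what the automated diagnostic capture for the Job Manager hangs consists of. The snippet below is only a sketch of the general approach: a hypothetical watchdog polls a service port and, when it stops responding, snapshots some process and socket state for later analysis. The host, port, commands and output path are all invented for illustration and are not the actual mechanism in place.

 import socket
 import subprocess
 import time
 from datetime import datetime

 SERVICE_HOST = "localhost"
 SERVICE_PORT = 15011      # hypothetical Job Manager port, for illustration only
 CHECK_INTERVAL = 60       # seconds between probes

 def service_responding(host, port, timeout=5.0):
     """Return True if a TCP connection to the service can be opened."""
     try:
         with socket.create_connection((host, port), timeout=timeout):
             return True
     except OSError:
         return False

 def capture_diagnostics():
     """Dump process and socket listings to a timestamped file for later analysis."""
     stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
     with open("/tmp/jobmanager-hang-%s.txt" % stamp, "w") as out:
         for cmd in (["ps", "auxww"], ["netstat", "-tn"]):
             out.write("== " + " ".join(cmd) + " ==\n")
             out.write(subprocess.run(cmd, capture_output=True, text=True).stdout)

 if __name__ == "__main__":
     while True:
         if not service_responding(SERVICE_HOST, SERVICE_PORT):
             capture_diagnostics()
         time.sleep(CHECK_INTERVAL)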

Declared in the GOC DB

  • Wednesday 16th Feb. At Risk for the first part of the update of the FTS back-end (which runs the FTS agents) to a Quattorised machine. (This will be completed about a week later.)

Advanced warning:

The following items are being discussed and are still to be formally scheduled:

  • Update WAN tuning parameters on remaining disk servers (Rest of CMS - Feb 15th, All others - March 1st.)
  • FTS re-configuration to make use of channel groups (for CMS). Probably Wednesday 2nd March.
  • Upgrade to 64-bit OS on Castor disk servers to resolve checksumming problem for GEN instance. Proposed for Tuesday 15th Feb.
  • Castor Name Server upgrade - Possibly Tuesday 1st March.
  • Castor upgrades (possibly directly to 2.1.10) - Possibly late March (27/28/29 during technical stop.)
  • Switch Castor to new Database Infrastructure - Possibly end March.
  • Increase shared memory for OGMA, LUGH & SOMNUS (rolling change). (OGMA done).
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Firmware updates for central networking components (likely to have some short network breaks - maybe in March)
  • Small isolating transformers have been received that should eliminate the electrical noise seen by the disk arrays hosting the Oracle databases. An opportunity needs to be found to make use of these.

Entries in GOC DB starting between 2nd and 9th February 2011.

There were no unscheduled entries this last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgftm, lcgfts, lfc-atlas.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk, lhcb-lfc.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 09/02/2011 09:00 | 09/02/2011 16:00 | 7 hours | Outage on LFC, FTS and 3D services while Oracle databases are updated to version 10.2.0.5.
srm-cms | SCHEDULED | AT_RISK | 08/02/2011 09:00 | 08/02/2011 16:00 | 7 hours | At Risk to roll out changes to WAN tuning for disk servers in cmsWanIn and cmsWanOut.
srm-atlas, srm-cms, srm-lhcb | SCHEDULED | AT_RISK | 07/02/2011 10:00 | 07/02/2011 12:00 | 2 hours | At Risk during roll-out of gridFTP upgrade to ATLAS, CMS and the remaining LHCb disk servers to resolve checksum problem.
All Castor (SRM endpoints) | SCHEDULED | AT_RISK | 03/02/2011 09:00 | 03/02/2011 16:00 | 7 hours | At Risk to reconfigure all disk servers to use new Puppet server.

This version of the report was incorrectly edited the following week, and has since been reverted.