Tier1 Operations Report 2010-10-06

RAL Tier1 Operations Report for 6th October 2010

Review of Issues during the week from 29th September to 6th October 2010.

  • Significant errors were reported by disk server GDSS408 (MCDisk) on Thursday 30th Sep. It was taken out of production in the morning and a faulty disk replaced. The system was returned to production in 'draining' (read only) mode that afternoon, and to full production on Friday (1st Oct.)
  • The problem with globus-url-copy not working from CERN onto SL5 64-bit disk servers at RAL is believed to have been related to corrupted shared memory on the ATLAS instance. This was fixed yesterday (Tuesday 5th) and the remaining ATLAS SL5 disk servers are now being deployed to production.
  • The upgrade of the LHCb Castor instance to version 2.1.9 was completed successfully. A remaining issue with new functionality (checksumming of files on disk) is being worked on; a sketch of this style of check appears after this list.
  • On Thursday 30th a number of ATLAS & CMS disk servers running SL5 were rebooted to pick up a new kernel. Batch jobs were paused around the time of the reboots, but some jobs are known to have failed.
  • The power outage planned for the Atlas building over the weekend was cancelled.
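
For reference on the checksumming item above: the new Castor functionality compares a checksum computed from the copy of a file on disk against the value recorded for it. The fragment below is a minimal sketch of how such an on-disk checksum can be computed, assuming Adler-32 (the algorithm commonly used for this purpose); the file path and the recorded value are placeholders, not taken from the RAL configuration.

    import zlib

    def adler32_of_file(path, chunk_size=1 << 20):
        """Compute the Adler-32 checksum of a file, reading it in chunks so
        that large disk-server files do not have to fit in memory."""
        checksum = 1  # Adler-32 is seeded with 1
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                checksum = zlib.adler32(chunk, checksum)
        return checksum & 0xFFFFFFFF  # report as an unsigned 32-bit value

    if __name__ == "__main__":
        recorded = 0x1A2B3C4D                            # placeholder recorded checksum
        actual = adler32_of_file("/data/example.dat")    # placeholder file path
        print("match" if actual == recorded
              else "MISMATCH: %08x != %08x" % (actual, recorded))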

Current operational status and issues.

  • After the LHCb Castor instance was upgraded, LHCb reported problems accessing their 3D databases. Investigations took place at the end of that week and the start of this one, and the problem has been traced to a configuration error by LHCb. We are now waiting for LHCb to apply the fix and re-launch their batch work at RAL. The RAL Tier1 was blacklisted by LHCb while this problem was ongoing.
  • CE01 has a problem and is currently in unscheduled downtime. It appears to get into a state where it is endlessly renewing proxies and cannot process other work.
  • GDSS280 (CMSFarmRead) had shown FSProbe errors and was taken out of production on Thursday 19th August. As reported last week, this server was returned to production on the morning of 15th Sep. The server again gave FSProbe errors and was taken back out of production the next day (16th). 30 un-migrated files were lost. A review of the problems encountered is being followed up via a post mortem; a simplified illustration of this style of disk check is sketched below.
  • Performance issues on Castor disk servers for LHCb: these are being kept under observation. Investigations were suspended during the Castor 2.1.9 upgrade but will resume once LHCb re-start running batch work here.
  • As reported at previous meetings, one power supply (of two) for one of the (two) EMC disk arrays behind the LFC/FTS/3D services was moved to UPS power to test whether the electrical noise has been reduced sufficiently. The test is ongoing, but some errors (roughly one per week) have been seen.
  • On Saturday 10th July transformer TX2 in R89 tripped. This is the same transformer that tripped some months ago and for which remedial work was undertaken. It was reported last week that the cause appeared to be related to temperature; however, further investigation suggests it is related to earth leakage detection. Two (of the four) transformers still require checking as part of the planned work following the first failure of TX2; one of these checks is planned for 18th October.
  • A rolling update of the five nodes in the top-bdii is taking place today (6th October) to resolve a disk partitioning issue.
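
The top-BDII update above is deliberately rolling: the five nodes are rebuilt one at a time so that the rest of the set keeps serving throughout. The fragment below is a minimal sketch of the kind of check that could gate each step, simply confirming that a node's LDAP port (2170, the standard BDII port) is answering before the next node is taken out of service; the hostnames are placeholders rather than the actual RAL node names.

    import socket
    import time

    # Placeholder hostnames; the real top-BDII node names are not listed here.
    NODES = ["bdii01.example.org", "bdii02.example.org", "bdii03.example.org",
             "bdii04.example.org", "bdii05.example.org"]
    BDII_PORT = 2170  # standard BDII LDAP port

    def port_is_open(host, port, timeout=5.0):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def wait_until_serving(host, retries=30, delay=10):
        """Poll a node until its BDII port answers, or give up."""
        for _ in range(retries):
            if port_is_open(host, BDII_PORT):
                return True
            time.sleep(delay)
        return False

    for node in NODES:
        # The rebuild of this node would be triggered here (out of scope for
        # this sketch); only move on once the node is serving again.
        if not wait_until_serving(node):
            raise SystemExit(node + " did not come back; stopping the rolling update")
        print(node + " is serving; moving on to the next node")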

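The FSProbe errors mentioned for GDSS280 above indicate that a file system returned data that did not match what the monitoring tool expected. The fragment below illustrates, in very simplified form, the general write/read/verify style of check such tools perform; it is not the actual fsprobe implementation, and the test path is a placeholder.

    import hashlib
    import os

    TEST_PATH = "/data/probe_test.dat"  # placeholder path on the file system under test
    BLOCK = os.urandom(1 << 20)         # 1 MiB of known (random) data

    def probe_once(path=TEST_PATH):
        """Write a known block, force it to disk, read it back and compare.

        A mismatch means the file system returned different data from what
        was written -- the kind of event that FSProbe-style tools report."""
        expected = hashlib.sha1(BLOCK).hexdigest()
        with open(path, "wb") as f:
            f.write(BLOCK)
            f.flush()
            os.fsync(f.fileno())        # ensure the data reaches the disk
        with open(path, "rb") as f:
            actual = hashlib.sha1(f.read()).hexdigest()
        return expected == actual

    if __name__ == "__main__":
        if probe_once():
            print("probe OK")
        else:
            print("PROBE ERROR: data read back differs from data written")
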
Declared in the GOC DB

  • Today (6th Oct) - At Risk on Top-BDII for rolling upgrade of the five nodes.
  • Monday 18th October - Site At Risk for checks on one of R89 transformers.
  • Wednesday 20th October - Site At Risk for UPS maintenance.

Advanced warning:

The following items remain to be scheduled/announced:

  • TBD. Outage on LFC/FTS/3D for kernel updates.
  • Tuesday 12th October. At Risk on Site BDII for updates.
  • Monday 13th December (just after LHC 2010 run ends): UPS test.
  • Castor Upgrade to 2.1.9.
    • Upgrade Gen (including ALICE) - during the week beginning 25 October
    • Upgrade CMS - during the week beginning 8 November
    • Upgrade ATLAS - during the week beginning 22 November

Entries in GOC DB starting between 29th September and 6th October 2010.

There was only one unscheduled entry in the GOC DB for the last week; this was for a problem on one of the CEs (CE01).

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgbdii.gridpp.rl.ac.uk (Top-BDII) | SCHEDULED | AT_RISK | 06/10/2010 09:00 | 06/10/2010 17:00 | 8 hours | At Risk while the five nodes in the top-bdii set are rebuilt one after the other to resolve a problem with disk partitioning.
lcgce01 | UNSCHEDULED | OUTAGE | 06/10/2010 05:37 | 07/10/2010 13:00 | 1 day, 7 hours and 23 minutes | System is not submitting jobs to batch system.
lcgic01 (RGMA-IC), lcgmon01 | SCHEDULED | OUTAGE | 01/10/2010 12:00 | 01/10/2010 13:45 | 1 hour and 45 minutes | Outage of these nodes during a planned power outage to the building where they are situated.
lcgvo0428, lcgvo0599 (Old CMS VO boxes) | SCHEDULED | OUTAGE | 01/10/2010 12:00 | 08/10/2010 12:00 | 7 days | Outage of these nodes for decommissioning. The downtime also includes a planned power outage to the building where they are situated.
srm-atlas | SCHEDULED | OUTAGE | 30/09/2010 09:50 | 30/09/2010 10:30 | 40 minutes | Short outage while some disk servers are rebooted to pick up latest kernels.
srm-lhcb | SCHEDULED | OUTAGE | 27/09/2010 08:00 | 29/09/2010 12:20 | 2 days, 4 hours and 20 minutes | Outage during Castor 2.1.9 upgrade for LHCb instance.