Tier1 Operations Report 2010-09-29

From GridPP Wiki

RAL Tier1 Operations Report for 29th September 2010

Review of Issues during the week from 22nd to 29th September 2010.

  • On Tuesday 21st September disk server GDSS364 (CMS Temp) was found to have a hardware problem. It was returned to production in draining (read-only) mode at the end of that afternoon pending further diagnostics. It was then put back in full production on Thursday 23rd.
  • On Sunday evening GDSS405 failed. This was checked on Monday morning. Some memory was replaced and the server was returned to production in 'draining' mode on Monday. It was put back in full production on Tuesday. While the server was in 'draining' mode, Atlas reported FTS file transfer failures.
  • There was a problem with very high load on CMS during the night of Monday/Tuesday (27/28 Sep.), causing transfer failures.

Current operational status and issues.

  • GDSS280 (CMSFarmRead) had shown FSProbe errors and was taken out of production on Thursday 19th August. As reported last week, this server was returned to production on the morning of 15 Sep. The server again gave FSProbe errors and was taken back out of production the next day (16th). 30 un-migrated files were lost. A review of the problems encountered is being followed up via a post mortem.
  • globus-url-copy does not work from CERN to SL5 64-bit disk servers at RAL. A fix (downgrading globus-gridftp-server from 1.10.1 to 1.8.1) is being tested on several servers.
  • Performance issues on Castor Disk Servers for LHCb: This is being kept under observation. Investigations have been suspended during the Castor 2.1.9 upgrade but will be resumed in due course.
  • We are aware of, and are taking appropriate action regarding, a recent security alert. Many systems have been patched. Access to UIs was restricted for some time.
  • As reported at previous meetings, one power supply (of two) for one of the (two) EMC disk arrays behind the LSC/FTS/3D services was moved to UPS power to test whether the electrical noise has been reduced sufficiently. The test is ongoing but some errors (roughly once/week) have been seen.
  • On Saturday 10th July transformer TX2 in R89 tripped. This is the same transformer that tripped some months ago, and for which remedial work was undertaken. I reported last week that the cause appeared to be related to temperature. However, further investigations suggest the cause is related to earth leakage detection. Two (of the four) transformers still require checking as part of planned work following the first failure of TX2. One of these checks is planned for 18th October.
  • The upgrade of the LHCb Castor instance to version 2.1.9 is ongoing. User (VO) testing is underway.

Declared in the GOC DB

  • Mon 27 - Wed 29 Sep: (Ongoing at time of meeting). Castor upgrade on LHCb.
  • Thursday 30th - Short outage on srm-atlas for some disk server reboots.
    • Some CMS servers will also be rebooted. In both cases batch jobs will be paused around the time of the reboots.
  • Friday 1st Oct (midday) to Monday 4th Oct (midday) outage on lcgic01 (RGMA) and our 'mon' box. (Power outage in Atlas building.)
  • Saturday 2nd Oct - Monday 4th October: At Risk on site. (Power outage in Atlas building.)
    • A small number of additional services not in the GOC DB will be down, e.g. the BaBar SQL server plus some services internal to the Tier1.

Advance warning:

The following items remain to be scheduled/announced:

  • Both of the following were scheduled for an LHC technical stop before the dates for that stop were moved.
    • Wednesday 20th October (during LHC technical stop): UPS maintenance.
    • Monday 18th October checks on one of R89 transformers.
  • Monday 13th December (just after LHC 2010 run ends): UPS test.
  • Castor Upgrade to 2.1.9.
    • Upgrade Gen (including ALICE) - during the week beginning 25 October
    • Upgrade CMS - during the week beginning 8 November
    • Upgrade ATLAS - during the week beginning 22 November

Entries in GOC DB starting between 22nd and 29th September 2010.

There were no unscheduled entries in the GOC DB during the last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-lhcb | SCHEDULED | OUTAGE | 27/09/2010 08:00 | 29/09/2010 18:00 | 2 days, 10 hours | Outage during Castor 2.1.9 upgrade for LHCb instance.
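
The Duration value in the entry above can be cross-checked from its Start and End timestamps. The following is a minimal Python sketch, assuming the timestamps use the DD/MM/YYYY HH:MM layout shown in the table; the names GOCDB_FORMAT, outage_duration and format_duration are hypothetical and are not part of any GOC DB tooling.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout, taken from the table entry (DD/MM/YYYY HH:MM).
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def outage_duration(start: str, end: str) -> timedelta:
    """Return the elapsed time between two timestamps in the assumed format."""
    begin = datetime.strptime(start, GOCDB_FORMAT)
    finish = datetime.strptime(end, GOCDB_FORMAT)
    if finish < begin:
        raise ValueError("end timestamp is earlier than start timestamp")
    return finish - begin


def format_duration(delta: timedelta) -> str:
    """Render a duration as whole days and whole hours, as in the table."""
    hours = delta.seconds // 3600
    return f"{delta.days} days, {hours} hours"


if __name__ == "__main__":
    # Start and End values of the srm-lhcb entry.
    delta = outage_duration("27/09/2010 08:00", "29/09/2010 18:00")
    print(format_duration(delta))  # prints "2 days, 10 hours"
```

For the srm-lhcb entry this reproduces the 2 days, 10 hours given in the Duration column.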