Tier1 Operations Report 2010-08-11

From GridPP Wiki

RAL Tier1 Operations Report for 11th August 2010

Review of Issues during the week from 4th to 11th August 2010.

  • gdss475 (LHCbUser) failed with a suspected hardware fault on 4th August. After checks found no fault, it was returned to service the next day (5th August).
  • gdss472 (LHCbMDst) failed on Saturday morning (7th August). The Fabric on-call came on site and, after checking the system over, returned it to production that afternoon.
  • A problem at Oxford led to many spurious SAM test failures for the Tier1 (in common with other UK sites) over the weekend.
  • Dust in the Computer Room - Remedial work on lagging pipes is almost complete (it should be finished this week). Once completed, some cleaning will be carried out. Monitored dust levels have dropped.

Current operational status and issues.

  • gdss417 (Atlas MCDisk) failed early on 1st August and was reported to Atlas as a data loss. Last week, using further information received, it was possible to recover the data and re-establish the RAID array. However, when the server was returned to production it became unstable. While it was out of production, important files were copied off the server and re-inserted into Castor; a total of around 4,000 files (including the ones Atlas described as important) were recovered this way. A further attempt to drain the server resulted in it crashing. Finally, 43,000 files were declared lost to Atlas. A Post Mortem is being produced.
  • gdss381 (CMSTemp) failed on Monday (9th August) with a read-only file system, and has since shown FSProbe errors. Investigations are ongoing.
  • As reported at previous meetings, one power supply (of two) for one of the (two) EMC disk arrays behind the LSC/FTS/3D services was moved to UPS power, as a test of whether the electrical noise has been reduced sufficiently. The test is ongoing, but some errors (roughly once per week) have been seen.
  • On Saturday 10th July, transformer TX2 in R89 tripped. This is the same transformer that tripped some months ago, and for which remedial work was undertaken. The cause of the trip appears to be related to temperature, but it is not clear whether this was a real over-temperature or a sensor problem. Investigations are ongoing.
  • There has been some discussion over batch scheduling, as regards long jobs (Alice in this case) filling the farm during a 'quiet' period and preventing other VOs' jobs from starting until those jobs have finished. We are looking at whether this could be improved.
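The fairness problem above is typically tackled with fair-share targets and per-group caps in the batch scheduler. As a purely illustrative sketch (assuming a Maui-style scheduler; the group names and numbers are hypothetical, not the actual Tier1 configuration), a cap on one VO's running slots plus fair-share targets would stop long jobs from occupying the whole farm:

```
# Hypothetical Maui configuration fragment (illustrative values only).
# FSTARGET sets each group's fair-share target percentage; MAXPROC caps
# the processors one group can hold, so long Alice jobs submitted in a
# quiet period cannot fill the entire farm.
GROUPCFG[alice]   FSTARGET=15  MAXPROC=1000
GROUPCFG[atlas]   FSTARGET=40
GROUPCFG[lhcb]    FSTARGET=25

# Fair-share usage is accumulated over 7 one-day windows, with older
# windows decayed, so recent heavy use lowers a group's priority.
FSPOLICY          DEDICATEDPS
FSDEPTH           7
FSINTERVAL        24:00:00
FSDECAY           0.80
```

With such a cap in place, jobs from other VOs arriving later would still find free slots rather than queueing behind the long-running jobs.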

Declared in the GOC DB

  • Thursday 12th August - Thursday 19th August - lcgwms03 - downtime for draining and glite update.
  • Tuesday 17th August - Top-BDII - At Risk for glite update.

Advance warning:

The following items remain to be scheduled:

  • lcgwms01 (4-day drain period + upgrade to glite-WMS 3.1.29-0) - Fri 2 Sep to Thu 9 Sep
  • lcgwms02 (4-day drain period + upgrade to glite-WMS 3.1.29-0) - Fri 10 Sep to Thu 16 Sep
  • Weekend Power Outage in Atlas building - late September or early October (TBD). Currently planning the necessary moves to ensure continuity for any services still using equipment in the Atlas building.
  • Doubling of network link to network stack for tape robot and Castor head nodes.
  • Re-visit the SAN / multipath issue for the non-castor databases.
  • Update firmware in RAID controller cards for a batch of disk servers.

Entries in GOC DB starting between 4th and 11th August 2010.

There were no entries in the GOC DB for the last week.