Tier1 Operations Report 2010-06-30

From GridPP Wiki

RAL Tier1 Operations Report for 30th June 2010

Review of Issues during week from 23rd to 30th June 2010.

  • Problem with CMS tape migrations has been resolved by creating a new D1T0 service class called CMSTemp within Castor.
  • An "At Risk" on the Tier1 site took place from Monday to today (28-30 June) for maintenance work on transformers.
  • One power supply (of two) for one of the (two) EMC disk arrays behind the LSC/FTS/3D services has been moved to UPS power to test whether the electrical noise has been reduced sufficiently.
  • GDSS220 (LHCbDst) had memory problems and was out of service for a while last Wednesday (23rd June).
  • GDSS539 (CMSFarmRead) was unavailable from Friday 11th June. It was returned to production on the 23rd June and closely monitored. It was one of a handful of newly deployed disk servers with SL5 & XFS that had a configuration problem.
  • On Friday (25th) a problem that prevented Alice from writing data into Castor via SRM was identified and fixed.
  • Transfers to/from AtlasScratchDisk failed on Thursday (24th June) due to a problem with gdss547. This was fixed by restarting daemons.
  • On Monday 24th May gdss207 (AliceTape) was removed from service due to possible file system corruption. This system was returned to service just before this meeting (30th June).

Current operational status and issues.

  • On Thursday 3rd June gdss474 (LHCbMDst) failed with a read-only file system. Work is ongoing to resolve hardware issues on this system.
  • GDSS67 (CMSFarmRead) had problems (FSPROBE errors) and was removed from service on the morning of Thursday 20th May. The memory in this system was replaced but it then had file system problems. These were resolved and the server was returned to production on Friday (18th June). However, on Wednesday (23rd) it showed FSPROBE errors and was taken out of production again. It has been worked on and is undergoing acceptance tests again.
  • Dust in the Computer Room - particularly the HPD room: Still awaiting the fitting of a membrane around the lagging to resolve the issue.
  • Slow file transfers between RAL and SARA are being investigated. A timeout in FTS has been increased and the network 'WAN' tuning on disk servers is being updated.

Declared in the GOC DB

  • Preventative maintenance work on transformers in R89 is being carried out in two phases. The first has been completed. For the second, an "At Risk" on the Tier1 site has been declared.
    • Monday - Thursday 19-22 July. (New dates for LHC Technical Stop).

Advance warning:

The following items remain to be scheduled:

  • Closure of SL4 batch workers at RAL-LCG2 announced for the start of August.
  • Update to microcode on tape robots scheduled for Tuesday 20th July (during LHC technical stop).
  • Doubling of network link to network stack for tape robot and Castor head nodes.

Entries in GOC DB starting between 23rd and 30th June 2010.

There were no unscheduled outages during the last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole Site | SCHEDULED | AT_RISK | 28/06/2010 08:30 | 30/06/2010 17:00 | 2 days, 8 hours and 30 minutes | At Risk for site during maintenance work on electrical supply (transformers).
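
The duration in the entry above can be checked from its start and end timestamps. The following is a minimal sketch in Python, assuming the times are in day/month/year order as displayed here; the helper name `downtime_duration` and the constant `GOCDB_FORMAT` are hypothetical and not part of any GOC DB tooling.

```python
from datetime import datetime, timedelta

# Assumed display format of the timestamps in the table (day/month/year).
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> timedelta:
    """Return the elapsed time between two timestamps in GOCDB_FORMAT."""
    return datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)


# Start and end of the "At Risk" entry listed above.
duration = downtime_duration("28/06/2010 08:30", "30/06/2010 17:00")

# Split the sub-day remainder into whole hours and minutes.
hours, remainder = divmod(duration.seconds, 3600)
print(f"{duration.days} days, {hours} hours and {remainder // 60} minutes")
# prints "2 days, 8 hours and 30 minutes"
```

This agrees with the duration recorded for the entry.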