Tier1 Operations Report 2010-06-23

RAL Tier1 Operations Report for 23rd June 2010

Review of Issues during week from 16th to 23rd June 2010.

  • One of the WMSs (WMS02) was unavailable from Wednesday to Thursday (16 to 17 June) owing to a corrupt database.
  • GDSS390 (AtlasDataDisk) was out of production for a few hours on Friday 18th June while a fan was changed.
  • GDSS239 (AtlasHotDisk) was out of production for a few hours on Monday 21st June while faulty memory was replaced.

Current operational status and issues.

  • On Monday 24th May gdss207 (AliceTape) was removed from service due to possible file system corruption. The system is not yet back in service and is being rebuilt.
  • On Thursday 3rd June gdss474 (LHCbMDst) failed with a read only file system. Work is ongoing to resolve hardware issues on this system.
  • GDSS539 (CMSFarmRead) has been unavailable since Friday 11th June. It was one of the newly deployed disk servers with SL5 & XFS that had a configuration problem.
  • GDSS67 (CMSFarmRead) had problems (FSPROBE errors) and was removed from service on the morning of Thursday 20th May. The memory in this system was replaced but it then had file system problems. These were resolved and the server was returned to production on Friday (18th June). However, on Wednesday (23rd June) it showed FSPROBE errors again and was taken out of production.
  • GDSS220 (LHCbDst) had memory problems and was taken out of service this morning (Wednesday 23rd June). It is expected back into production around the time of the meeting.
  • Dust in the Computer Room, particularly the HPD room: As reported previously, this comes from lagging on cold water pipes, which is abrading in the high airflow in some locations close to the CRAC (Computer Room Air Conditioning) units. A solution has been found (fitting a membrane around the lagging) and we are waiting for this work to be undertaken.
  • Outstanding problem with CMS tape migrations, being looked at by Castor & CMS people.

Declared in the GOC DB

  • Preventative maintenance work on transformers in R89 in two phases. In both cases an At Risk on the Tier1 site has been declared.
    • Monday - Wednesday 28-30 June. (Old dates for LHC Technical Stop).
    • Monday - Thursday 19-22 July. (New dates for LHC Technical Stop).

Advance warning:

The following items remain to be scheduled:

  • Closure of SL4 batch workers at RAL-LCG2 announced for the start of August.
  • Update to microcode on tape robots.
  • At Risk for re-connecting one of the power supplies to one of the EMC disk arrays (hosting the LFC, FTS and 3D databases) to UPS power.
  • Doubling of network link to network stack for tape robot and Castor head nodes.

Entries in GOC DB starting between 16th and 23rd June 2010.

There was one unscheduled outage during the last week, which was a problem on WMS02.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgwms02 | UNSCHEDULED | OUTAGE | 16/06/2010 16:45 | 17/06/2010 09:39 | 16 hours and 54 minutes | We are seeing some problems with job submission through this WMS to CREAM CEs. This is being investigated.
FTS, FTM | SCHEDULED | AT_RISK | 16/06/2010 08:00 | 16/06/2010 12:00 | 4 hours | At Risk during update to FTS version 2.2.4.