Tier1 Operations Report 2011-10-19


RAL Tier1 Operations Report for 19th October 2011

Review of Issues during the week 12th to 19th October 2011.

  • The problem previously reported with the 11kV feed into R89 has been fixed. As reported before, a transparent intervention had been made on this. Since then, tests have put load on the relevant section of the bus-bar and regular checks have been made to confirm that the partial discharge (which can be heard with specialist equipment) has not returned. A recent inspection confirmed that everything was OK and the problem is now deemed fixed.
  • This morning, Wednesday 19th Oct, there was a problem with one of the database nodes behind the LFC/FTS services. This caused a problem for the FTS service, which was unavailable from around 10:30 to midday. An outage was declared in the GOC DB for the FTS service. There was only a transient (few-minute) interruption to the LFC service, as can be seen from the log files.
  • WMS02 was out of production for most of the week for maintenance work to resolve a problem with its database growing too large.

Resolved Disk Server Issues

  • gdss353 (LHCbDst, D1T0) was out of production during the working day on Monday 17th Oct following a double disk failure.

Current operational status and issues.

  • Atlas reports slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetric performance). CMS seems to experience this as well (but between RAL and foreign T2s). The pattern of asymmetric flows appears complex and is being actively investigated. We are in contact with people in the US investigating a similar issue. The types of things being studied include disk server TCP/IP configurations; a sketch of the kind of settings involved is given after this list.
  • We continue to work with the perfSONAR network test system, including investigating some anomalies that have been seen.
  • A Post Mortem has been produced for the problems seen a few weeks ago with the database behind the Atlas Castor instance. This can be seen at:

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20110927_Atlas_Castor_Outage_DB_Inconsistent
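
As an illustration of the kind of check involved in the TCP/IP investigation mentioned above (a minimal sketch only, not the actual procedure used on the disk servers), the script below reads a handful of standard Linux TCP sysctls; the particular parameters listed are an assumption made for this example.

  #!/usr/bin/env python
  # Minimal sketch: report the kernel TCP settings most commonly examined
  # when investigating WAN transfer rates. The selection of parameters is
  # an assumption for illustration, not the list being studied at RAL.

  import os

  SYSCTLS = [
      "net/core/rmem_max",                 # maximum receive socket buffer (bytes)
      "net/core/wmem_max",                 # maximum send socket buffer (bytes)
      "net/ipv4/tcp_rmem",                 # min/default/max TCP receive buffer
      "net/ipv4/tcp_wmem",                 # min/default/max TCP send buffer
      "net/ipv4/tcp_window_scaling",       # TCP window scaling enabled?
      "net/ipv4/tcp_congestion_control",   # congestion control algorithm in use
  ]

  def read_sysctl(name):
      """Return the current value of a sysctl, or None if it is not present."""
      path = os.path.join("/proc/sys", name)
      try:
          with open(path) as f:
              return f.read().strip()
      except IOError:
          return None

  if __name__ == "__main__":
      for name in SYSCTLS:
          value = read_sysctl(name)
          print("%-35s %s" % (name.replace("/", "."), value or "not available"))

Comparing the output of such a script between servers that show good and poor transfer rates is one simple way to spot configuration differences.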

Ongoing Disk Server Issues

  • Thursday 13th Sep: FSPROBE reported a problem on gdss396 (CMSWanIn). This resulted in the loss of a single CMS file (since copied in from elsewhere). The server has crashed under test and this is being followed up.
  • gdss296 (CMSFarmRead) has been out of production since 20th August. This server has also crashed under test and this is being followed up.
  • gdss456 (AtlasDataDisk) failed with a read-only file system on Wednesday 28th September. Following draining, investigations are ongoing for this system.
  • On Thursday (29th Sep) FSPROBE reported a problem on gdss295 (CMSFarmRead). This server has been put into test and investigations are ongoing.

Notable Changes made this last week

  • A gLite update has been applied to CE09.

Forthcoming Work & Interventions

  • Tuesday 1st November (TBC). Microcode updates for the tape libraries. No tape access from 09:00 to 13:00. (Delayed from Tuesday 18th Oct.)
  • Tuesday 1st November: There will be an intervention on the network link into RAL that will last up to an hour. This should not (in theory) affect our link. We plan to declare a site "Warning" and will drain out the FTS and pause batch work.

Declared in the GOC DB

  • None

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced:

  • There are plans to move part of the cooling system onto the UPS supply; this may require a complete outage (including systems on the UPS).
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure. This will only proceed once the problem that caused the cancellation of the first stage of this work last week is understood and fixed.
  • Networking change required to extend range of addresses that route over the OPN.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Further updates to CEs (CE06 de-commissioning; remaining gLite updates on CE09 outstanding).
  • Replace hardware running Castor Head Nodes (aimed for end of year).

Entries in GOC DB starting between 12th and 19th October 2011.

There was one unscheduled outage in this last week. This was for the FTS following the problem with the database behind that service this morning.

Service        | Scheduled?  | Outage/At Risk | Start            | End              | Duration                        | Reason
lcgftm, lcgfts | UNSCHEDULED | OUTAGE         | 19/10/2011 10:30 | 19/10/2011 12:30 | 2 hours                         | Problem with database behind FTS service. Now under investigation.
lcgwms02       | SCHEDULED   | WARNING        | 13/10/2011 16:00 | 19/10/2011 12:45 | 5 days, 20 hours and 45 minutes | Drain and MySQL maintenance
lcgce09        | SCHEDULED   | WARNING        | 13/10/2011 14:00 | 18/10/2011 16:05 | 5 days, 2 hours and 5 minutes   | Draining and gLite update