Tier1 Operations Report 2010-11-10

From GridPP Wiki

RAL Tier1 Operations Report for 10th November 2010

Review of Issues during the week from 3rd to 10th November 2010.

  • On Thursday 4th November disk server GDSS463 (LHCbDst) was taken out of production for a few hours while some cabling was changed. This server had shown some problems with disks in a particular slot.
  • On Saturday (6th Nov) one of the pair of nodes behind the site-bdii crashed; it was restarted on Monday. With one of the two nodes down, roughly half of the requests to the site-bdii failed over the weekend. The same node failed again on Tuesday and is now being investigated.
  • CE09 failed SAM tests overnight Sunday-Monday (7-8 Nov). The problem was resolved during Monday morning.
  • Some problems have been perceived with the WMSs. These have in part been monitoring issues (e.g. the Steve Lloyd tests). Work is ongoing to clarify the expected response times for the WMS service (i.e. how long jobs can be expected to remain in the various internal states within the WMS).

Current operational status and issues.

  • On Monday 8th November disk server GDSS398 (AtlasDataDisk) failed with FSProbe errors. Attempts to 'fsck' the disks made it clear that the file systems had major problems. All data on one of the three Castor partitions has been lost. Critical files are being manually copied off the other two partitions, but the server is very unstable and will not run Castor even in 'draining' mode. A Post Mortem is being written for this data loss incident.
  • During the night of 26-27 October disk server GDSS117 (CMSWanIn) failed with a read-only file system and was removed from production. Since then it has been re-running the acceptance tests prior to being returned to production.
  • GDSS280 (CMSFarmRead), which had reported FSProbe errors twice, is still out of production. It is now being replaced by GDSS289 (from cmsSpare). Once replaced, GDSS280 will be removed from production and will no longer be flagged as being in intervention.
  • The problem with the cooling of one of the power supplies on the tape robot was investigated during a downtime of the tape system on 2nd Nov. A further intervention will be required to fix the problem.
  • Performance issues on Castor disk servers for LHCb: this is being kept under observation. For the moment the problem has been superseded by issues with the throughput of the LHCb SRMs.
  • Testing of an EMC disk array with one of its power supplies connected to the UPS supply continues. Further discussions on removing the electrical noise have taken place and a solution is being prepared.
  • Transformer TX2 in R89 is still out of use. Work carried out on TX4 on 18th Oct. indicates that the TX2 problem is caused by over-sensitive earth-leakage detection. Plans are being made to resolve this.
  • Atlas are now running some user jobs at RAL. In order to ensure these do not cause problems (i.e. excessive load) for the existing Atlas software server, these jobs make use of a pilot CVMFS-based solution for delivering Atlas software to the worker nodes. This exposed a permissions problem whereby any Atlas user has rather too open access to all Atlas data. Plans have been made to address this.

Declared in the GOC DB

  • Upgrade to 64-bit OS on Castor disk servers to resolve checksumming problem. Wednesday 10th Nov.
  • Upgrade to the LHCb SRMs to provide greater capacity. At Risk (Warning) on srm-lhcb Thursday 11th Nov.
  • Upgrade CMS castor instance - Tuesday to Thursday 16-18 November.

Advance warning:

The following items remain to be scheduled/announced:

  • Tuesday 23rd November (TBC): intervention on the tape robot to fix the power to the cooling fans.
  • Upgrade castor Atlas instance to version 2.1.9 - Monday to Wednesday 6 - 8 December.
  • Power outage of the Atlas building over the weekend of 11/12 December. Whole Tier1 at risk.
  • Monday 13th December (just after LHC 2010 run ends): UPS test.
  • Upgrade to 64-bit OS on Castor disk servers to resolve checksumming problem.

Entries in GOC DB starting between 3rd and 10th November 2010.

There were no unscheduled entries in the GOC DB for this last week.

Service   | Scheduled? | Outage/At Risk | Start            | End              | Duration | Reason
----------|------------|----------------|------------------|------------------|----------|-------
srm-lhcb  | SCHEDULED  | OUTAGE         | 10/11/2010 08:00 | 10/11/2010 16:00 | 8 hours  | Service unavailable while disk servers upgraded to 64-bit OS.
site-bdii | SCHEDULED  | WARNING        | 03/11/2010 09:00 | 03/11/2010 13:00 | 4 hours  | Rolling update to glite 3.2 and SL5.