Tier1 Operations Report 2011-01-12

RAL Tier1 Operations Report for 12th January 2011

Review of Issues during the week from 5th to 12th January 2011.

  • Wednesday 5th - Brief network outage of around ten minutes in the late afternoon (around 16:00).
  • Thursday 6th - GDSS496 (CMSFarmRead) taken out of production; there are problems migrating a couple of files from it.
  • Saturday 8th - Problems with the site BDII resulted in around three hours of site unavailability for OPS.
  • Monday 10th - Read-only file system reported on GDSS327 (AtlasFarm – D0T1); the server was taken out of production.
  • Monday 10th - Increased the maximum number of LHCb batch jobs from 1200 to 1500.
  • Monday 10th - The draining of the last of the faulty batch of disk servers has been completed. All these servers are out of production.
  • Monday 10th - Following the PDU problems reported last week, it has been necessary to "rebalance" (re-synchronise) the Oracle database behind the 3D, FTS and LFC services. This exercise has been completed.
  • Monday 10th - Problem with tape migration for LHCb: some files have bad checksums, which we believe is due to failed transfers of the files into Castor. (A sketch of how such a mismatch shows up is given after this list.)
  • Monday 10th - With the draining of the faulty batch of disk servers complete, Atlas FTS channels were raised to 75% of nominal and the batch limits increased.
  • Tuesday 11th - Increased Atlas FTS channels back up to 100%.
  • Tuesday 11th - Short network break (around 10 minutes) at approximately 09:00.
  • Tuesday 11th - Disk server GDSS364 (CMSTemp) returned to production. It had been out of service since it crashed on 1st January. Following tests, faulty memory was replaced.
  • Tuesday 11th - High load on the Castor GEN instance SRM from T2K. Resolved by temporarily banning the user while they were contacted; the user has since been un-banned.
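
The LHCb tape migration problem above is a mismatch between the checksum recorded for a file and the checksum of the data actually stored. As a minimal illustration only (not the actual Castor tooling), the sketch below computes an Adler-32 checksum of a local copy of a file and compares it with a value taken from a catalogue; the script name, file path and expected value are hypothetical.

    import zlib
    import sys

    def adler32_of_file(path, chunk_size=1 << 20):
        """Compute the Adler-32 checksum of a file, reading it in chunks."""
        checksum = 1  # Adler-32 starts from 1
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                checksum = zlib.adler32(chunk, checksum)
        return checksum & 0xFFFFFFFF

    if __name__ == "__main__":
        # Usage: python check_adler32.py <file> <expected-hex-checksum>
        # e.g.   python check_adler32.py /tmp/copy_of_file 0x1a2b3c4d
        path, expected = sys.argv[1], int(sys.argv[2], 16)
        actual = adler32_of_file(path)
        if actual == expected:
            print(f"OK: {path} adler32 = {actual:#010x}")
        else:
            print(f"MISMATCH: {path} adler32 = {actual:#010x}, catalogue value {expected:#010x}")

A transfer that failed part way through would typically produce exactly this kind of mismatch, which is consistent with the suspected cause above.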

Current operational status and issues.

  • On Monday (10th January) there was a problem with the Atlas Castor instance that lasted about 30 minutes. The Castor Job Manager has been seen to hang periodically. A workaround of running two Job Managers is in place (for all Castor instances), but the problem is not yet understood. (A toy sketch of why the duplication helps follows this list.)
  • On Thursday 23rd December GDSS337 (GenTape) failed. There was only one un-migrated file (for T2K) on it. This server is still out of production and awaiting replacement memory.
  • On Saturday 25th December GDSS283 (cmsFarmRead) reported file system and fsprobe errors and was removed from production. Investigations into a possible hardware fault are ongoing. There is a possibility that a file system on this server is damaged; it held a small number (around 20) of un-migrated files when it failed.
  • Last week we reported a problem with disk servers becoming unresponsive. There have been no further cases during the last week. Work is ongoing (tests on the pre-production instance) to understand this failure.
  • Transformer TX2 in R89 is still out of use.
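
The Job Manager workaround above relies on simple redundancy: with two managers running, a hang in one does not stop jobs from being dispatched. The sketch below is only a toy model of that idea (it is not Castor code; names and numbers are invented): two workers drain one shared queue, one of them "hangs" part way through, and the other keeps the work moving.

    import queue
    import threading
    import time

    jobs = queue.Queue()
    for i in range(10):
        jobs.put(f"job-{i}")

    def worker(name, hang_after=None):
        """Drain the shared queue; optionally simulate a hang after a few jobs."""
        handled = 0
        while True:
            try:
                job = jobs.get(timeout=1)
            except queue.Empty:
                print(f"{name}: queue empty, exiting")
                return
            if hang_after is not None and handled == hang_after:
                print(f"{name}: simulating a hang")
                time.sleep(3600)  # this worker stops contributing
            print(f"{name}: processed {job}")
            handled += 1

    # Worker 1 hangs after two jobs; worker 2 drains the rest of the queue.
    t1 = threading.Thread(target=worker, args=("jobmanager-1",),
                          kwargs={"hang_after": 2}, daemon=True)
    t2 = threading.Thread(target=worker, args=("jobmanager-2",))
    t1.start()
    t2.start()
    t2.join()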

Declared in the GOC DB

  • 17/18 January: Upgrade to 64-bit OS on Castor disk servers for Atlas.
  • 22/23 January: Weekend power outage in Atlas building ("At Risk").

Advance warning:

The following items are being discussed and are still to be formally scheduled:

  • Application of kernel update to batch server (some small risk to batch services).
  • Upgrade to 64-bit OS on Castor disk servers to resolve checksumming problem. Possible dates for this are:
    • CMS - Late January or Early February 2011.
    • GEN - To Be Decided.
  • Increase shared memory for OGMA, LUGH & SOMNUS (rolling change), planned for Tuesday morning, 18th January. (A sketch of checking the current limits follows this list.)
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Upgrade all Oracle databases from version 10.2.0.4 to 10.2.0.5 (assuming this upgrade goes OK at CERN).
  • Detailed changes to batch configuration to enable scheduling by node.
  • Network (VLAN) reconfiguration to make more addresses available to the Tier1.
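
The shared memory item above refers to raising the kernel's SysV shared memory limits on the database nodes; the actual procedure and target values for OGMA, LUGH and SOMNUS are not given in this report. As a rough sketch under that caveat, the snippet below reads the current limits from /proc and flags whether they fall short of an assumed 8 GiB target; the target value is purely illustrative.

    import os

    # Hypothetical target: allow a single shared memory segment of 8 GiB.
    TARGET_SHMMAX_BYTES = 8 * 1024 ** 3

    def read_int(path):
        """Read a single integer value from a /proc/sys file."""
        with open(path) as f:
            return int(f.read().strip())

    def check_shared_memory():
        page_size = os.sysconf("SC_PAGE_SIZE")
        shmmax = read_int("/proc/sys/kernel/shmmax")  # max size of one segment (bytes)
        shmall = read_int("/proc/sys/kernel/shmall")  # total shared memory (pages)

        print(f"shmmax = {shmmax} bytes, shmall = {shmall} pages "
              f"({shmall * page_size} bytes)")

        if shmmax < TARGET_SHMMAX_BYTES:
            # The real change would be applied with sysctl (and /etc/sysctl.conf),
            # one node at a time to keep it a rolling change.
            print(f"shmmax below target: e.g. sysctl -w kernel.shmmax={TARGET_SHMMAX_BYTES}")
        else:
            print("shmmax already meets the target")

    if __name__ == "__main__":
        check_shared_memory()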

Entries in GOC DB starting between 5th and 12th January 2011.

There were two unscheduled entries, both "At Risk": one for the re-balancing of the "somnus" Oracle database and the other for updates to the site BDII.

Service                                                        | Scheduled?  | Outage/At Risk | Start            | End              | Duration | Reason
site-bdii                                                      | UNSCHEDULED | AT_RISK        | 12/01/2011 10:00 | 12/01/2011 13:00 | 3 hours  | At Risk during application of system updates.
lcgftm, lcgfts, lfc-atlas.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk | UNSCHEDULED | AT_RISK        | 10/01/2011 11:30 | 10/01/2011 15:30 | 4 hours  | At Risk while the database behind these services is rebalanced (re-synchronised) across disk servers.