Tier1 Operations Report 2010-12-22

From GridPP Wiki


RAL Tier1 Operations Report for 22nd December 2010

Review of Issues during the week from 15th to 22nd December 2010.

  • Intermittent problems seen by CMS in the connectivity between worker nodes and their squids were reported last week. The cause has been traced to a configuration that required a DNS lookup of the name of the CERN peered server for each file access. This has now been corrected.
  • An Alice tape broke. It had some 12,000 files on it, of which around 4,000 were recovered from the staging disk. Alice have been informed.
  • Thursday 16th - Late in the afternoon we failed an LHCb VO SAM test on srm-lhcb. The cause was the Castor JobManager, which had stopped responding (although the process was still running).
  • Thursday 16th - A failed transceiver was found (and replaced) on one of the pair of uplinks from switch stack 13. For some time we had been running with only one of the pair of links working.
  • Friday 17th Dec. There was one call-out, for gdss70 (LHCbMDST), which had failed with file system errors. There were no un-migrated files on the server and it remains out of production for investigation.
  • Friday 17th Dec. Overnight we saw a very high rate of access to Castor by Atlas, and many transfers failed as the Castor scheduler (LSF) struggled. Atlas blacklisted the UK cloud. The number of FTS transfers and the total number of batch jobs were throttled back and the situation recovered.
  • GDSS357 (CMSFarmRead) failed overnight Saturday/Sunday (18/19 December). The hardware fault is being followed up with the vendor.
  • The rolling update of microcode on the tape drives was completed (second half done Tuesday 21st December).

Current operational status and issues.

  • A particular batch of disk servers has been responsible for a significant number of disk server failures. These servers are being drained and taken out of use. This process is well underway and is expected to complete within the next week or so.
  • Transformer TX2 in R89 is still out of use.

Declared in the GOC DB

  • 17/18 January: Upgrade to 64-bit OS on Castor disk servers for Atlas.

Advance warning:

The following items are being discussed and are still to be formally scheduled:

  • Application of kernel update to batch server (some small risk to batch services).
  • Upgrade to 64-bit OS on Castor disk servers to resolve checksumming problem. Possible dates for this are:
    • CMS - late January or early February 2011.
  • 22/23 January. Weekend power outage in Atlas building (leading to "At Risk").
  • Increase shared memory for OGMA, LUGH & SOMNUS (rolling change).
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Upgrade all Oracle databases from version 10.2.0.4 to 10.2.0.5 (assuming this upgrade goes OK at CERN).

Plans for Christmas and New Year Holidays

The Tier1 will remain up over the holiday. Details of the plans have been published via a blog entry at:

http://www.gridpp.rl.ac.uk/blog/2010/12/10/ral-tier1-%E2%80%93-plans-for-christmas-holiday/

Entries in GOC DB starting between 15th and 22nd December 2010.

  • None.