Tier1 Operations Report 2010-12-08


RAL Tier1 Operations Report for 8th December 2010

Review of Issues during the week from 1st to 8th December 2010.

  • At 13:30 on Wednesday 1st December, just as last week's meeting was about to start, there was a brief power interruption across the RAL site. A number of Tier1 systems (disk servers and batch workers) went down. The whole site was declared as being in an outage while an assessment was made. Grid services (i.e. not batch or Castor) stayed up as they have UPS backup. With the exception of LHCb, the Castor (SRM) endpoints were declared down overnight as numerous disk servers were not yet available - they were checking (fsck) their file systems. All LHCb disk servers were up by the end of the afternoon and srm-lhcb was declared up (with only a 'Warning') overnight.
  • On Friday (3rd December) there was a problem with the 'handbots' within the tape robot, which was effectively down. An engineer was called out during the night and attended site very early that morning.
  • There was a problem with the CMS SRM over the weekend: it was producing core dumps. On Monday (6th) the nodes were upgraded to SRM version 2.8-6, which should correct the problem.
  • On Monday we announced (via a broadcast) that we had a problem with our callout (pager) mechanism. This was resolved (by the phone supplier) early that evening.
  • Also on Monday there were problems with CMS connectivity from 05:20 to 05:40 and again from 21:55 to 22:20. Note added after meeting: whilst the exact reason for this is unknown, these problems coincide with external network failovers.
  • There was a networking outage that affected the whole of the RAL site during the morning of Tuesday 7th December (06:30 - 12:15), caused by a problem with the Site Access Router. There was some difficulty in getting the message out that the RAL Tier1 was off-air. Once network connectivity was re-established there were some resultant issues (e.g. DNS lookups failing and BDII information not being available) and it took of the order of one to two hours for all services to be fully functional.
  • The FTM (FTS monitor) was successfully updated to a Quattorised installation on Tuesday (7th Dec.)
  • The Atlas Castor Upgrade has been completed. The outage was ended in the GOC DB at 12:15 today (8th Dec.)

Current operational status and issues.

  • One disk server (gdss77 - CMSFarmRead) has been out of production since the power outage. It requires the OS to be re-installed.
  • Problems with a particular batch of disk servers: this batch has been responsible for a significant number of disk server failures, including the two Atlas ones, during the last few weeks. This batch of disk servers will be removed from service; the recent availability of the large batch from the '09 purchase makes this possible. It is anticipated this will take place over the next week or two. In the meantime the problematic servers have been marked read-only. For D1T0 servers this has been done by setting the scheduling within Castor to not allocate writes to the relevant disk servers (the idea is illustrated in the sketch after this list). Files on these servers are still available (for batch work and via FTS). (This item unchanged since last week).
  • The performance of the LHCb disk servers continues to be monitored. The maximum number of LHCb batch jobs has been held at 1200 for this last week.
  • During the night of 26-27 October disk server GDSS117 (CMSWanIn) failed with a read-only file system and was removed from production. This server will be replaced with a spare.
  • Transformer TX2 in R89 is still out of use.
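
The read-only arrangement described above amounts to excluding the affected servers from write allocation while leaving reads untouched. The following is a purely illustrative Python sketch of that selection logic - it is not the Castor scheduler or its API, and the server names and states are invented:

    # Purely illustrative: models "read-only" disk servers being excluded from
    # write allocation while still serving reads. This is NOT the Castor
    # scheduler or its API; the server names and states are invented.
    import random

    # Hypothetical D1T0 pool: (server name, accepts writes?)
    disk_servers = [
        ("gdssA", True),
        ("gdssB", False),   # problematic batch, marked read-only
        ("gdssC", True),
    ]

    def pick_write_target(servers):
        """Choose a server for a new file, skipping read-only ones."""
        writable = [name for name, ok in servers if ok]
        if not writable:
            raise RuntimeError("no writable disk servers available")
        return random.choice(writable)

    def pick_read_source(servers):
        """Reads (batch work, FTS transfers) may use any server."""
        return random.choice([name for name, _ in servers])

    print("write allocated to:", pick_write_target(disk_servers))
    print("read served from  :", pick_read_source(disk_servers))

The point is simply that new writes are steered away from the problem servers while the files they already hold remain readable.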

Declared in the GOC DB

  • "Outage" for upgrade of Atlas Castor instance - Monday to Wednesday 6-8 December.
  • Power outage of Atlas building weekend 11/12 December. Whole Tier1 at risk.
  • Monday 13th December: UPS test.

Advanced warning:

The following items remain to be scheduled/announced:

  • Migration of data off the batch of disk servers that have been giving problems.
  • Application of kernel update to batch server (At Risk).
  • Rolling update of microcode on second half of tape drives.
  • Upgrade to 64-bit OS on Castor disk servers to resolve checksumming problem.
  • Increase shared memory for OGMA, LUGH & SOMNUS (rolling change) - see the sketch after this list.
  • Address permissions problem regarding Atlas User access to all Atlas data.
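
For the shared-memory item above, the usual mechanism on Linux is raising the kernel.shmmax and kernel.shmall sysctl settings ahead of enlarging the Oracle SGA; this assumes the planned change is of that form. The Python sketch below simply reports the current values against placeholder targets - the targets are not the values planned for OGMA, LUGH or SOMNUS:

    # Illustrative only: assumes the "increase shared memory" item means the
    # usual kernel.shmmax / kernel.shmall tuning done before enlarging an
    # Oracle SGA. The target values are placeholders, not the values planned
    # for OGMA, LUGH or SOMNUS.

    TARGETS = {
        "/proc/sys/kernel/shmmax": 8 * 1024**3,   # largest single segment, bytes
        "/proc/sys/kernel/shmall": 2 * 1024**2,   # total shared memory, in pages
    }

    def current_value(path):
        with open(path) as f:
            return int(f.read().strip())

    for path, target in TARGETS.items():
        value = current_value(path)
        state = "OK" if value >= target else "needs raising"
        print(f"{path}: current {value}, target {target} -> {state}")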

Entries in GOC DB starting between 1st and 8th December 2010.

There were five unscheduled entries in the GOC DB for this last week. Three were as a result of the power interruption, one was to modify the power for the disk arrays holding the Oracle databases, and one was for the site network outage.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgftm | SCHEDULED | WARNING | 07/12/2010 10:00 | 07/12/2010 12:00 | 2 hours | Moving to a Quattor installation for FTM.
Whole site | UNSCHEDULED | OUTAGE | 07/12/2010 06:30 | 07/12/2010 12:30 | 6 hours | Site outage owing to network fault.
srm-atlas | SCHEDULED | OUTAGE | 06/12/2010 08:00 | 08/12/2010 18:00 | 2 days, 10 hours | Upgrade of Atlas Castor instance to version 2.1.9.
LFCs, FTS & Castor (all SRM endpoints) | UNSCHEDULED | WARNING | 02/12/2010 13:00 | 02/12/2010 14:00 | 1 hour | Warning on some services during minor reconfiguration of power supply to database hardware.
All Castor except LHCb (srm-alice, srm-atlas, srm-cms, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-superb, srm-t2k.gridpp.rl.ac.uk) | UNSCHEDULED | OUTAGE | 01/12/2010 16:10 | 02/12/2010 10:00 | 17 hours and 50 minutes | Following the short power interruption earlier today many disk servers are checking their file systems (fsck).
srm-lhcb | UNSCHEDULED | WARNING | 01/12/2010 16:10 | 02/12/2010 10:00 | 17 hours and 50 minutes | Following the short power interruption earlier today all LHCb disk servers are up, but we are still making checks.
Whole site | UNSCHEDULED | OUTAGE | 01/12/2010 13:30 | 01/12/2010 16:20 | 2 hours and 50 minutes | Power glitch at RAL.