Tier1 Operations Report 2010-04-28

RAL Tier1 Operations Report for 28th April 2010.

Review of Issues during week 21st to 28th April 2010.

  • The Viglen 09 batch workers were brought into production last week. Initially jobs would only go to those nodes when the rest of the farm was full. This behaviour was changed so that these nodes are now used preferentially for batch work.
  • Over the weekend (24th) we failed some SAM tests. This was part of a more widespread problem caused by a fault on a BDII at CERN.
  • Over the weekend (25th) there was a problem on the Atlas Software server. One of the daemons (rpc.mountd) failed overnight. This was restarted on Sunday lunchtime.
  • On Monday 26th it was necessary to reboot disk server gdss145 (Babar). A disk was replaced, but the system did not see the replacement until the reboot.
  • The first disk server built using Quattor to go into production went into use on Monday (26th).
  • During Monday, and particularly Tuesday, there were DNS issues which caused problems for transfers to some UK Tier2s. The problem was limited to the lookup of some .uk addresses and the cause was believed to lie outside RAL. At the end of Tuesday afternoon the disk servers were updated to use different DNS servers at RAL that were not exhibiting this problem. The site was put "At Risk" overnight. The problem appeared fixed by Wednesday morning.

Current operational status and issues.

  • gdss290, part of CMSFarmRead, has been out of production since Monday. There were file system problems. One file on the server had not been migrated to tape and may have been lost - although this is not yet certain.

Declared in the GOC DB:

  • The drain and update of CE01 to enable glexec is in progress.
  • We are currently (Wednesday morning) in the middle of an outage for the following:
    • Re-balancing of SOMNUS database. (This operation brings the mirrored pairs of databases back into synchronization.)
    • Adding an extra node into the Castor SAN, and configuration for the backup.
    • Adding an extra switch into the network stack in the UPS room to increase capacity there.

Advance warning:

The following items remain to be scheduled:

  • Kernel and glibc updates will need to be done on LFC, FTS & LHCb 3D Oracle database (RAC) nodes.
  • Addition of Castor 32-bit libraries onto worker nodes. (Possibly for implementation tomorrow, 29th April).
  • Advance notice: Probable UPS test (implying site At Risk) during the next LHC technical stop.

Entries in GOC DB starting between 21st and 28th April 2010.

There were three unscheduled entries in the GOC DB for this last week.

  • The entry for the LFCs At Risk on 22nd April is a mistake. There was no At Risk. (The entry was added with an incorrect date in the past and cannot be removed.)
  • The reconfiguration of CE01 to enable mappings for glexec has been planned. As this outage affects ALICE most directly they were consulted on when to intervene and suggested an immediate start.
  • The site was put At Risk overnight on 27th/28th April when we were seeing DNS issues when resolving some .uk addresses.

| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
|---|---|---|---|---|---|---|
| lfc-atlas.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 28/04/2010 10:30 | 28/04/2010 14:30 | 4 hours | At Risk on LFCs during maintenance work on back-end databases and local network infrastructure. |
| lcgce01.gridpp.rl.ac.uk, lcgce02.gridpp.rl.ac.uk, lcgce05.gridpp.rl.ac.uk, lcgce06.gridpp.rl.ac.uk, lcgce07.gridpp.rl.ac.uk, lcgce08.gridpp.rl.ac.uk, srm-alice.gridpp.rl.ac.uk, srm-atlas.gridpp.rl.ac.uk, srm-cms.gridpp.rl.ac.uk, srm-dteam.gridpp.rl.ac.uk, srm-hone.gridpp.rl.ac.uk, srm-ilc.gridpp.rl.ac.uk, srm-lhcb.gridpp.rl.ac.uk, srm-mice.gridpp.rl.ac.uk, srm-minos.gridpp.rl.ac.uk, srm-t2k.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 28/04/2010 10:15 | 28/04/2010 14:30 | 4 hours and 15 minutes | Castor and batch services stopped during work to extend internal network capacity and update infrastructure behind Castor databases. |
| lcgfts.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 28/04/2010 09:15 | 28/04/2010 14:30 | 5 hours and 15 minutes | During this time Castor services at RAL will be unavailable. FTS channels to/from the RAL Tier1 will be drained and stopped. Other channels will continue to work. |
| Whole Site | UNSCHEDULED | AT_RISK | 27/04/2010 16:44 | 28/04/2010 10:30 | 17 hours and 46 minutes | We have declared an "At Risk" as we are encountering DNS problems that are believed to be external to the site. These are causing some problems for file transfers to other UK sites. |
| lcgce01.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 27/04/2010 16:04 | 29/04/2010 17:00 | 2 days, 56 minutes | Reconfiguring the CE to enable mappings for glexec |
| lfc-atlas.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk | UNSCHEDULED | AT_RISK | 22/04/2010 10:30 | 22/04/2010 14:30 | 4 hours | At Risk on LFCs during maintenance work on back-end databases and local network infrastructure. |
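
The Duration values in the entries above can be re-derived from the Start and End timestamps. The following is a minimal sketch of that arithmetic, not part of the GOC DB itself: the function name `downtime_duration` and the constant `GOCDB_FORMAT` are hypothetical, and the `DD/MM/YYYY HH:MM` timestamp layout is an assumption read off the entries in this report.

```python
from datetime import datetime

# Assumed timestamp layout, inferred from the entries in this report
# (e.g. "28/04/2010 10:15"); not taken from any GOC DB specification.
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> tuple[int, int, int]:
    """Return the length of a downtime as (days, hours, minutes)."""
    delta = datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)
    total_minutes = int(delta.total_seconds()) // 60
    if total_minutes < 0:
        raise ValueError("end is earlier than start")
    # Split the total into whole days, then hours and minutes of the remainder.
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return days, hours, minutes


if __name__ == "__main__":
    # The CE01 glexec reconfiguration entry from this report.
    print(downtime_duration("27/04/2010 16:04", "29/04/2010 17:00"))
```

For the CE01 entry this gives 2 days, 0 hours and 56 minutes, which agrees with the "2 days, 56 minutes" recorded above.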