Tier1 Operations Report 2010-02-17

RAL Tier1 Operations Report for 17th February 2010.

Review of Issues during the week 10th to 17th February 2010.

  • Disk server GDSS172 (lhcbUser D0T0) was unavailable on Thursday 11th February after the RAID array failed to rebuild following a disk failure.
  • Two disk servers (gdss401 and gdss412, both part of Atlas MCDISK) were out of production on Friday 12th owing to network issues affecting only those nodes.
  • Some problems with failing SAM tests over the weekend, possibly due to high network traffic and CMS not using lazy downloads.
  • Some problems with Atlas data transfers over the weekend (GGUS 55526). Traced to LSF problems.
  • CMS had a very high migration queue on Friday and again today.
  • Yesterday (the 16th) a node in the pluto RAC ran out of memory and caused a disruption to the gen and CMS castor instances.
  • A node rebooted in the pluto RAC this morning. This caused a 10-minute interruption while Oracle reconfigured the RAC.
  • Problems with lcgce02 failing SAM tests. This is currently under investigation.
  • There was an At Risk period on Monday, and there is another today, while additional memory is added to the remaining Castor Oracle RAC nodes.

Current operational status and issues.

  • The Castor system remains running with less resilience than hoped, as reported last week. Work is ongoing to improve resilience where practical, such as completing the memory upgrades in the Castor Oracle RAC nodes. This work both reduces any problems with memory bottlenecks and gives more capacity to fail over services should that be necessary.
  • In response to a request from Atlas, the memory limit on the '3GB' batch queue was increased to 4GB to enable their reprocessing to take place. The memory over-commit was not increased, so this could result in a loss of job slots.
  • On 31st December a recurrence of the Castor/Oracle 'BigID' problem was seen. This is still under investigation.
  • There is a problem with Castor Disk-to-Disk copies for LHCb from the LHCbUser Service Class. This is still under investigation.

Advance warning:

The following has been scheduled in the GOC DB:

  • Castor 'AT RISK' on Tuesday 23rd Feb while database NFS mounts are reconfigured.

The following items remain to be scheduled:

  • Investigations into the lack of resilience of the Castor Oracle infrastructure may produce a requirement for an intervention. In addition the following changes in this system have not yet been carried out.
    • reconfiguring NFS mounts (now scheduled for Tues 23rd).
    • removal of unstable node from RAC.
    • insertion of new node into RAC.
  • Adding the second CPU back into the UKLIGHT router.

Entries in GOC DB starting between 10th and 17th February 2010.

One UNSCHEDULED entry (an At Risk)

  • The whole site At Risk for the coolant pumps on the 11th/12th was late going into the GOC DB.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor Instances | SCHEDULED | AT_RISK | 17/02/2010 10:30 | 17/02/2010 16:00 | 5 hours and 30 minutes | At Risk on castor during memory update on Oracle RAC nodes. This has been moved from Tuesday because of staffing constraints.
All Castor Instances | SCHEDULED | AT_RISK | 15/02/2010 10:30 | 15/02/2010 16:00 | 5 hours and 30 minutes | At Risk on castor during memory update on Oracle RAC nodes.
Whole site | UNSCHEDULED | AT_RISK | 11/02/2010 09:00 | 12/02/2010 09:02 | 1 day, 2 minutes | Site at risk while we upgrade the air conditioning. We will be adding two extra coolant pumps to the air conditioning system. This 'at risk' is to cover the running-in period of the new pumps.