RAL Tier1 Operations Report for 9th March 2011

Review of Issues during the week from 3rd to 9th March 2011.

  • Thursday 3rd March: Backup OPN link to CERN failed. No operational impact.
  • The checksum script found checksum mismatches on several disk servers (see the verification sketch after this list).
  • Changes:
    • Applying updates and patches to the batch farm on a rolling basis, and to other machines as and when possible.
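
As an illustration of the kind of check the checksum script performs, the minimal sketch below recomputes a checksum for each file and flags disagreements with the stored value. The use of adler32, the file list and the source of the expected checksums are assumptions here, not details of the actual RAL script.

    #!/usr/bin/env python
    # Minimal checksum-verification sketch (hypothetical, not the RAL script).
    # Assumes adler32 checksums and an 'expected' dict of path -> stored value.
    import zlib

    def adler32_of(path, chunk_size=1 << 20):
        """Stream the file in 1 MiB chunks so large files fit in memory."""
        value = 1  # adler32's defined initial value
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                value = zlib.adler32(chunk, value)
        return value & 0xFFFFFFFF

    def find_mismatches(expected):
        """Return (path, stored, recomputed) for each file that disagrees."""
        mismatches = []
        for path, stored in expected.items():
            recomputed = adler32_of(path)
            if recomputed != stored:
                mismatches.append((path, stored, recomputed))
        return mismatches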

Current operational status and issues.

  • Atlas have reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). This is being investigated.
  • We are aware of a problem whereby the Castor Job Manager occasionally hangs; the service recovers by itself after around 30 minutes. Automated capture of additional diagnostic information is in place (see the watchdog sketch after this list) and we await the next occurrence.
  • We have previously reported a problem with disk servers becoming unresponsive. We believe we understand the cause and are testing a fix, which has been rolled out to all CMS disk servers. Both the problem and the fix are still being monitored.
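
For illustration only, a watchdog along the lines sketched below could probe the Job Manager and dump diagnostics when the probe times out. The probe command, the polling interval and the capture directory are invented for the sketch; the report does not describe the actual RAL tooling.

    #!/usr/bin/env python3
    # Hypothetical hang-detector: probe the service and, on a timeout,
    # snapshot process and socket state before the daemon recovers.
    import datetime
    import os
    import subprocess
    import time

    PROBE = ["castor-jobmanager-ping"]     # assumed liveness probe
    DIAG_DIR = "/var/tmp/jobmanager-diag"  # assumed capture directory

    def capture_diagnostics():
        os.makedirs(DIAG_DIR, exist_ok=True)
        stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        with open(os.path.join(DIAG_DIR, "capture-" + stamp + ".txt"), "w") as out:
            # Record the process table and TCP sockets at the moment of the hang.
            for cmd in (["ps", "-eLf"], ["ss", "-tnp"]):
                out.write("== " + " ".join(cmd) + " ==\n")
                out.write(subprocess.run(cmd, capture_output=True, text=True).stdout)

    while True:
        try:
            subprocess.run(PROBE, timeout=60, check=True)
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            capture_diagnostics()
        time.sleep(300)  # probe every five minutes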

Declared in the GOC DB

  • Tuesday 8th March: Downtime of batch farm for Castor nameserver upgrade.
  • Wednesday 9th March: Downtime for upgrade of the Castor nameserver.
  • Thursday 10th March: At Risk for gLite update on the Site BDIIs.
  • Thursday 10th March to Friday 11th March (15:00): At Risk on the Atlas LFC while renaming files in bulk.

Advance warning:

The following items are being discussed and are still to be formally scheduled:

  • Updates to the Site Routers on the morning of Tuesday 15th March. (Will require a Tier1 outage.)
  • Castor upgrades (possibly directly to 2.1.10), possibly in late March (27th/28th/29th, during the technical stop).
  • Switch Castor to the new database infrastructure, possibly at the end of March.
  • Increase shared memory for OGMA, LUGH & SOMNUS (rolling change; OGMA done).
  • Address a permissions problem regarding Atlas user access to all Atlas data.
  • Small isolating transformers have been received that should eliminate the electrical noise seen by the disk arrays hosting the Oracle databases. One was installed for one of the Castor systems on 8th March during the Castor stop (now done); the others will follow during the networking intervention on 15th March.
  • Atlas requested changes pending:
    • Renaming Atlas files (that were in MCDisk); see the bulk-rename sketch after this list.
    • Adding xrootd libraries to worker nodes
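
A bulk rename of the kind pending for Atlas might look like the sketch below, assuming the classic LFC Python bindings (the lfc module wrapping the C client's lfc_rename call). The mapping-file format and paths are invented for illustration; the actual procedure is not described in this report.

    #!/usr/bin/env python
    # Hypothetical bulk rename against the LFC. Assumes the LFC Python
    # bindings; the mapping file holds one "<old_lfn> <new_lfn>" per line.
    import sys
    import lfc

    def bulk_rename(mapping_file):
        failures = 0
        with open(mapping_file) as f:
            for line in f:
                old, new = line.split()
                if lfc.lfc_rename(old, new) != 0:  # 0 indicates success
                    failures += 1
                    sys.stderr.write("rename failed: " + old + "\n")
        return failures

    if __name__ == "__main__":
        sys.exit(1 if bulk_rename(sys.argv[1]) else 0)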

Entries in GOC DB starting between 2nd March and 9th March 2011.

There were no unscheduled entries this last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-alice.gridpp.rl.ac.uk, srm-atlas.gridpp.rl.ac.uk, srm-cert.gridpp.rl.ac.uk, srm-cms.gridpp.rl.ac.uk, srm-dteam.gridpp.rl.ac.uk, srm-hone.gridpp.rl.ac.uk, srm-ilc.gridpp.rl.ac.uk, srm-lhcb.gridpp.rl.ac.uk, srm-mice.gridpp.rl.ac.uk, srm-minos.gridpp.rl.ac.uk, srm-preprod.gridpp.rl.ac.uk, srm-superb.gridpp.rl.ac.uk, srm-t2k.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 09/03/2011 08:00 | 09/03/2011 15:00 | 7 hours | Update to the Castor Nameserver.
lcgce02.gridpp.rl.ac.uk, lcgce03.gridpp.rl.ac.uk, lcgce05.gridpp.rl.ac.uk, lcgce06.gridpp.rl.ac.uk, lcgce07.gridpp.rl.ac.uk, lcgce08.gridpp.rl.ac.uk, lcgce09.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 08/03/2011 20:00 | 09/03/2011 15:00 | 19 hours | Batch stop for the Castor Nameserver update.
lcgic01.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 02/03/2011 12:00 | 01/04/2011 01:00 | 29 days, 12 hours | RGMA registry to be decommissioned.