Production Team Report 2010-10-25

From GridPP Wiki
Revision as of 14:02, 25 October 2010 by John kelly


RAL Tier1 Production Team Report for 25th October 2010.

AoD This Week

  • Mon - Tue: John
  • Wed: Gareth
  • Thu - Fri: John

Last week

  • Gareth: AoD (1 day); some follow-up on Post Mortem and DM issues.
  • John: Preparing for the Castor GEN upgrade; rewriting and deploying the castorVoverwatch script.
  • Tiju: AoD (4 days).

Changes to Operating procedures

  • New GOC DB interface ("At Risk" state replaced by "Warning").
  • Detailed changes to the procedure for disabling batch worker nodes (ensure the nodes are put into an extended Nagios downtime).

Declared Outages in GOC DB

  • Mon - Wed 25-27 Oct: Castor GEN upgrade.
  • Tues 2nd Nov: "Warning" on MyProxy for migration to Quattorized service.

Advance Warning

  • The remaining Castor upgrades will almost certainly take place on the following dates:
    • Upgrade CMS - during the week beginning 8 November
    • Upgrade ATLAS - during the week beginning 22 November
  • Monday 13th December - UPS test.

Other Changes

  • Fabric:
    • Upgrade of disk servers to a 64-bit OS (to resolve the checksum problem).
    • Tape Drive Microcode Update.
    • Double the network link to the tape robot stack (stack 12), postponed from the last TS (requires Castor stop).
    • Swap out the older of the pair of SAN switches in the Tier1 Oracle databases for its new replacement. (Requires FTS, LFC, 3D stop).
    • Update firmware in RAID controller cards for a batch of disk servers.
  • Database:
    • Re-visit non-Castor database multipathing
  • Grid Services:
    • Replacing the MON box with glite-APEL
    • Upgrade site-level BDIIs to glite 3.2 and SL5
  • Castor:
    • Possible SRM update
  • Networks:
    • None
  • ATLAS:
    • Enable user jobs.