Production Team Report 2010-12-06

From GridPP Wiki

Latest revision as of 15:40, 6 December 2010

RAL Tier1 Production Team Report for 6th December 2010.

AoD This Week

Mon - Wed: John; Thu: Gareth; Fri: John

Last week

  • Gareth: GoD and AoD for 1 day.
  • John: Deploying Nagios test for read-only file systems and fsprobe test.
  • Tiju: AoD - 4 days.

Changes to Operating procedures

  • Call-outs being added for FSPROBE errors and read-only file systems.

Declared Outages in GOC DB

  • Mon 6th - Wed 8th Dec: Upgrade of Atlas Castor instance.
  • Tue 7th: At risk for FTM while it is being quattorized.
  • 10th to 13th: Power outage in the Atlas building over next weekend.

Advanced Warning

  • Weekend 11/12 Dec: Power outage in Atlas building.
  • Monday 13th December - UPS test.

Other Changes

  • Fabric:
    • Removal of sl08 disk servers from production (with Castor team)
    • Double the network link to the tape robot stack (stack 12), postponed from the last TS. (Requires Castor stop).
    • Swap out the older of the pair of SAN switches for the Tier1 Oracle databases for its new replacement. (Requires FTS, LFC, 3D stop).
  • Database:
    • Revisit non-Castor database multipathing
    • Increase shared memory for OGMA, LUGH & SOMNUS
  • Grid Services:
    • Changes to increase resilience of the BDII service
  • Castor:
    • Change ATLAS Castor permissions to prevent users deleting data
    • Removal of sl08 disk servers from production
  • Networks:
    • None