Production Team Report 2011-01-17


RAL Tier1 Production Team Report for 17th January 2011.

AoD This Week

Mon: Gareth; Tue - Thu: Tiju; Fri: Gareth

Last week

  • Gareth: AoD (2 days); Post Mortems on Disk servers.
  • John: A/L
  • Tiju: AoD (3 days); Set-up trial of SMS notifications in parallel with pager.

Changes to Operating procedures

  • Trial of SMS notifications in parallel with pager.

Declared Outages in GOC DB

  • Mon/Tues 17/18 January - 64-bit OS on Atlas Castor disk servers.
  • Sat/Sun 22nd January - Power outage in Atlas building.

Advanced Warning

  • Tuesday 18th January - Increase shared memory for OGMA.
  • Wednesday 19th January - glite updates to site BDIIs.

Other Changes

  • Fabric:
    • Application of kernel update to batch server.
    • Add a further gateway address to enable an additional IP range.
    • Double the network link to the tape robot stack (stack 12). (Requires Castor stop - probably to be done when the Oracle updates are applied).
    • Swap out the older of the pair of SAN switches in the Tier1 Oracle databases for its new replacement. (Requires FTS, LFC, 3D stop).
  • Database:
    • Oracle 10.2.0.5 upgrade. (To be done after CERN has updated similar databases).
    • Re-visit non-Castor database multipathing
    • Increase shared memory for OGMA, LUGH & SOMNUS (proposed for Tuesday 18th Jan).
  • Grid Services:
    • Changes to increase resilience of the BDII service
    • glite update on site BDII nodes.
    • Change batch job OS selection mechanism (part of enabling scheduling by node).
  • Castor:
    • Upgrade CMS Disk Servers to 64-bit - probably Mon/Tue 31st Jan/1st Feb.
    • Upgrade GEN Disk Servers to 64-bit - date TBD.
    • Change ATLAS castor permissions to prevent users deleting data
    • Upgrade Puppetmaster & Clients
  • Networks:
    • None