Production Team Report 2010-09-06

RAL Tier1 Production Team Report for 6th September 2010.

AoD This Week

Mon: Tiju; Tues: James T; Wed: Gareth; Thu & Fri: John

Last week

  • Gareth: AoD (1 day), A/L (Fri)
  • John: CERN School
  • Tiju: AoD (2 days), Virtualisation

Leave

  • Gareth: Monday

Changes to Operating procedures

  • New callout system in production.

Declared Outages in GOC DB

  • Sept 7 (08:30-17:00): At Risk while switching over to a quattorised pair of site-level BDIIs.
  • Sept 1-9: lcgwms01 - Maintenance and update (glite-WMS 3.1.29). Includes time for drain ahead of intervention.
  • Sept 9-16: lcgwms02 - Maintenance and update (glite-WMS 3.1.29). Includes time for drain ahead of intervention.

Advanced Warning

  • September 7: Test of seal under 1st floor kitchen. (No access to kitchen.)
  • Wednesday 8 Sept to Wed 15th Sept: Migrate Nagios checks for batch workers to new slave server
  • Weekend 2/3 October: Power outage in Atlas building.
  • Update WNs (glite update)
  • Replace RAL site-level BDII servers
  • Update RAID controller firmware on all Streamline 2008 disk servers

Other Changes

  • Fabric:
    • Double the network link to the tape robot stack (stack 12), postponed from the last TS. (Requires Castor stop).
    • Swap out the older of the pair of SAN switches in the Tier1 Oracle databases for its new replacement. (Requires FTS, LFC, 3D stop).
    • New kernels and glibc updates on non-Castor Oracle RAC nodes. (Done for LUGH).
    • Updates to Amanda backup - unblocking possible other updates.
  • Database:
    • Re-visit non-Castor database multipathing
  • Grid Services:
    • None apart from those listed above.
  • Castor:
    • Possible SRM update
    • Castor 2.1.9 upgrade
  • Networks: