RAL Tier1 weekly operations castor 19/10/2009

Summary of Previous Week

  • CASTOR F2F at CERN (Chris, Matt)
  • Continuing to deal with the fallout from the Oracle disk controller crash, specifically the rollback of the databases (All)
    • Investigation into exactly what happened (All, DB Team)
    • Investigating the consequences of re-using NS uniqueids (Chris, Matt, CERN team)
    • Producing lists of lost and at-risk files (Chris, Matt)
    • Gathering information for post mortem (All)
    • Increased NS uniqueid counter in NS database (All, DB Team)
  • Deployed one new disk server for LHCb (Chris)
  • Tweaked database backups to try out a grandfather/father/son cycle; see the retention sketch after this list (Cheney)
  • Continued with build of new db server cdbe07 (Cheney)
  • Tweaked backups of redo logs to dmf for Pluto (Cheney)
  • Added bulk log disk array for Pluto redo log archive (Cheney)
  • Fixed cdbe02 and configured it to pick up the Overland array (Cheney)
  • Shifted EMC array to run on a different PDU but the same power supply (Cheney)
  • Building tape robot controller to swap out buxton (Cheney)
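
The grandfather/father/son backup cycle above could be checked by a small script along the following lines. This is only an illustrative Python sketch: the filename format, tier rules and retention counts are assumptions, not the actual RAL backup configuration.

  # Illustrative sketch only: classify dated backup files into GFS tiers and
  # flag those outside their tier's retention window. Filename format and
  # retention counts are assumptions, not the real CASTOR backup setup.
  import datetime
  import pathlib

  KEEP = {"son": 7,           # daily backups kept for a week
          "father": 4,        # weekly (Sunday) backups kept for a month
          "grandfather": 12}  # monthly (1st of month) backups kept for a year

  def tier(day: datetime.date) -> str:
      """Return the GFS tier of a backup taken on 'day'."""
      if day.day == 1:
          return "grandfather"
      if day.weekday() == 6:   # Sunday
          return "father"
      return "son"

  def expired(backups: list[pathlib.Path]) -> list[pathlib.Path]:
      """Return backups that fall outside their tier's retention window."""
      by_tier = {"son": [], "father": [], "grandfather": []}
      # Filenames are assumed to look like 'pluto-2009-10-19.dmp'.
      for path in sorted(backups, key=lambda p: p.stem, reverse=True):
          day = datetime.date.fromisoformat(path.stem.split("-", 1)[1])
          by_tier[tier(day)].append(path)
      return [p for name, paths in by_tier.items() for p in paths[KEEP[name]:]]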

Developments for this week

  • Set up 2.1.8 on repack server with Puppet (Chris)
  • Working on puppet manifest for polymorphic central servers (Chris)
  • Testing various combinations of EMC kit versus power supply (Cheney)
  • Regenerate Nagios config for disk servers; see the sketch after this list (Cheney)
  • Build spare tape robot controller (Cheney)
  • Build replacement db server (Cheney)
  • Techwatch newsletter (Cheney)
  • Making ATLAS file lists for comparison against the LFC; see the comparison sketch after this list (Matt)
  • Contributing to incident PMs (Matt)
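
The Nagios regeneration for disk servers could look roughly like the following. This is a sketch only: the file paths, hostgroup and host template names are assumptions, not the actual RAL Nagios configuration.

  # Illustrative sketch only: regenerate Nagios host definitions for disk
  # servers from a plain host list. Paths, the hostgroup and the template
  # name are assumptions, not the real RAL Nagios setup.
  HOST_TEMPLATE = """define host {{
      use         generic-diskserver
      host_name   {name}
      address     {name}.example.ac.uk
      hostgroups  castor-diskservers
  }}
  """

  def regen(hostlist="diskservers.txt", out="diskservers.cfg"):
      with open(hostlist) as src, open(out, "w") as cfg:
          for line in src:
              name = line.strip()
              if name and not name.startswith("#"):   # skip blanks and comments
                  cfg.write(HOST_TEMPLATE.format(name=name))

  if __name__ == "__main__":
      regen()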

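The ATLAS file-list comparison against the LFC could follow the pattern below. The sketch assumes both the CASTOR file list and the LFC dump are plain text with one entry per line and a comparable key (path or SURL) in the first column; the input filenames are hypothetical.

  # Illustrative sketch only: compare a CASTOR file list with an LFC dump.
  # Dump formats and filenames are assumptions for illustration.
  def load(dump):
      with open(dump) as f:
          return {line.split()[0] for line in f if line.strip()}

  def compare(castor_dump, lfc_dump):
      castor, lfc = load(castor_dump), load(lfc_dump)
      # Catalogue entries with no CASTOR file are candidates for the 'lost'
      # list; CASTOR files absent from the catalogue are dark data.
      for entry in sorted(lfc - castor):
          print("missing-from-castor", entry)
      for entry in sorted(castor - lfc):
          print("missing-from-lfc", entry)

  if __name__ == "__main__":
      compare("castor_atlas_files.txt", "lfc_atlas_dump.txt")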

Ongoing

  • SRM 2.8-1 deployment on Gen, LHCb, CMS (Shaun)
  • CastorMon monitoring graphs for Gen instance (Brian)
  • Black and White list tests (Chris)
  • Disaster recovery document (Matt)

Operations Issues

  • Possible data loss resulting from re-using NS uniqueids (TBC).
  • Problems with the DNS server (chiton) affected all CASTOR instances for 4-5 hours.

Blocking issues

  • Problems with the Ganglia check on the Gen instance are delaying work on monitoring (in hand)

Planned, Scheduled and Cancelled Down Times

none

Changes to Production Milestones

none

Advanced Planning

  • Black and White lists? (delayed until it is required on a 'per-instance' basis)
  • Improve resiliency of central services (this year)

Staffing

  • Brian on annual leave (A/L)
  • Tim at LTUG (Mon-Wed)
  • Shaun away (?)
  • CASTOR on-call person: Chris