RAL Tier1 weekly operations castor 21/09/2009

From GridPP Wiki
Revision as of 14:43, 21 September 2009 by Matt viljoen (Talk | contribs)

Summary of Previous Week

  • NS 2.1.8-3 upgrade & testing (Chris, All)
  • SRM 2.8 upgrade on CMS (Shaun, DB Team)
  • Implementing database performance tuning (DB Team)
  • Updating database kernels (Cheney)
  • Dealing with D2D Transfer incident following NS upgrade (All)
  • Investigating distributing RAID5/6 servers across service classes (Brian)
  • Installation of new CASTOR servers (Tim)
  • T10KB tape deployment and hardware plans (Tim/Matt)
  • Strategic plans updates (Matt)
  • Preprod plans (Matt, Chris, Richard)

Developments for this week

  • SRM 2.8 upgrade on ATLAS (Shaun, DB Team)
  • Finalizing testing for CIP 2.0 (Jens)
  • Investigating cause of D2D Transfer incident (Chris)
  • Preparing disk server deployment documentation (Chris)
  • Investigating distributing RAID5/6 servers across service classes (Brian)
  • Investigating cause of DB hardware problems (Cheney)
  • Acceptance testing new CASTOR servers (Richard)
  • Chasing up strategic objectives (Matt)
  • Disaster recovery documentation (Matt)

Ongoing

  • CastorMon monitoring graphs for GEN instance (Brian)
  • Setting up Preproduction (Richard, Chris)

Operations Issues

  • D2D Transfer incident following NS upgrade affecting all instances

Blocking issues

  • Problems with the Ganglia check on the GEN instance are delaying work on monitoring (in hand)

Planned, Scheduled and Cancelled Down Times

Entries in, or planned to go to, GOCDB

Description                                                     | Start        | End          | Type     | Affected VO(s)
Upgrade ATLAS SRM to 2.8                                        | 21/9/09 1000 | 21/9/09 1200 | Downtime | ATLAS
Oracle patch to prevent reoccurrence of recent hardware problem | 21/9/09 1200 | 21/9/09 1400 | At risk  | All instances
Suspend CASTOR during R89 UPS test                              | 22/9/09 0800 | 22/9/09 1000 | Downtime | All
CIP 2.0 upgrade                                                 | 29/9/09 1200 | 29/9/09 1400 | At risk  | All

Changes to Production Milestones

Description                   | Changed | Status
SRM upgrade to 2.8 (H)        | Shaun   | DONE
Nameserver upgrade to 2.8 (L) | Chris   | DONE
Move CMS to T10KB (M)         | Tim     | Ongoing. Meeting with AS and Chris Brew about how to implement this on 18/9/09.

Advanced Planning

  • CIP upgrade to include nearline publishing (Sept)
  • Black and white lists? (delayed until required on a per-instance basis)
  • Improve resiliency to central services (This year)

Staffing

  • Brian away Monday and Tuesday
  • Richard away Monday
  • CASTOR on-call person: Matthew