RAL Tier1 weekly operations castor 12/10/2009

From GridPP Wiki
Revision as of 10:49, 12 October 2009 by Matt viljoen

Summary of Previous Week

  • Dealing with fallout from ORACLE disk controller crash & getting back to service (All)
  • Adding new RAID controller into D1T0 disk servers (Chris, Matt, Prod team, Fabric team)
  • Preparing for CASTOR F2F meeting (Matt, Chris)

Developments for this week

  • CASTOR F2F meeting (Matt, Chris)
  • Set up 2.1.8 on repack server with Puppet (Chris)
  • Working on Puppet manifest for polymorphic central servers (Chris)

Ongoing

  • 2.8-1 deployment on Gen, LHCb, CMS (Shaun)
  • CastorMon monitoring graphs for Gen instance (Brian)
  • Black and White list tests (Chris)
  • Disaster recovery document (Matt)

Operations Issues

  • ORACLE disk controller crash
  • Lost data resulting from crash (TBC).

Blocking issues

  • Problems with ganglia check on GEN instance delaying work on monitoring (in hand)

Planned, Scheduled and Cancelled Down Times

none

Changes to Production Milestones

Advanced Planning

  • Add extra RAID controller to LHCb D1T0 servers
  • Black and White lists? (delayed until it is required on a 'per-instance' basis)
  • Improve resilience of central services (this year)

Staffing

  • Brian A/L
  • Richard away
  • Matt, Chris at CERN (Mon-Wed)
  • CASTOR on-call person: Matt, Tim (during Mon-Wed daytime only)