RAL Tier1 weekly operations castor 02/11/2009

From GridPP Wiki
Revision as of 15:53, 2 November 2009 by Matt viljoen (Talk | contribs)

Summary of Previous Week

  • SRM 2.8-2 deployment on all instances (Shaun)
  • Developed DB fix to allow checksumming to work on 2.1.7 (Shaun)
  • Deploying new disk servers (Chris)
  • Deployed new disk servers for LHCb, ATLAS, CMS (Chris)
  • Setting up repack (Chris)
  • Split grid-mapfile across instances (Chris)
  • Fixed FC problem (Cheney)
  • Bringing up Vulcan array (Cheney)
  • Enabling monitoring of Overland array (Cheney)
  • Enabling SNMP monitoring of arrays (Cheney)
  • Working on tape server problems (Cheney, Tim)
  • Nagios training for John and Tiju (Cheney)
  • Continuing to investigate EMC problems (Tim)
  • CastorMon monitoring graphs for Gen instance (Brian)
  • Draining disk servers for LHCb to move from lhcbDst->lhcbRawRdst (Brian, Matthew)
  • Disaster management of recent data loss (Matthew)
  • Deploying 4 new disk servers for repack server (Matthew)
  • Incident review of CIP upgrade (Matthew, Jens, All)
  • CASTOR-Fabric work proposal (All)

Developments for this week

  • Improving resilience on central servers (Chris, Shaun)
  • Working on puppet manifest for polymorphic central servers (Chris)
  • Set up 2.1.8 on repack server (Chris)
  • Installing T10KB drives (Tim)
  • CastorMon monitoring graphs for Gen instance (Brian)
  • Building Quattor templates for preprod (Richard)
  • Deploying 3 new tape servers for repack server (Matthew)
  • Move 2 disk servers from lhcbDst->lhcbRawRdst (Matthew)
  • Deploying new disk servers (Matthew, Shaun)
  • Reviewing preprod plans (Matthew)
  • Disaster recovery document (Matthew)

Operations Issues

  • A broken FC cable led to a failed DB node and instability of the redistributed databases. In particular, the Nameserver was unavailable for 1h (Mon-Tues).
  • Unknown problems affected the Gen SRM database (Wed)
  • Disk server status became unsynchronized between RMMaster and Stager on CMS for unknown reasons (Thurs)

Blocking issues

none

Planned, Scheduled and Cancelled Down Times

Entries in/planned to go to GOCDB

Description                             | Start         | End           | Type    | Affected VO(s)
Application of Quarterly ORACLE patches | 10/11/09 0900 | 10/11/09 1300 | At Risk | All instances

Changes to Production Milestones

none

Advanced Planning

  • Black and White lists will be tested and introduced on ATLAS
  • Install/enable gridftp-internal on Gen (This year)

Staffing

  • Tim away (Tues)
  • CASTOR on-call person: Chris