RAL Tier1 weekly operations castor 28/12/2009

From GridPP Wiki
Revision as of 11:42, 24 December 2009 by Matt viljoen (Talk | contribs)

Summary of Previous Week

  • Deployed production CIP on new hardware and new CIP for T2K (Matthew, Matt)
  • Restarted CASTOR during kernel reboots (Matthew)
  • Debugged and fixed problems on Gen LSF after restart (Shaun, Chris, Matthew)
  • Finished configuring EMC hardware and delivered it to DB team for testing (Tim, Cheney)
  • Fixed castormon (Brian)
  • Analyzed SCSI error count on DB rack nodes connected to Overland disk array (Cheney)
  • Assessed threat of SCSI errors to production database (Matthew, Cheney, Tim)
  • Tested newly deployed CIPs (Matthew)
  • CoD work (Matthew)

Developments for this week

  • Christmas holiday cover - daily checks (Matthew, Shaun)

Ongoing work

  • Investigate lhcbUser D2D copy problems (Matthew)
  • Further build work on castoradm1 replacement (Cheney)

Operations Issues

  • Continuing SCSI errors appearing on rack nodes connected to Overland. Power related?
  • rmMasterDaemon crashed on startup with a glibc error, causing an unscheduled outage on Gen.

Blocking issues

  • Lack of Quattor configuration files for SLC4.8 is stopping us evaluating Quattor alongside CASTOR 2.1.8. Preprod setup will initially proceed with a Kickstart-based deployment.
  • Preprod DB can only be delivered after EMC testing is done (1st week after Jan'10)

Planned, Scheduled and Cancelled Interventions

  • UPS bypass test (5/12/10 at-risk)
  • Switch back to EMC kit (at-risk during Jan, date TBC)
  • Memory upgrade on DB node (5-day at-risk during Jan, date TBC)
  • Replace DB voting disk (downtime during Jan, date TBC)

Advanced Planning

  • Gen upgrade to 2.1.8 2010Q1
  • Install/enable gridftp-internal on Gen (This year/before 2.1.8 upgrade)

Staffing

  • CASTOR on-call persons: Matthew, Shaun