RAL Tier1 weekly operations castor 21/12/2009

Summary of Previous Week

  • Deployed LSF triplets on new hardware (Chris)
  • Continue working on polymorphic castor head nodes (Chris)
  • Additional kernel + security updates (Chris)
  • CoD work (Chris)
  • Attempted to kickstart the new puppetmaster, but it failed. Will discuss with FT next year (Shaun)
  • SRM development (Shaun)
  • Fixing migration and recall problems (Shaun)
  • Testing 2.1.8-17 clients with SRM2.8-2 on certification (Shaun)
  • Testing new production CIP (Matthew)
  • Testing new T2K CIP (Jens, Matthew)
  • Intervention planning (Matthew)

Developments for this week

  • Reboot central servers to load new kernel (Matthew, All)
  • Implement daily test file archiving for tape-backed storage (Matthew)
  • Testing newly deployed CIPs (Matthew)
  • Assess threat of SCSI errors to production database (Matthew)
  • CoD work (Matthew)

Ongoing work

  • Investigate lhcbUser D2D copy problems (Matthew)
  • Continue building the castoradm1 replacement (Cheney)

Operations Issues

  • Recalls stopped on CMS and Gen. Restarting rtcopyclientd fixed it.
  • Uranus (DLF) hardware problems, possibly power related. Switched to the LPD power supply and now monitoring.
  • Some SCSI errors appearing on production Overland. Power related?
  • JamesJ's key was compromised. Removing it from all systems broke castormon for ~3 days.

Blocking issues

  • Lack of Quattor configuration files for SLC4.8 is preventing us from evaluating Quattor alongside CASTOR 2.1.8. Preprod setup will initially proceed with a Kickstart-based deployment.
  • Preprod DB can only be delivered after EMC testing is done (1st week after Jan'10)

Planned, Scheduled and Cancelled Interventions

  • Deploy new CIP for T2K (21/12/09 1200-1300 at-risk)
  • Replace the CIP hosting machine with a new one on more resilient hardware (21/12/09 1200-1300)
  • Reboot all central servers (22/12/09 0800-1230 downtime)
  • UPS test (5/12/10 at-risk)

Advanced Planning

  • Gen upgrade to 2.1.8 2010Q1
  • Install/enable gridftp-internal on Gen (This year/before 2.1.8 upgrade)

Staffing

  • CASTOR on-call person: Matthew