RAL Tier1 weekly operations castor 24/08/2009

From GridPP Wiki

Summary of Previous Week

  • CIP development on certification (Jens)
  • Wrote LSF log archiving mechanism, which archives logs for 3 months on DMF and keeps the last 24 hours of logs on the local partition (Chris)
  • Restart and recovery of all instances following AC failure (All)
  • Development of load test generator for database tuning testing (Chris)
  • Redeployed hotspare from preprod to production instances (Chris)
  • Identification of historical data for deletion from ATLAS MCDISK (Brian)
  • Improved database metric monitoring (Eter)
  • Fixed (?) partitioning problem on ATLAS DLF (Rich)
  • Investigation of problems with repacking some tapes (Tim/Shaun)
  • Media cleaning and robot checkout following water ingress (Tim)
  • Deployment of new puppet restarter (Shaun)
  • Tested database cleanup procedure in preparation for the CASTOR 2.1.8 upgrade (Shaun)
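
The retention policy noted above for the LSF log archiving mechanism (last 24 hours on the local partition, 3 months on DMF) could be sketched roughly as follows. This is only an illustration, not the mechanism actually written: the function name `classify_log` is hypothetical, "3 months" is approximated as 90 days, and the report does not say whether logs are copied or moved to DMF.

```python
from datetime import datetime, timedelta

# Retention windows taken from the summary above.
# Assumption: "3 months" is approximated as 90 days.
LOCAL_RETENTION = timedelta(hours=24)
ARCHIVE_RETENTION = timedelta(days=90)


def classify_log(mtime: datetime, now: datetime) -> str:
    """Decide where a log file belongs, based on its age.

    Returns 'local' (keep on the local partition), 'archive'
    (keep only in the archive, e.g. DMF) or 'expired' (older
    than the archive retention window).
    """
    age = now - mtime
    if age <= LOCAL_RETENTION:
        return "local"
    if age <= ARCHIVE_RETENTION:
        return "archive"
    return "expired"
```

A cron-style job could call such a function for each log file's modification time and act on the result.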

Developments for this week

  • CIP development on certification (Jens)
  • Implement newly written lsf archiving mechanism in production (Chris)
  • Review and update disk server deployment procedure (Chris)
  • Co-ordinate ATLAS data reconciliation (Brian)
  • CASTOR Database tuning testing (Database team)
  • Prepare certification for SRM 2.8 testing (Shaun)

Ongoing

  • CastorMon monitoring graphs for Gen instance (Brian)
  • Cleaning up database for a future 2.1.8 upgrade (Shaun)
  • Setting up Preproduction (Matt, Chris)

Operations Issues

  • Lost gdss169. ATLAS has been informed and the data should now be replicating.
  • Expired CRLs caused problems bringing CASTOR instances back up.
  • Tape migration problems on CMS on 21 Aug. 2009

Blocking issues

  • Problems with the Ganglia check on the GEN instance are delaying work on monitoring
  • Testing of GridFTP internal is delaying work on the SRM upgrade

Scheduled and Cancelled Down Times

none

Changes to Production Milestones

none

Advanced Planning

  • CIP upgrade to include nearline publishing (August)
  • SRM 2.8 upgrade (August)
  • Work with Fabric to add extra RAID card in remaining Viglen'06 disk servers (Second half of August)
  • Database optimization tasks (September)
  • Upgrade nameserver to 2.1.8 (Possibly during September)
  • Black and White lists? (Possibly during September)
  • Improve resiliency to central services (This year)

Staffing

  • Castor on Call person: Chris
  • Cheney on A/L
  • Matt on A/L