RAL Tier1 weekly operations castor 02/01/2012

From GridPP Wiki

Operations News

  • none

Operations Problems

  • atlasStager var partition close to the limit on 26th Dec
  • read-only disk server gdss307 (cmsWanIn) on 27th Dec
  • large number of errors in aliceDisk on 28th Dec; investigation showed that all disk servers were busy writing files and some requests were timing out
  • efficiency around 40% for aliceDisk on 29th Dec, due to a problem similar to the one on 28th Dec. The number of xrootd job slots was increased from 50 to 100 to accommodate the requests that were timing out
  • atlasStager var partition close to the limit, and puppetmaster02 unresponsive (it was restarted), on 31st Dec

Blocking Issues

  • none

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description | Start | End | Type | Affected VO(s) | Lead by
Stage 1 of move to new CASTOR DB hardware | 05/01/2012 08:30 | 05/01/2012 16:00 | Downtime | All | Rich
SRM 2.11 upgrade, inc. move to new hardware+SL5+Quattor (STC) | 16/01/2012 08:00 | 18/01/2012 16:00 | Downtime | All | Shaun
CIP 2.2.0 upgrade (STC) | 26/01/2012 10:00 | 26/01/2012 12:00 | At-risk | All | Matthew
Stage 2 of CASTOR DB move (STC) | 07/02/2012 08:00 | 07/02/2012 16:00 | Downtime | All | Rich
CASTOR 2.11-8 upgrade, inc. move to new hardware+SL5+Quattor (STC) | 13/02/2012 08:00 | 24/02/2012 16:00 | Downtime | All | Matthew

Advanced Planning

  • Move Tier1 instances to the new database infrastructure, with a Dataguard backup instance in R26

Staffing

  • CASTOR on-call person: Shaun
  • Staff absence/out of the office:
    • All (Mon)
    • Matthew A/L (all week)