RAL Tier1 weekly operations castor 27/02/2012

From GridPP Wiki

Operations News

  • ATLAS upgraded to 2.1.11-8
  • Puppet upgraded to 2.7.11-1
  • 'go-faster stripes' enabled on all 'B' and 'C' tape drives
  • preprod now configured with lcgc*03 headnodes (destined for Gen) + preprod NS for Alice xrootd testing
  • preprod SRMs now configured with updated RPMs and ready for testing. It is hoped that the updated RPMs will reduce the periodic crashing.

Operations Problems

  • ATLAS SRM periodic crashing is continuing. The restarter didn't kick in on Thursday, leading to a short period of being blacklisted.
  • cleanLostFiles running against 5 disk servers caused stager slowdown on Thursday evening. From now on we will run no more than 3 cleanLostFiles threads and none out of hours.
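
The new cleanLostFiles limit can be written down as a simple pre-flight check. This is only a sketch: the function names are hypothetical and not part of CASTOR, and the working hours (Monday to Friday, 09:00 to 17:00) are an assumption, since "out of hours" is not defined in these notes.

```python
from datetime import datetime, time

MAX_THREADS = 3            # limit stated in the notes
WORK_START = time(9, 0)    # assumed start of working hours
WORK_END = time(17, 0)     # assumed end of working hours


def in_working_hours(now: datetime) -> bool:
    """True on Monday-Friday from WORK_START (inclusive) to WORK_END (exclusive)."""
    return now.weekday() < 5 and WORK_START <= now.time() < WORK_END


def may_start_clean_lost_files(running_threads: int, now: datetime) -> bool:
    """Allow a new thread only in working hours and below the thread limit."""
    return in_working_hours(now) and running_threads < MAX_THREADS
```

With these assumptions, a fourth thread is refused at any time, and any thread is refused on a Thursday evening such as the one that caused the slowdown.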

Blocking Issues

  • none

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description | Start | End | Type | Affected VO(s) | Lead by
CASTOR 2.1.11-8 LHCb Stager upgrade, inc. move to new hardware+SL5+Quattor | 27/02/2012 08:00 | 27/02/2012 16:00 | Downtime | LHCb | Matthew
CASTOR 2.1.11-8 Gen Stager upgrade, inc. move to new hardware+SL5+Quattor | 29/02/2012 08:00 | 29/02/2012 16:00 | Downtime | Gen | Matthew
CIP 2.2.0 upgrade (STC) | TBD | TBD | At-risk | All | Matthew

Advanced Planning

  • Test and re-apply CIP upgrade
  • Move Tier1 instances to the new database infrastructure, with a Dataguard backup instance in R26.
  • Stress testing of *11 generation disk servers in preprod during March
  • Switch from LSF to Transfer Manager after 2.1.11 upgrade. Will need to better stress-test TM on preprod with more disk servers.
  • Start using Tape Gateway once CERN have been using it in production for approx. 2 months.


  • Castor on Call person: MV
  • Staff absence/out of the office:
    • ..