RAL Tier1 weekly operations castor 05/03/2012

From GridPP Wiki

Latest revision as of 07:47, 5 March 2012

Operations News

  • Gen upgraded to 2.1.11-8. The only remaining SL4 headnodes are now the older tape servers.
  • (Thu) Emergency upgrade of the SRMs to 2.11-1 to work around the FED crashing. All SL4 RPMs on lcgsrm03 were upgraded to SL5 to try to fix the underlying problem.

Operations Problems

  • The SRM crashes continued until Thursday, when we upgraded the SRMs, which fixed the crashing.
  • Newly deployed gdss535 (lhcb) was found to have an incorrect routing table; it was removed from production and reinstalled.

Blocking Issues

  • none

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description | Start | End | Type | Affected VO(s) | Lead by
Move to new DB hardware with DataGuard | 6 Mar 10:00 | 6 Mar 14:00 | Downtime | All | Richard
CIP 2.2.0 upgrade (STC) | TBD | TBD | At-risk | All | Matthew

Advanced Planning

  • Test and re-apply CIP upgrade
  • Move Tier1 instances to the new database infrastructure, with a DataGuard backup instance in R26.
  • Stress testing of *11 generation disk servers in preprod during March
  • Switch from LSF to Transfer Manager (TM) after the 2.1.11 upgrade. TM will need more thorough stress-testing on preprod with more disk servers.
  • Start using Tape Gateway once CERN have been using it in production for approx. 2 months.

Staffing

  • Castor on Call person: Chris
  • Staff absence/out of the office:
    • Shaun at EUDAT (Wed-Fri)
    • Rob on training all week