RAL Tier1 weekly operations castor 20/02/2012

From GridPP Wiki

Operations News

  • Nameserver upgraded to 2.1.11-8
  • CMS upgraded to 2.1.11-8

Operations Problems

  • SRM problems following the nameserver upgrade were linked to a failure to update an alias pointing to the old nameserver (castorvmgr.ads.rl.ac.uk).
  • The upgraded VMGR caused heavy load while running on both nameservers, as before. Once one instance was turned off, the problem ceased.
  • Ongoing crashing of SRMs, especially ATLAS. A better restarter has been put into place. Possible causes are:
    • SL4 rpms (OS is SL5). We are configuring and testing the preprod SRM setup with upgraded rpms
    • grid-mapfile distribution. A workaround is already in place
    • some other memory problems
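
The alias problem in the first item above is the kind of thing a short post-upgrade check can catch. What follows is only a minimal sketch, not the procedure used at RAL: it assumes the alias and the new nameserver host are both resolvable through ordinary DNS, and the function names (resolve_ips, alias_points_to) are hypothetical helpers introduced here for illustration.

```python
import socket


def resolve_ips(hostname):
    """Return the set of IPv4 addresses that hostname currently resolves to."""
    _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    return set(addresses)


def alias_points_to(alias, expected_host, resolver=resolve_ips):
    """Return True if every address behind alias also belongs to expected_host.

    The resolver argument is injectable so the check can be exercised
    without touching real DNS (an assumption made for testability).
    """
    alias_ips = resolver(alias)
    expected_ips = resolver(expected_host)
    # An alias that resolves to nothing is treated as a failure.
    return bool(alias_ips) and alias_ips <= expected_ips
```

Such a check would be called with the alias (for example castorvmgr.ads.rl.ac.uk, if that is the alias rather than the old host) and the name of the new nameserver host, and a False result would flag the alias as stale before the SRMs are brought back. With the default resolver, a name that does not resolve raises an OSError subclass instead of returning False.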

Blocking Issues

  • none

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description | Start | End | Type | Affected VO(s) | Lead by
CASTOR 2.1.11-8 ATLAS Stager upgrade, inc. move to new hardware+SL5+Quattor | 22/02/2012 08:00 | 22/02/2012 16:00 | Downtime | ATLAS | Matthew
CASTOR 2.1.11-8 LHCb Stager upgrade, inc. move to new hardware+SL5+Quattor | 27/02/2012 08:00 | 27/02/2012 16:00 | Downtime | LHCb | Matthew
CASTOR 2.1.11-8 Gen Stager upgrade, inc. move to new hardware+SL5+Quattor | 29/02/2012 08:00 | 29/02/2012 16:00 | Downtime | Gen | Matthew

Advanced Planning

  • Move Tier1 instances to the new database infrastructure, with a Dataguard backup instance in R26.
  • Switch from LSF to Transfer Manager (TM) after the 2.1.11 upgrade. TM will need more thorough stress-testing on preprod.
  • Start using Tape Gateway once CERN have been using it in production for approx. 2 months.

Staffing

  • Castor on Call person: Shaun
  • Staff absence/out of the office:
    • Shaun (Tues)