RAL Tier1 weekly operations castor 20/02/2012
From GridPP Wiki
Revision as of 16:18, 20 February 2012 by Matt viljoen (Talk | contribs)
Operations News
- Nameserver upgraded to 2.1.11-8
- CMS upgraded to 2.1.11-8
Operations Problems
- SRM problems following the nameserver upgrade were linked to a failure to update an alias that was still pointing to the old nameserver (castorvmgr.ads.rl.ac.uk).
- The upgraded VMGR caused heavy load. We were running it on both nameservers, as before the upgrade. Once one of them was turned off, the problem ceased.
- Ongoing crashing of SRMs, especially ATLAS. A better restarter has been put in place. Possible causes are:
  - SL4 rpms (OS is SL5). We are configuring and testing the preprod SRM setup with upgraded rpms.
  - grid-mapfile distribution. A workaround is already in place.
  - some other memory problems
Blocking Issues
- none
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB
Description | Start | End | Type | Affected VO(s) | Lead by |
---|---|---|---|---|---|
CASTOR 2.1.11-8 ATLAS Stager upgrade, inc. move to new hardware+SL5+Quattor | 22/02/2012 08:00 | 22/02/2012 16:00 | Downtime | ATLAS | Matthew |
CASTOR 2.1.11-8 LHCb Stager upgrade, inc. move to new hardware+SL5+Quattor | 27/02/2012 08:00 | 27/02/2012 16:00 | Downtime | LHCb | Matthew |
CASTOR 2.1.11-8 Gen Stager upgrade, inc. move to new hardware+SL5+Quattor | 29/02/2012 08:00 | 29/02/2012 16:00 | Downtime | Gen | Matthew |
Advanced Planning
- Move Tier1 instances to the new database infrastructure, with a Dataguard backup instance in R26
- Switch from LSF to Transfer Manager (TM) after the 2.1.11 upgrade. TM will need more thorough stress-testing on preprod.
- Start using Tape Gateway once CERN have been using it in production for approx. 2 months.
Staffing
- Castor on Call person: Shaun
- Staff absence/out of the office:
  - Shaun (Tues)