RAL Tier1 weekly operations castor 20/02/2012

Operations News

SRM problems following the nameserver change were linked to a failure to update an alias that still pointed to the old nameserver (castorvmgr.ads.rl.ac.uk).
The upgraded VMGR caused heavy load. We were running it on both nameservers, as before; once one instance was turned off, the problem ceased.
SRMs continue to crash, especially the ATLAS SRM. A better restarter has been put in place. Possible causes are:
- SL4 rpms running on an SL5 OS. We are configuring and testing the preprod SRM setup with upgraded rpms.
- grid-mapfile distribution. A workaround is already in place
- some other memory problems

Entries in/planned to go to GOCDB

Description	Start	End	Type	Affected VO(s)	Lead by
CASTOR 2.1.11-8 ATLAS Stager upgrade, inc. move to new hardware+SL5+Quattor	22/02/2012 08:00	22/02/2012 16:00	Downtime	ATLAS	Matthew
CASTOR 2.1.11-8 LHCb Stager upgrade, inc. move to new hardware+SL5+Quattor	27/02/2012 08:00	27/02/2012 16:00	Downtime	LHCb	Matthew
CASTOR 2.1.11-8 Gen Stager upgrade, inc. move to new hardware+SL5+Quattor	29/02/2012 08:00	29/02/2012 16:00	Downtime	Gen	Matthew

Move Tier1 instances to new database infrastructure with a Dataguard backup instance in R26.
Switch from LSF to Transfer Manager after the 2.1.11 upgrade. We will need to stress-test TM more thoroughly on preprod.
Start using Tape Gateway once CERN have been using it in production for approx. 2 months.