RAL Tier1 weekly operations castor 06/12/2010

Operations News

CASTOR 2.1.9-10 has been installed on the Facilities instance, which has now been returned to users for further testing.
New puppetmaster has now been installed and is controlling all Facilities and preprod disk servers.
All CASTOR EMC RAID arrays are now fed from UPS on one of their power supplies.

The slow network problem affecting 47 disk servers was traced to a faulty transceiver, which was replaced on Tuesday. All instances apart from Gen suffered from this problem, especially CMS.
A power blip on Wednesday afternoon knocked out all disk servers on phase 'A' - around 100 disk servers. Most came up without problem. As of Friday morning, only gdss77 is still out of production.
On Thursday/Friday night an unknown problem caused the robot 'playground' to fill up with parked tapes and disabled a number of handbots. An engineer was called out, returned the parked tapes to production and freed up the disabled handbots.
The CMS instance was heavily loaded over the weekend. gdss310 stopped responding, and removing it from CASTOR helped. The SRMs crashed repeatedly because StatusOfBringOnline requests were crashing the frontend. The CMS SRMs were upgraded to 2.8-6 to prevent a recurrence.

The lack of production-class hardware running Oracle 10g needs to be resolved before CASTOR for Facilities goes into full production.

Entries in/planned to go to GOCDB

Description	Start	End	Type	Affected VO(s)
Update ATLAS to 2.1.9-6	06/12/2010 08:00	08/12/2010 18:00	Downtime	ATLAS

Deploy new puppetmaster
Upgrade ATLAS, CMS and Gen disk servers to a 64-bit OS
CASTOR upgrade to 2.1.9-10 and SRM upgrade to 2.10 to fix the 'unavailable' status reported to FTS when disk servers are draining
CASTOR upgrade to 2.1.9-10 which incorporates the fix for gridftp-internal to support multiple service classes, enabling checksums for Gen
CASTOR for Facilities instance in production by end of 2010