RAL Tier1 weekly operations castor 21/03/2011

Operations News

Tested the CASTOR client upgrade from 2.1.7-27 to 2.1.9-6 on 3 worker nodes.
Power to the remaining CASTOR EMC unit was configured to be fed from the UPS through an isolating transformer during the downtime on 15/3/11.

On 11/3/11, one of the three LHCb SRMs died and was taken out of the DNS round robin. Its replacement has yet to be tested and put back in.
SRM-DB performance was sluggish for ATLAS and CMS on Monday, which suggested network issues, but none could be found. Similar problems affected the LHCb SRMs on Wednesday; this was traced to two "decommissioned" LHCb SRMs (srm204, 205) that were still connecting to the DB.

On 17/3/11, LHCb accidentally deleted 4100 reconstruction files from their 2010 data. We will try to recover them from tape.

Lack of production-class hardware running ORACLE 10g needs to be resolved prior to CASTOR for Facilities going into full production. The hardware has arrived and we are awaiting installation.

Entries in/planned to go to GOCDB

Description	Start	End	Type	Affected VO(s)
Upgrade CASTOR clients on all WNs from 2.1.7-27 to 2.1.9-6	21 March 10:00	21 March 12:00	At-risk	All
Upgrade CMS to 2.1.10-0 (STC)	28 March 08:00	28 March 16:00	Downtime	CMS
Upgrade ATLAS, LHCb and Gen to 2.1.10-0 (STC)	30 March 08:00	30 March 16:00	Downtime	ATLAS, LHCb, Gen

Move Tier1 instances to new Database infrastructure with a Dataguard backup instance in R26
Move Facilities instance to new Database hardware running 10g
Upgrade tape subsystem to 2.1.10-1 which allows us to support files >2TB
Start migrating from T10KA to T10KC media later this year