RAL Tier1 weekly operations castor 29/11/2010
From GridPP Wiki
Latest revision as of 15:16, 29 November 2010 by Matt viljoen
Operations News
- On 25/11/10 all ATLAS and CMS SL08 disk servers were put into read-only mode via LSF, to prevent further file loss in the event of another catastrophic crash.
Operations Issues
- On 22/11/10 CMS experienced slowness transferring files from cmsWanOut. Three disk servers were running very hot. Putting them into draining mode to redistribute the hot files helped.
- On 24/11/10 at 00:34 and again on 27/11/10 at 22:59 the CMS jobmanager stopped processing requests for approx. 30 minutes (on both occasions) for unknown reasons. Afterwards it resumed operating normally. During these periods, transfers to/from RAL failed. We have enabled a second jobmanager instance on CMS to protect against a future recurrence.
- Very slow connectivity is affecting a number of disk servers across CMS, ATLAS and LHCb. Indications are that there may be a common underlying networking problem.
Blocking issues
- Lack of production-class hardware running Oracle 10g needs to be resolved before CASTOR for Facilities can go into full production.
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB
Description | Start | End | Type | Affected VO(s)
---|---|---|---|---
Update ATLAS to 2.1.9-6 | 06/12/2010 08:00 | 08/12/2010 18:00 | Downtime | ATLAS
Advanced Planning
- Deploy new puppetmaster
- Upgrade ATLAS, CMS and Gen disk servers to a 64-bit OS
- CASTOR upgrade to 2.1.9-10 and SRM upgrade to 2.10, to fix the "unavailable" status being reported to FTS while disk servers are draining
- CASTOR upgrade to 2.1.9-10, which incorporates the gridftp-internal fix for supporting multiple service classes, enabling checksums for Gen
- CASTOR for Facilities instance in production by end of 2010
Staffing
- CASTOR on-call person: Matthew
- Staff absence/out of the office:
- Matthew on annual leave Friday PM