RAL Tier1 weekly operations CASTOR 01/11/2010

Work previous week

Matthew:
- GEN Upgrade and Testing
- LHCb problems
- Castor for Facilities planning
Shaun:
- ..
Chris:
- GEN Upgrade and Testing
- Castor Facilities work
- Working on LSF problem in ASGS
Richard:
- Finishing the tape section of the 2.1.9 functional tests on the Facilities instance
- Developing a script to run stress tests by running grid jobs
Brian:
- ..
Jens:
- ..

On 26/10/10 LHCb reported that their SRMs were returning malformed TURLs - affecting approx. 4% of transfers. We're not yet clear what is causing this bug, but the last occurrence was on the morning of 30/10/10.
On 27/10/10 LHCb reported they were having recall problems for a file. CERN informed us that this is due to a known bug.
On 30/10/10, during a reboot of LHCb intended to fix the malformed TURL problem above, one SRM daemon did not reconnect to the database because of a known bug. This created a backlog of requests and poor SRM performance. LHCb was put into downtime until 13:00, by which time the backlog had cleared naturally. On 31/10/10 LHCb was under very high load, which again created a backlog. LHCb was again put into downtime until Monday, when investigations indicated that the problem was load related and the SRMs were reconfigured to improve performance under high load. The SRMs are now being replaced with new hardware (plus one extra SRM) to improve performance further.
On 1/11/10 the ATLAS SRMs crashed repeatedly because a new, unsupported command (statusOfBringOnlineRequest) was being passed to them.

The lack of production-class hardware running Oracle 10g needs to be resolved before CASTOR for Facilities goes into production.

Entries in/planned to go to GOCDB

Description	Start	End	Type	Affected VO(s)
Update CMS to 2.1.9-6 (STC)	08/11/2010 08:00	10/11/2010 18:00	Downtime	CMS
Update ATLAS to 2.1.9-6 (STC)	22/11/2010 08:00	24/11/2010 18:00	Downtime	ATLAS

New SRM machines for LHCb
Upgrade disk servers to a 64-bit OS
CASTOR upgrade to 2.1.9-10 and SRM upgrade to 2.10 to fix the unavailable status being reported to FTS with draining disk servers
CASTOR upgrade to the latest 2.1.9, which incorporates the fix for grid-ftp-internal to support multiple service classes, enabling checksums for GEN
CASTOR for Facilities instance in production by end of 2010