RAL Tier1 Operations Report for 26 May 2010

gdss380 (lhcbmdst) was put back into service on Thursday 20th. It failed again on Saturday morning with symptoms similar to those of its previous failure.
There was a problem with the CIP not publishing GlueSA information on Thursday 20th May 2010. This was noticed and fixed before any tickets were raised.
gdss67 (CMS farmRead) had problems (FSPROBE errors) and was removed from service on the morning of Thursday 20th.
On Thursday 20th there was a problem with gdss228 and gdss229 being deployed into atlasScratchDisk.
On Monday 24th gdss207 (aliceTap) was removed from service due to possible file system corruption.
On Tuesday 25th May 2010 there were problems with the Atlas SRM machines overnight (/var full).
On Tuesday morning, the CMS SRM failed tests because of a stager error.

We are still failing some SAM tests on the site BDII. This is not likely to be fixed in the current SAM setup - see GGUS ticket 58054.
Ongoing issues with the Atlas software server - lcg0617. The plans in place to replace this machine have now been approved.

The following items remain to be scheduled:

Oracle patching of databases. Will lead to "At Risks" on LUGH (LHCb 3D & LFC) Thursday 27th May and SOMNUS (LFC, FTS) on Wednesday 2nd June.
Doubling of network link to network stack for tape robot and Castor head nodes. Tuesday 1st June (during technical stop). Will require a Castor stop, FTS drain and batch pause. Plan to make this coincide with the UPS test.
CEs taken out of production in rotation (one at a time) while glexec is configured. (CE08 in progress.)
Preventative maintenance work on transformers in R89. Being scheduled some weeks ahead. Likely to lead to two 'At Risks'.
Closure of SL4 batch workers at RAL.

There was 1 unscheduled entry in the GOC DB for this last week.

Service	Scheduled?	Outage/At Risk	Start	End	Duration	Reason
ogma.gridpp.rl.ac.uk	SCHEDULED	AT_RISK	25/05/2010 10:00	25/05/2010 12:00	2 hours	At Risk during application of Oracle PSU patches.
srm-lhcb.gridpp.rl.ac.uk	SCHEDULED	AT_RISK	25/05/2010 09:30	25/05/2010 11:30	2 hours	At Risk on Castor LHCb instance while redundant information cleaned up from Castor database.
lcgce08.gridpp.rl.ac.uk	SCHEDULED	OUTAGE	24/05/2010 10:00	26/05/2010 17:00	2 days, 7 hours	CE being reconfigured for glexec roles mapping.
srm-atlas.gridpp.rl.ac.uk	SCHEDULED	AT_RISK	19/05/2010 09:30	19/05/2010 11:30	2 hours	At Risk on Castor 'Atlas' instance while redundant information cleaned up from Castor database.
lcgce07.gridpp.rl.ac.uk	UNSCHEDULED	OUTAGE	17/05/2010 17:00	19/05/2010 17:00	2 days	Downtime to reconfigure glexec. Previous downtime for this machine was mistakenly set to end on 17/05/2010.
lcgvo-02-21.gridpp.rl.ac.uk	SCHEDULED	OUTAGE	17/05/2010 11:00	25/05/2010 13:32	8 days, 2 hours and 32 minutes	CMS SL5 Phedex VObox not yet in production.
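
The Duration column can be cross-checked against the Start and End columns. The following is a minimal sketch, assuming the timestamps use the day/month/year layout shown in the table; the function name downtime_duration is hypothetical and is not part of the GOC DB.

```python
from datetime import datetime

# Timestamp layout assumed from the Start/End columns (dd/mm/yyyy hh:mm).
FMT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> str:
    """Return the elapsed time between two table timestamps as text."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    total_minutes = int(delta.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)

    # Keep only the non-zero units, pluralising where needed.
    parts = []
    for value, unit in ((days, "day"), (hours, "hour"), (minutes, "minute")):
        if value:
            parts.append(f"{value} {unit}{'s' if value != 1 else ''}")
    return ", ".join(parts) if parts else "0 minutes"


if __name__ == "__main__":
    # lcgce07 entry from the table above.
    print(downtime_duration("17/05/2010 17:00", "19/05/2010 17:00"))
```

Run against the lcgce07 row, this gives a duration of exactly two days, and the lcgce08 and lcgvo-02-21 rows can be checked the same way.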
