Tier1 Operations Report 2010-05-19
From GridPP Wiki
RAL Tier1 Operations Report for 19th May 2010.
Review of Issues during week 12th to 19th May 2010.
- gdss380 (lhcbmdst) crashed following a disk failure on Friday night / Saturday morning. Later, on Sunday, GGUS ticket 58253 was raised about disk space, and a replacement server was put into the lhcbmdst service class that day. Problems with files being unavailable persisted until this morning (19th May).
- A tape problem on Friday 14th caused the loss of 2 files belonging to the MINOS VO. The MINOS experiment representatives were informed by email.
- A tape problem on Monday 17th caused the loss of a single file for Atlas, who were informed via a GGUS ticket.
Current operational status and issues.
- We are still failing some SAM tests on the site BDII. This is not likely to be fixed in the current SAM setup - see GGUS ticket 58054.
- Ongoing issues with the Atlas software server - lcg0617. There are plans in place to replace this machine.
Declared in the GOC DB:
- At Risks on the remaining Castor instances while a database table is cleaned (as already done for the 'GEN' & 'CMS' instances):
  - Wednesday 19th May: Atlas (today)
  - Tuesday 25th May: LHCb
- Wednesday 19th May: lcgce07.gridpp.rl.ac.uk Downtime for glexec upgrade.
- Until Wednesday 26th May: lcgvo-02-21.gridpp.rl.ac.uk VO box not yet in production.
- Tuesday 1st June (During technical stop) - UPS test (implying site At Risk).
Advance warning:
The following items remain to be scheduled:
- Oracle patching of databases. Will lead to "At Risks" on OGMA (Atlas 3D) on Tuesday 25th May, LUGH (LHCb 3D & LFC) Thursday 27th May and SOMNUS (LFC, FTS) on Wednesday 2nd June.
- Doubling of the network link to the network stack for the tape robot and Castor head nodes. Tuesday 1st June (during the technical stop). Will require a Castor stop, FTS drain and batch pause. Planned to coincide with the UPS test.
- CEs taken out of production in rotation (one at a time) while glexec configured. (CE01 and CE06 done so far, CE07 in progress.)
- Preventative maintenance work on transformers in R89. Being scheduled some weeks ahead. Likely to lead to two 'At Risks'.
- Closure of SL4 batch workers at RAL
Entries in GOC DB starting between 12th and 19th May 2010.
There were 2 unscheduled entries in the GOC DB for this last week.
- lcgce07's scheduled downtime was mistakenly set to end too early and had to be extended with an unscheduled entry.
- lcgvo-02-21.gridpp.rl.ac.uk is a CMS VO box not yet in production.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
---|---|---|---|---|---|---
lcgce07.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 17/05/2010 17:00 | 19/05/2010 17:00 | 2 days | Downtime to reconfigure glexec. The previous downtime for this machine was mistakenly set to end on 17/05/2010. |
lcgvo-02-21.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 17/05/2010 11:00 | 26/05/2010 11:00 | 9 days | CMS SL5 Phedex VObox not yet in production. |
lcgce07.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 17/05/2010 10:00 | 17/05/2010 17:00 | 7 hours | Downtime to reconfigure glexec. Alternate CEs are available for the affected VOs. |
srm-cms.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 17/05/2010 09:30 | 17/05/2010 11:30 | 2 hours | At Risk on the Castor CMS instance while redundant information is cleaned up from the Castor database. |
lcgvo-02-21.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 14/05/2010 15:15 | 17/05/2010 11:00 | 2 days, 19 hours and 45 minutes | CMS SL5 Phedex VObox not yet in production. |
srm-alice.gridpp.rl.ac.uk, srm-dteam.gridpp.rl.ac.uk, srm-hone.gridpp.rl.ac.uk, srm-ilc.gridpp.rl.ac.uk, srm-mice.gridpp.rl.ac.uk, srm-minos.gridpp.rl.ac.uk, srm-t2k.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 12/05/2010 09:30 | 12/05/2010 11:30 | 2 hours | At Risk on the Castor 'GEN' instance while redundant information is cleaned up from the Castor database. |
lcgce06.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 10/05/2010 10:00 | 12/05/2010 11:35 | 2 days, 1 hour and 35 minutes | Reconfiguring lcgce06 to enable mappings for glexec. |