Tier1 Operations Report 2010-05-12


RAL Tier1 Operations Report for 12th May 2010.

Review of Issues during week 5th to 12th May 2010.

  • Gdss397, part of ATLASDATADISK, failed over the bank holiday weekend and was only returned to production on the morning of Saturday 8th May. Apart from the delay caused by the bank holiday weekend itself, significant time was taken rebuilding the RAID array.
  • There was a problem with the batch system on Saturday 8th May: the batch system services (PBS, MAUI) needed restarting on the main batch system. There was a delay in getting the CREAM CEs back into production because a further daemon, missed when the main problem was resolved, also needed restarting.
  • On the evening of Monday 10th May disk server gdss68, part of cmsfarmread (Disk0Tape1), crashed. There were no migration candidates on the server at the time, so the impact was negligible. It was returned to service the next day.

Current operational status and issues.

  • We have been failing some SAM tests on the Site BDII. On investigation it appears we are not alone in this, and a GGUS ticket has been raised with the BDII developers.

Declared in the GOC DB:

  • The NGS CE has been declared as an Outage (since 6th May) for decommissioning.
  • At Risks on the remaining Castor instances while a database table is cleaned (as already done for the 'GEN' instance):
    • Monday 17th May: CMS
    • Wednesday 19th May: Atlas
    • Tuesday 25th May: LHCb
  • Tuesday 2nd June (During technical stop) - UPS test (implying site At Risk).

Advanced warning:

The following items remain to be scheduled:

  • CEs taken out of production in rotation (one at a time) while glexec is configured. (CE01 and CE06 done so far.)
  • Preventative maintenance work on transformers in R89. Being scheduled some weeks ahead; likely to lead to two 'At Risks'.
  • Network change to improve bandwidth to the tape system.

Entries in GOC DB starting between 5th and 12th May 2010.

There were two unscheduled entries in the GOC DB for this last week:

  • A problem on the batch system on Saturday (8th)
  • The decommissioning of the NGS service was added to the GOC DB rather late.
  • Castor GEN (srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-t2k): SCHEDULED AT_RISK, 12/05/2010 09:30 to 12/05/2010 11:30 (2 hours). At Risk on Castor 'GEN' instance while redundant information is cleaned up from the Castor database.
  • lcgce06.gridpp.rl.ac.uk: SCHEDULED OUTAGE, 10/05/2010 10:00 to 12/05/2010 17:00 (2 days, 7 hours). Reconfiguring lcgce06 to enable mappings for glexec.
  • All CEs: UNSCHEDULED OUTAGE, 08/05/2010 06:00 to 08/05/2010 09:00 (3 hours). Problem with batch service; resolved during Saturday morning by on-call. (Outage added retrospectively.)
  • ce.ngs.rl.ac.uk: UNSCHEDULED OUTAGE, 06/05/2010 10:00 to 14/05/2010 16:00 (8 days, 6 hours). Downtime while decommissioning the CE.