Tier1 Operations Report 2010-02-24


RAL Tier1 Operations Report for 24th February 2010.

Review of Issues during week 17th to 24th February 2010.

  • An LHCb disk server, GDSS378, was reported (by LHCb) as having problems over the weekend. This was traced to a network (routing) table problem, finally fixed on Tuesday by flushing the ARP cache. This is similar to the problem that occurred on two Atlas nodes last week.
  • The long standing problem with Castor Disk-to-Disk copies for LHCb from the LHCbUser Service Class has been fixed.
  • Following the increase in the memory limit on the '3GB' batch queue to 4GB for Atlas, the number of 'blocked' job slots has been monitored. This has not proved to be a significant problem unless other problems on the farm exacerbate the issue (see the illustrative arithmetic after this list).
  • (Note added after meeting) A CMS disk server (gdss364, part of cmsFarmRead - D0T1) has had a problem with its disk controller card.
  • (Note added after meeting) The Atlas MCDISK area became full. This was because some disk servers had been drained ahead of their replacement. These were marked as 'disabled', but in some views their disk space is still visible. Three disk servers were put back into production on Monday (22nd Feb) to resolve this issue.
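
The effect of the higher memory limit on 'blocked' job slots mentioned above is essentially arithmetic: once each job may claim up to 4GB, a worker node may not have enough memory to fill all of its CPU slots. The Python sketch below illustrates this; the 8-core, 16GB node is a hypothetical example, not the actual RAL farm configuration.

  def blocked_slots(node_mem_gb, cores, job_mem_limit_gb):
      """Job slots on a node that cannot be used once every job
      may claim up to job_mem_limit_gb of memory."""
      runnable = min(cores, node_mem_gb // job_mem_limit_gb)
      return cores - runnable

  # Example: an 8-core node with 16 GB of RAM.
  for limit_gb in (3, 4):
      print(f"{limit_gb} GB limit: {blocked_slots(16, 8, limit_gb)} slots blocked")
  # 3 GB limit: 3 slots blocked
  # 4 GB limit: 4 slots blocked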

Current operational status and issues.

  • Load problem on the Atlas software server seen in the early hours of Monday 22nd February. At around 01:30 the network traffic in/out of the software server showed a significant change in behaviour (a sketch of sampling such traffic counters follows this list). On Monday the software server was very unresponsive and it was clear that many Atlas batch jobs were not advancing. The situation persisted, and during Tuesday all Atlas batch work was paused. The server was rebooted, but on restarting a small number of batch jobs, similar access patterns to the software server were seen again. At the end of Tuesday all the paused Atlas jobs were restarted, with the aim of killing them; however, most (if not all) died immediately. A job limit of 50 jobs was left in place overnight and the access patterns to the software server looked normal again on Wednesday morning. The job limit will be lifted once this morning's interventions are completed.
  • The Castor system remains running with less resilience than hoped, as reported last week. Work has taken place to improve resilience where practical (e.g. the memory upgrades in the Castor Oracle RAC nodes have been completed).
  • On 31st December a recurrence of the Castor/Oracle 'BigID' problem was seen. This is still under investigation.
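
The change in the software server's traffic pattern noted above was visible in its network counters. Purely as an illustration (this is not the monitoring actually used at RAL), the Python sketch below samples the byte counters in /proc/net/dev on a Linux host and prints average in/out rates; the interface name "eth0" is a placeholder.

  import time

  def iface_bytes(iface="eth0"):
      """Return (rx_bytes, tx_bytes) for an interface from /proc/net/dev."""
      with open("/proc/net/dev") as f:
          for line in f:
              if line.strip().startswith(iface + ":"):
                  fields = line.split(":", 1)[1].split()
                  return int(fields[0]), int(fields[8])
      raise ValueError(f"interface {iface} not found")

  rx1, tx1 = iface_bytes()
  time.sleep(10)                    # sample interval in seconds
  rx2, tx2 = iface_bytes()
  print(f"in:  {(rx2 - rx1) / 10 / 1e6:.2f} MB/s")
  print(f"out: {(tx2 - tx1) / 10 / 1e6:.2f} MB/s")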

Advance warning:

There are no items scheduled in the GOC DB.

  • This morning there was an intervention to change a parameter in the Castor Oracle databases to fix the ten-minute 'hang' when a RAC node reboots, and an At Risk on the 3D services (and lhcb-lfc) while an Oracle patch was applied.

The following items remain to be scheduled:

  • Investigations into the lack of resilience of the Castor Oracle infrastructure may produce a requirement for an intervention. In addition the following changes in this system have not yet been carried out.
    • Removal of the unstable node from the Oracle RAC and its replacement by another node.
  • Adding the second CPU back into the UKLIGHT router.
  • At Risks for roll-out of Oracle security patch (already done on 3D systems).
  • Upgrade to FTS 2.2.

Entries in GOC DB starting between 17th and 24th February 2010.

No UNSCHEDULED outages

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lhcb-lfc, lugh, ogma | SCHEDULED | AT_RISK | 24/02/2010 11:00 | 24/02/2010 12:00 | 1 hour | At Risk on 3D services and LHCb-LFC during application of Oracle security patch.
All Castor & batch | SCHEDULED | OUTAGE | 24/02/2010 10:00 | 24/02/2010 11:00 | 1 hour | Outage to Castor and Batch during a reconfiguration of Castor Oracle back-end databases. Castor will be unavailable and the batch system will be paused at that time.
lcgfts.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 24/02/2010 09:00 | 24/02/2010 11:00 | 2 hours | Outage of FTS during Castor downtime. This includes a drain of FTS transfers ahead of the intervention.
All Castor | SCHEDULED | AT_RISK | 23/02/2010 10:00 | 23/02/2010 16:00 | 6 hours | At-risk for CASTOR to reconfigure NFS to backup from a different location. The date of this at risk has been changed from the 18th due to staffing constraints.
lcgce07 | SCHEDULED | OUTAGE | 19/02/2010 14:00 | 23/02/2010 16:00 | 4 days, 2 hours | We need to take lcgce07 offline to repair a broken software RAID. This downtime is to allow for draining and the repair.
All Castor | SCHEDULED | AT_RISK | 17/02/2010 10:30 | 17/02/2010 16:00 | 5 hours, 30 minutes | At Risk on castor during memory update on Oracle RAC nodes. This has been moved from Tuesday because of staffing constraints.