Tier1 Operations Report 2011-01-19

RAL Tier1 Operations Report for 19th January 2011

Review of Issues during the week from 12th to 19th January 2011.

  • Monday 10th - A read-only file system was reported on GDSS327 (AtlasFarm – D0T1), which was taken out of production the same day. The system was returned to production on Wednesday (12th).
  • GDSS496 (CMSFarmRead) was taken out of production on Thursday 6th January following a problem. Two un-migrated files on this server had to be declared lost to CMS (13th Jan). The system is undergoing acceptance testing again before being returned to production.
  • Friday 14th Jan - All three partitions on GDSS283 (which had failed on 25th December) were found to be corrupt, resulting in the loss of 30 CMS files. A Post Mortem is being prepared for this data loss incident. See:
 https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20101225_CMS_Disk_Server_GDSS283_Data_Loss
  • Friday 14th Jan - GDSS305 (CMSWanOut) was removed from production following SCSI errors. It was returned to production that afternoon.
  • Sunday 16th Jan - GDSS189 (Atlas MCTape) reported a read-only file system and was taken out of production. A sample of files was checked (checksum verification; an illustrative sketch follows this list) before the server was returned to production this morning (19th).
  • Last night and this morning (19th Jan) we had a problem with the batch system in which jobs were not being submitted. It was resolved this morning by restarting some components of the batch system; the underlying cause is not yet known.
  • The OPN link to CERN failed over to the backup link at around 7.30am this morning.
  • The Atlas Castor instance was down on Monday and Tuesday (17/18 Jan), during which the disk servers were successfully upgraded to a 64-bit OS. This upgrade is required for checksumming to be enabled. Checksumming was enabled on the Atlas ScratchDisk earlier this morning (19th) for testing and has now been extended to all Castor Atlas service classes.
  • The increase in shared memory for OGMA (Atlas 3D) was done on Tuesday morning, 18th January.
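
The checksum verification mentioned for GDSS189 above could, purely as an illustration, look something like the minimal Python sketch below. This is not the actual Tier1 tooling: the file paths, the expected adler32 values and the sample size are all assumed inputs supplied by whoever runs the check.

  import zlib
  import random

  def adler32_of_file(path, chunk_size=1 << 20):
      """Compute the adler32 checksum of a file, reading it in chunks."""
      value = 1  # standard adler32 seed
      with open(path, "rb") as f:
          while True:
              chunk = f.read(chunk_size)
              if not chunk:
                  break
              value = zlib.adler32(chunk, value)
      return format(value & 0xFFFFFFFF, "08x")

  def verify_sample(expected, sample_size=20):
      """Check a random sample of files against their expected checksums.

      'expected' maps file path -> expected adler32 (hex string); both the
      mapping and the sample size are assumptions for this sketch.
      Returns a list of (path, expected, actual) for any mismatches.
      """
      sample = random.sample(list(expected.items()),
                             min(sample_size, len(expected)))
      mismatches = []
      for path, want in sample:
          got = adler32_of_file(path)
          if int(got, 16) != int(want, 16):
              mismatches.append((path, want, got))
      return mismatches

  # Example usage (hypothetical path and checksum):
  # bad = verify_sample({"/castor/example/file1": "0a1b2c3d"})
  # if bad:
  #     print("checksum mismatches found:", bad)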

Current operational status and issues.

  • Last week we reported a problem with tape migration for LHCb. Some files had bad checksums (believed to be due to failed transfers of the files into Castor), and these were blocking tape migrations. Those particular files were cleaned up, but more have since arrived and tape migration is again blocked.
  • We are aware of a problem with the Castor Job Manager, which can occasionally hang. This has not happened in the last week, but it remains an open issue.
  • On Thursday 23rd December GDSS337 (GenTape) failed. There was only one un-migrated file (for T2K) on it. This server is still out of production and awaiting replacement memory.
  • We have previously reported a problem with disk servers becoming unresponsive. There was one case during the last week: GDSS90 became unresponsive to SSH access during the afternoon of Monday 17th Jan, although checks showed it was still serving Castor requests (a probe of the kind sketched after this list distinguishes these two states). The problem resolved itself at around 4am the following morning. Work is ongoing (tests on the pre-production instance) to understand this failure.
  • Transformer TX2 in R89 is still out of use.
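
The GDSS90 case above illustrates why it is useful to probe the SSH port and a data-service port separately: a server can stop answering SSH while still serving Castor requests. The Python sketch below does only that and is not a description of the actual monitoring in use; the port numbers (22 for SSH, 2811 for a GridFTP-style data service) and the hostname in the example are assumptions.

  import socket

  def port_open(host, port, timeout=5.0):
      """Return True if a TCP connection to host:port succeeds within timeout."""
      try:
          with socket.create_connection((host, port), timeout=timeout):
              return True
      except OSError:
          return False

  def probe_disk_server(host):
      """Probe SSH and a data-transfer port separately.

      A host that fails the SSH probe but passes the data-service probe is
      unresponsive to logins yet may still be serving data, as was seen
      with GDSS90. Port numbers here are assumptions for this sketch.
      """
      return {
          "ssh": port_open(host, 22),
          "data_service": port_open(host, 2811),
      }

  # Example usage (hypothetical hostname):
  # print(probe_disk_server("gdss90.example.ac.uk"))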

Declared in the GOC DB

  • 19th January: At Risk for glite updates to site BDIIs.
  • 22/23 January: Weekend power outage in Atlas building ("At Risk").

Advance warning:

The following items are being discussed and are still to be formally scheduled:

  • Application of kernel update to batch server (some small risk to batch services).
  • Upgrade to 64-bit OS on Castor disk servers to resolve checksumming problem. Possible dates for this are:
    • CMS - Probably Monday 31st Jan / Tuesday 1st Feb.
    • GEN - To Be Decided.
  • Increase shared memory for OGMA, LUGH & SOMNUS (rolling change). OGMA has been done; a sketch of a check on the current setting follows this list.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Upgrade all Oracle databases from version 10.2.0.4 to 10.2.0.5 (assuming this upgrade goes OK at CERN).
    • We are looking at doing some of this at the time of the CMS disk server updates, probably on 31st Jan / 1st Feb.
  • Detailed changes to batch configuration to enable scheduling by node.
  • Network (VLAN) reconfiguration to make more addresses available to the Tier1.
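
For the shared-memory increases listed above (OGMA done, LUGH and SOMNUS to follow), a simple way to see whether a node already meets a chosen limit is to read the kernel setting directly. The Python sketch below does only that comparison; the 8 GB target in the example is an assumption, as the actual figure used for these databases is not stated in this report, and the change itself would be made by an administrator via sysctl rather than by a script like this.

  def read_kernel_param(name):
      """Read an integer kernel parameter from /proc/sys/kernel/ (Linux only)."""
      with open("/proc/sys/kernel/" + name) as f:
          return int(f.read().strip())

  def check_shared_memory(target_shmmax_bytes):
      """Compare the current shmmax setting against a target value.

      The target is an assumption for this sketch. Raising the limit would
      be done via sysctl by an administrator, not by this script.
      """
      current = read_kernel_param("shmmax")
      return {
          "current_shmmax": current,
          "target_shmmax": target_shmmax_bytes,
          "needs_increase": current < target_shmmax_bytes,
      }

  # Example usage (hypothetical 8 GB target):
  # print(check_shared_memory(8 * 1024**3))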

Entries in GOC DB starting between 12th and 19th January 2011.

There was one unscheduled entry, an "At Risk" for updates to the top BDII.

  • site-bdii - SCHEDULED AT_RISK - 19/01/2011 10:00 to 19/01/2011 13:00 (3 hours) - At Risk while applying gLite updates to the RAL site-level BDIIs.
  • lcgce06, lcgce08, lcgce09, srm-atlas - SCHEDULED OUTAGE - 17/01/2011 08:00 to 18/01/2011 16:00 (1 day, 8 hours) - Outage on the CASTOR Atlas instance to upgrade disk servers to SL5 64-bit, quattorize all remaining non-quattorized disk servers and run fsck on disk servers that need it.
  • lcgbdii - UNSCHEDULED AT_RISK - 13/01/2011 14:00 to 13/01/2011 17:00 (3 hours) - At Risk during application of system updates.