Tier1 Operations Report 2011-11-09

RAL Tier1 Operations Report for 9th November 2011

Review of Issues during the week 2nd to 9th November 2011.

  • The Post Mortem (SIR) has been completed for the problems with Castor, or rather the database infrastructure behind Castor, over the weekend of Sat & Sun 22/23 Oct. See:

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20111022_Castor_Outage_RAC_Nodes_Crashing.

  • The Post Mortem (SIR) is still under preparation for the problems with the Atlas Castor instance a week ago (30 Oct - 1 Nov).
  • We note that some VOs reported a loss of availability following the removal of our last non-CREAM CE, lcgce06; their availability calculations had not correctly allowed for CREAM CEs.
  • On Thursday 3rd Nov, following a configuration problem, /tmp was unavailable on the batch worker nodes for around an hour.
  • On Friday 4th Nov there was a problem with the CMS Castor JobManager, which had stopped working at around 07:45. This was fixed by restarting it at 08:30.
  • Overnight Thursday-Friday (3/4th Nov) one of the nodes in the PLUTO RAC (which hosts the Castor databases for CMS & GEN) failed with a hardware problem. The database services within the PLUTO RAC failed over correctly and there was no operational impact. The node was replaced on Friday and, after running for a few days as a member of the RAC without hosting the database, was fully re-enabled yesterday.
  • This morning (Wed 9th Nov) one of the five top-level BDII nodes (lcgbdii0632) failed. The system has now been restarted and the service is again at full strength.

Resolved Disk Server Issues

  • On Friday 4th Nov CMS reported file transfer problems. These were traced to a problem on a single disk server (GDSS295). Investigation revealed that three disk servers recently returned to production had not been configured correctly following a re-installation: some Castor configuration steps had not been carried out at the time and had to be applied afterwards.

Current operational status and issues.

  • The slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance) continue to be investigated. Improvements to rates to/from the RAL Tier1 have been made and now only two channels (NIKHEF to RAL; Birmingham to RAL) remain below an acceptable threshold.
  • We continue to work with the Perfsonar network test system to understand some anomalies that have been seen.

Ongoing Disk Server Issues

  • Gdss456 (AtlasDataDisk) failed with a read-only file system on Wednesday 28th September. The Fabric Team has completed its work on this server and it is awaiting re-deployment.
  • As reported in recent weeks, we are seeing a high number of 'SMART' errors reported by a particular batch of disk servers. Most of these are spurious and are resolved by an updated version of the disk controller firmware. This update has been successfully applied to the D0T1 disk servers and will be rolled out to the affected D1T0 disk servers over the next week or two.

Notable Changes made this last week

  • The update to the "WAN tuning" (tcp sysctl) settings that was removed last week as part of investigations into Atlas Castor problems have been partially replied to continue actions pursuing the asymmetric file transfer rates.
  • LCGCE06, our last non-CREAM CE, has been drained ready for decommissioning.
  • Monday 7th November: Update to CIP (Castor Information Provider) to fix problem of over-reporting tape capacity.
  • The merger of the Atlas tape-backed disk pools has been completed.
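
For illustration, "WAN tuning" of this kind normally means raising the kernel's TCP buffer limits via sysctl so that transfers over high-latency wide-area links can use a sufficiently large TCP window. The fragment below is a generic sketch only; the actual parameters and values applied at RAL are not recorded in this report.

    # /etc/sysctl.conf fragment - generic example, not the values used at RAL
    # Raise the maximum socket buffer sizes for high bandwidth-delay-product links
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    # TCP receive and send buffers: min, default, max (bytes)
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216
    # Apply the settings without a reboot:  sysctl -p /etc/sysctl.conf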

Forthcoming Work & Interventions

  • Tuesday 15th November: Update to the site firewall. We have been warned of a 30-minute break in connectivity.

Declared in the GOC DB

  • None

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced:

  • Update the disk controller firmware on D1T0 nodes in the batch of servers reporting spurious SMART errors.
  • There are also plans to move part of the cooling system onto the UPS supply; this may require a complete outage (including systems on the UPS).
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure. This will only proceed once the problem that caused the cancellation of the first stage of this work last week is understood and fixed.
  • Networking change required to extend range of addresses that route over the OPN.
  • Address the permissions problem regarding Atlas user access to all Atlas data.
  • Replace hardware running Castor Head Nodes (aimed for end of year).

Entries in GOC DB starting between 2nd and 9th November 2011.

There were no entries in the GOC DB for this last week.

Open GGUS Tickets

GGUS ID | Level  | Urgency     | State             | Creation   | Last Update | VO    | Subject
76023   | Yellow | very urgent | in progress       | 2011-11-05 | 2011-11-08  |       | LB query failed
75395   | Red    | urgent      | waiting for reply | 2011-10-17 | 2011-11-02  | T2K   | WMS 'jumping'
74353   | Red    | very urgent | waiting for reply | 2011-09-16 | 2011-11-07  | Pheno | Proxy not renewing properly from WMS
68853   | Red    | less urgent | on hold           | 2011-03-22 | 2011-11-07  |       | Retirement of SL4 and 32bit DPM Head Nodes and Servers (Holding Ticket for Tier2s)
68077   | Red    | less urgent | in progress       | 2011-02-28 | 2011-11-02  |       | Mandatory WLCG InstalledOnlineCapacity not published
64995   | Red    | less urgent | in progress       | 2010-12-03 | 2011-11-02  |       | No GlueSACapability defined for WLCG Storage Areas