Tier1 Operations Report 2011-11-16
RAL Tier1 Operations Report for 16th November 2011
Review of Issues during the week 9th to 16th November 2011.
- The Post Mortem (SIR) has been prepared for the problems with the Atlas Castor instance a week ago (Saturday - Monday 30 Oct - 1 Nov). See: https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20111031_Castor_ATLAS_Outage
- In the middle of last week there were problems with transfers for two LHCb users. This was traced to a root CA certificate that had been introduced in the latest distribution but which, due to an oversight, had not been rolled out to the three LHCb SRMs. (These three SRMs have a different set-up to the other SRMs.)
- Since the weekend there have been problems with Alice access to Castor. The problem remains unresolved despite intensive investigation on Monday and Tuesday. We are currently awaiting input from Alice, and our present thinking is that the problem lies at the Alice end.
- A large number of pilot jobs were reported as failing (GGUS ticket from LHCb). On Monday (14th Nov) two nodes were found to be failing a significant number of jobs, although below the threshold for the 'black hole' detector. These two batch worker nodes were removed from production.
- Around 05:00 this morning (Wednesday 16th) there was a DNS problem that affected CERN. This also affected SAM tests of the Tier1 and had some impact on VO usage of our site.
Resolved Disk Server Issues
- None.
Current operational status and issues.
- The slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance) continue to be investigated. Improvements to rates to/from the RAL Tier1 have been made and now only two channels (NIKHEF to RAL; Birmingham to RAL) remain below an acceptable threshold.
- We continue to work with the Perfsonar network test system to understand some anomalies seen. The initial set-up was on virtual machines; dedicated hardware has now been obtained to run Perfsonar.
- We are currently patching all Grid Services nodes that run a BDII.
Ongoing Disk Server Issues
- Gdss456 (AtlasDataDisk) failed with a read only file system on Wednesday 28th September. Fabric Team have completed their work on this server and it is awaiting re-deployment.
- As reported in recent weeks, we are seeing a high number of 'SMART' errors reported by a particular batch of disk servers. Most of these are spurious and resolved by an updated version of the disk controller firmware. This update has been successfully applied to D0T1 disk servers and will be rolled out to the affected D1T0 disk servers over the next week or two.
Notable Changes made this last week
- A firmware update was applied to the RAL Firewall yesterday morning (Tuesday 15th Nov) which addresses a problem of some packet loss.
Forthcoming Work & Interventions
- Tuesday 22nd November. Failover of main RAL link (from Reading link to London link) during maintenance. Should be transparent.
- Tuesday 29th November. Failover of OPN link to backup during maintenance. Should be transparent.
Declared in the GOC DB
- None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced:
- Update, in rolling manner, the Site and Top-BDII nodes to the UMD release.
- Regular Oracle "PSU" patches are pending.
- Update the disk controller firmware on D1T0 nodes in the batch of servers reporting spurious SMART errors.
- There are also plans to move part of the cooling system onto the UPS supply that may require a complete outage (including systems on UPS).
- Switch Castor and LFC/FTS/3D to new Database Infrastructure. This will only proceed once the problem that caused the cancellation of the first stage of this work last week is understood and fixed.
- Networking change required to extend range of addresses that route over the OPN.
- Address permissions problem regarding Atlas User access to all Atlas data.
- Replace hardware running Castor Head Nodes (aimed for end of year).
Entries in GOC DB starting between 9th to 16th November 2011.
There was 1 unscheduled entry in the GOC DB for this last week, which was for the problems on the Alice xrootd manager.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Whole Site | SCHEDULED | OUTAGE | 15/11/2011 07:55 | 15/11/2011 09:00 | 1 hour and 5 minutes | Short outage during work on site network link. We anticipate around a 30 minute break in connectivity but have allowed some contingency. |
srm-alice | UNSCHEDULED | OUTAGE | 14/11/2011 09:30 | 14/11/2011 11:25 | 1 hour and 55 minutes | Investigating problem with xrootd manager. |
Open GGUS Tickets
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
76023 | Red | very urgent | in progress | 2011-11-05 | 2011-11-16 | Camont | LB query failed |
75395 | Red | urgent | in progress | 2011-10-17 | 2011-11-15 | T2K | WMS 'jumping' (Ticket now with L&B support) |
74353 | Red | very urgent | waiting for reply | 2011-09-16 | 2011-11-07 | Pheno | Proxy not renewing properly from WMS |
68853 | Red | less urgent | On hold | 2011-03-22 | 2011-11-07 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s) |
68077 | Red | less urgent | in progress | 2011-02-28 | 2011-09-20 | | Mandatory WLCG InstalledOnlineCapacity not published |
64995 | Red | less urgent | in progress | 2010-12-03 | 2011-09-20 | | No GlueSACapability defined for WLCG Storage Areas |