Latest revision as of 13:15, 30 November 2011

RAL Tier1 Operations Report for 30th November 2011

In the middle of last week we had a problem starting Alice batch work. Early Thursday morning (24th Nov) Alice found and fixed a problem with the Alice VO box that resolved the issue.
On Thursday afternoon (24th Nov) we found that the job start rate on the farm was very low. This was traced to batch jobs that were stuck in a queued state but already had a particular execution host allocated. That node was disabled in the batch system and the affected jobs were deleted, after which the job start rate returned to normal levels.

Gdss456 (AtlasDataDisk) had failed with a read-only file system on Wednesday 28th September. This server was replaced on 3rd November.
On Tuesday morning (22nd Nov) gdss375 (AtlasTape D0T1) had two failed drives and was taken out of production. It was returned to production on Friday morning (25th Nov).
On Tuesday (29th Nov) the monitoring reported a problem with the xroot daemon on gdss569 (LHCbDst D1T0). This was traced to the server's clock being out by 90 seconds, which was fixed later that day.

The slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance) continue to be investigated. Improvements to rates to/from the RAL Tier1 have been made and now only one channel (Birmingham to RAL) remains a cause for concern.
We continue to work with the Perfsonar network test system to understand some anomalies seen. The initial set-up was on virtual machines. Dedicated hardware has now been obtained to run Perfsonar.
CERN has reported issues with AFS callbacks to RAL worker nodes (29/11/2011). This is being investigated.

GDSS296 (CMSFarmRead - D0T1) was set "read-only" on Monday (28th Nov) and will be removed from production. This follows the daily "checksum-mismatch" checks, which flagged four files written the day before as corrupt.
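
For readers unfamiliar with this kind of check, the following is a minimal sketch of how a checksum-mismatch scan can work in principle. It is not the tool used at RAL: the use of Adler-32, the function names and the path-to-checksum dictionary are all assumptions made for illustration.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as an 8-digit hex string."""
    checksum = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            checksum = zlib.adler32(chunk, checksum)
    return format(checksum & 0xFFFFFFFF, "08x")


def find_mismatches(expected):
    """Return the paths whose current checksum differs from the recorded one.

    `expected` is a hypothetical dict mapping a file path to the checksum
    (hex string) recorded when the file was written.
    """
    return [
        path
        for path, recorded in expected.items()
        if adler32_of_file(path) != recorded.lower()
    ]
```

Any file returned by `find_mismatches` no longer matches the checksum recorded at write time and would be a candidate for investigation as corrupt.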

Saturday 10th December: Replacement of some DNS servers at RAL. These are not the servers mainly used by the Tier1. The two remaining DNS servers, which are mainly used by the Tier1, will be updated in January.

The following items are being discussed and are still to be formally scheduled and announced:

Regular Oracle "PSU" patches are pending.
There are also plans to move part of the cooling system onto the UPS supply. The use of temporary power arrangements means this should no longer require downtime of computer systems.
Switch Castor and LFC/FTS/3D to new Database Infrastructure. The problems that caused the postponement of this migration are now understood and, apart from some detailed re-configuration, the migration should be ready to go at the start of the new year.
Networking change required to extend the range of addresses that route over the OPN.
Address permissions problem regarding Atlas User access to all Atlas data.
Replace hardware running Castor Head Nodes (aimed for end of year).
Updates to the RAL DNS infrastructure (replacing DNS servers).

There were no entries in the GOC DB for the last week.


GGUS ID	Level	Urgency	State	Creation	Last Update	VO	Subject
76877	Green	Urgent	In progress	2011-11-29	2011-11-29	T2K	FTS transfers RALLCG2-VICTORIALCG2
76750	Green	Very Urgent	In progress	2011-11-23	2011-11-29	T2K	Jobs get aborted due to proxy(?) issues
76735	Green	Urgent	In progress	2011-11-22	2011-11-25	vo.londongrid.ac.uk	lcglb02 GSS error
76564	Amber	Very Urgent	Waiting for reply	2011-11-17	2011-11-29		geant4 jobs abort on lcgce05.gridpp.rl.ac.uk
75395	Red	Urgent	Unsolved	2011-10-17	2011-11-28	T2K	WMS 'jumping' (Set unsolved by L&B support.)
74353	Red	Very Urgent	Waiting for reply	2011-09-16	2011-11-22	Pheno	Proxy not renewing properly from WMS
68853	Red	Less Urgent	On hold	2011-03-22	2011-11-07		Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)
68077	Red	Less Urgent	In progress	2011-02-28	2011-09-20		Mandatory WLCG InstalledOnlineCapacity not published
64995	Red	Less Urgent	In progress	2010-12-03	2011-09-20		No GlueSACapability defined for WLCG Storage Areas