Tier1 Operations Report 2011-11-23

RAL Tier1 Operations Report for 23rd November 2011

Review of Issues during the week 16th to 23rd November 2011.

  • On Thursday 17th an internal network problem at RAL made RAL-LCG2 unavailable from 07:50 to 09:00. This was caused by the failure of a power supply for one of the main RAL routers. Once this recovered there was a problem with the LHCb Castor instance, where the LSF scheduler within Castor had lost contact with some of its disk servers. This was resolved at around 14:45 (and the other Castor instances were checked for the same failure). There were also around 300 batch job failures at the time, although it later became clear that more batch jobs had probably run into trouble; this was only fully resolved some days later (see below).
  • Late in the evening on Monday (21st Nov) a problem on the Oracle database affected the Atlas & LHCb Castor instances for an hour or two shortly before midnight. This was traced to an Oracle bug and was resolved by the on-call team.
  • On Tuesday morning (22nd) there was scheduled maintenance work on both the main RAL link (to Reading) and the OPN link to CERN. Both failed over to their backup routes for a while between 07:00 and 08:00.
  • During the second half of Tuesday 22nd we were not starting enough batch jobs and the farm was partly empty. This was traced to some stuck jobs. Attempts to clear these out late on Tuesday afternoon helped but did not resolve the problem. Further work on Wednesday morning (today) has resolved the issue. The main cause appears to be linked to jobs that started on Thursday (17th) and was possibly triggered by the networking problem of that day.

Resolved Disk Server Issues

  • None.

Current operational status and issues.

  • The slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance) continue to be investigated. Improvements to rates to/from the RAL Tier1 have been made and now only two channels (NIKHEF to RAL; Birmingham to RAL) remain below an acceptable threshold.
  • We continue to work with the Perfsonar network test system to understand some anomalies that have been seen. The initial set-up was on virtual machines; hardware has now been obtained to run Perfsonar.

Ongoing Disk Server Issues

  • Gdss456 (AtlasDataDisk) failed with a read-only file system on Wednesday 28th September. The Fabric Team have completed their work on this server and it is awaiting re-deployment.
  • On Tuesday morning gdss375 (AtlasTape D0T1) had two failed drives and was taken out of production.

Notable Changes made this last week

  • Started roll-out of the UMD version of the Top BDII (Site BDII roll-out also under way).
  • Firmware update of all remaining disk servers in the affected batch to resolve spurious 'SMART' errors.

Forthcoming Work & Interventions

  • Tuesday 29th November. Failover of OPN link to backup during maintenance. Should be transparent.

Declared in the GOC DB

  • None

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced:

  • Update, in rolling manner, the Site and Top-BDII nodes to the UMD release.
  • Regular Oracle "PSU" patches are pending.
  • There are also plans to move part of the cooling system onto the UPS supply. The use of temporary power arrangements means this should no longer require downtime of computer systems.
  • Switch Castor and LFC/FTS/3D to the new Database Infrastructure. The problems that caused the postponement of this migration are now understood and, apart from some detailed re-configuration, the migration should be ready to go at the start of the new year.
  • Networking change required to extend range of addresses that route over the OPN.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Replace hardware running Castor Head Nodes (aimed for end of year).
  • Updates to the RAL DNS infrastructure (replacing DNS servers).

Entries in GOC DB starting between 16th and 23rd November 2011.

There was 1 unscheduled entry in the GOC DB for this last week, which was for the problem on the site network.

Service Scheduled? Outage/At Risk Start End Duration Reason
Whole Site UNSCHEDULED OUTAGE 17/11/2011 07:45 17/11/2011 09:00 1 hour and 15 minutes Site Outage following network failure. (GOCDB item added retrospectively).

Open GGUS Tickets

GGUS ID Level Urgency State Creation Last Update VO Subject
76750 Green Very urgent In progress 2011-11-23 2011-11-23 T2K Jobs get aborted due to proxy(?) issues
76735 Green Urgent In progress 2011-11-22 2011-11-23 vo.londongrid.ac.uk lcglb02 GSS error
76564 Amber Very urgent In progress 2011-11-17 2011-11-18 geant4 jobs abort on lcgce05.gridpp.rl.ac.uk
76521 Yellow Less urgent Waiting for reply 2011-11-16 2011-11-22 snoplus Support for snoplus.snolab.ca
75395 Red Urgent Waiting for reply 2011-10-17 2011-11-22 T2K WMS 'jumping' (Ticket now with L&B support)
74353 Red Very urgent Waiting for reply 2011-09-16 2011-11-22 Pheno Proxy not renewing properly from WMS
68853 Red Less urgent On hold 2011-03-22 2011-11-07 Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)
68077 Red Less urgent In progress 2011-02-28 2011-09-20 Mandatory WLCG InstalledOnlineCapacity not published
64995 Red Less urgent In progress 2010-12-03 2011-09-20 No GlueSACapability defined for WLCG Storage Areas