Tier1 Operations Report 2012-03-21


RAL Tier1 Operations Report for 21st March 2012

Review of Issues during the week 14th to 21st March 2012.

  • There was a significant outage of the Tier1 on Friday 16th March owing to problems on the Tier1 network. Network changes made while adding a new node appear to have caused a packet storm. The site routers disconnected the Tier1 and a number of our network stacks had problems; in particular, one stack failed to recover until a faulty switch was identified and removed. A post-mortem of this incident is underway. The incident started at 09:30 and was not completely over until 15:50.

Resolved Disk Server Issues

  • None

Current operational status and issues.

  • Work is ongoing to get to the root of the problems affecting the network link between the UKlight and SAR routers. The last break was over a week ago (on Tuesday 13th March). Various components in the affected routers have been changed, although it is too soon to say whether the problem is finally resolved.
  • There have been two incidents in which the FTS has failed since it was upgraded to version 2.2.8 on Tuesday 6th March. These occurred on Wednesday 14th and overnight on 16/17th March. They are actively under investigation, with an Oracle database patch applied this morning (21st March). These and other issues (including agents crashing) have been reported to the developers, who supplied updated RPMs; the FTS was patched just before the start of this meeting. As part of the investigations some of the FTS monitoring was switched off temporarily, as it was seen to add load to the FTS database.

Ongoing Disk Server Issues

  • GDSS392 (CMSTape D0T1) was taken out of production on Sunday evening (18th March). It is currently undergoing tests.

Notable Changes made this last week

  • All tape servers have now been updated to Castor 2.1.11-8 with the improved tape drivers.
  • All worker nodes now have the Castor 2.1.11 clients installed.
  • Castor SRMs have had the latest OS patches applied (Tuesday 20th March).
  • The disk array used in the standby Castor databases has been replaced with the one required for its final configuration (Tuesday/Wednesday 20/21 March).

Forthcoming Work & Interventions

  • The second batch of worker nodes is expected to go into production within a few weeks.
  • A further intervention on a power board supplied by the UPS will be needed. This very low risk work is planned for Tuesday 27th March.

Declared in the GOC DB

  • Wednesday 28th March - Upgrade of MyProxy to UMD version.

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. We are carrying out a significant amount of work during the current LHC stop.

  • Databases:
    • Regular Oracle "PSU" patches are pending for SOMNUS (LFC & FTS).
    • Switch LFC/FTS/3D to new Database Infrastructure.
    • Update LFC/FTS databases to Oracle 11.
  • Castor:
    • Update the Castor Information Provider (CIP) (Need to re-schedule.)
    • Move to use Oracle 11g (requires a minor Castor update.)
  • Networking:
    • Install new Routing & Spine layers for Tier1 network.
    • Main RAL network updates - early summer.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Updates of Grid Services (including WMS, LFC front ends) to EMI/UMD versions.

Entries in GOC DB starting between 14th and 21st March 2012.

There were eight unscheduled entries in the GOC DB for this last week. These relate to the Tier1 site outage on Friday (16th) and the problems with the FTS (both reported above).

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgfts, lfc.gridpp.rl.ac.uk | UNSCHEDULED | WARNING | 21/03/2012 10:00 | 21/03/2012 12:00 | 2 hours | Warning (At Risk) while applying patch to Oracle database to fix ongoing problem with FTS.
All Castor SRMs | SCHEDULED | WARNING | 20/03/2012 11:00 | 20/03/2012 14:00 | 3 hours | Application of OS patches to SRM nodes.
lcgfts | UNSCHEDULED | WARNING | 20/03/2012 09:30 | 20/03/2012 15:30 | 6 hours | Warning (At Risk) during investigation of ongoing problem on the FTS.
lcgfts | UNSCHEDULED | WARNING | 17/03/2012 10:00 | 19/03/2012 12:00 | 2 days, 2 hours | FTS at risk due to possible issues with the backend database.
lcgfts | UNSCHEDULED | OUTAGE | 17/03/2012 00:00 | 17/03/2012 10:00 | 10 hours | FTS downtime due to problems with the backend database.
Batch (All CEs) and srm-atlas, srm-cms, srm-lhcb | UNSCHEDULED | OUTAGE | 16/03/2012 13:00 | 16/03/2012 15:50 | 2 hours and 50 minutes | Following network problems many services are back. However, problems still affect Castor for Atlas, CMS and LHCb and batch services.
Whole Site | UNSCHEDULED | OUTAGE | 16/03/2012 11:00 | 16/03/2012 13:05 | 2 hours and 5 minutes | Site outage following network problems. Whilst we have largely recovered from the internal network problem we have to systematically check services.
Whole Site | UNSCHEDULED | OUTAGE | 16/03/2012 09:30 | 16/03/2012 11:00 | 1 hour and 30 minutes | Outage on whole site while we investigate network problems.
lcgfts.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 14/03/2012 15:00 | 14/03/2012 18:30 | 3 hours and 30 minutes | FTS downtime due to problems with the backend database.

Open GGUS Tickets

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
80471 | Green | Urgent | In Progress | 2012-03-21 | 2012-03-21 | Atlas | Error with credential in FTS transfers to/from UKI-LT2-QMUL
80119 | Green | Less Urgent | Waiting Reply | 2012-03-12 | 2012-03-21 | SNO+ | ROOT build failing
79428 | Red | Less Urgent | Waiting Reply | 2012-02-21 | 2012-03-19 | SNO+ | glite-wms-job aborted
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-03-12 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)