Tier1 Operations Report 2013-02-27


RAL Tier1 Operations Report for 27th February 2013

Review of Issues during the week 20th to 27th February 2013.
  • Overnight Wed/Thu (20/21 Feb) there was a problem with the Castor tape robot. It was resolved during the following day. There was no significant operational impact.
  • On Saturday evening (23rd Feb) there was a problem with the Atlas Castor instance that lasted a few hours and was fixed by the Castor on-call.
  • There was a planned network intervention yesterday (Tuesday) morning, for which a 'warning' was scheduled in the GOC DB and the FTS drained. Rather than the expected two short (few-minute) breaks in connectivity, the external network connection was down for around 30 minutes. Apart from the planned stop of the FTS, all services carried on running normally within the site.
Resolved Disk Server Issues
  • GDSS447 (Atlas DataDisk) failed with a read-only filesystem in the early hours of Monday (25th Feb). It was returned to production at the end of that afternoon.
Current operational status and issues
  • There have been intermittent problems over the past few weeks with the start rate for batch jobs. These are still being investigated.
  • We are investigating a higher job failure rate for LHCb and Atlas. This appears to be caused by job set-ups taking a long time (a rough timing sketch follows this list).
  • High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 generation of disk servers (around 3PB of storage).
  • Investigations are ongoing into the asymmetric bandwidth to/from other sites. We are seeing some poor outbound rates, a problem which disappears when we are not loading the network.
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • We are participating in xrootd federated access tests for Atlas.
  • A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
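A quick way to check whether the slow set-up phase (rather than the job payload itself) is behind the higher failure rate is to time the set-up step on its own on an affected worker node. The sketch below is illustrative only: the set-up command is a placeholder standing in for whatever environment script the Atlas or LHCb jobs actually source, and the 300-second threshold is an arbitrary example rather than an agreed limit.

  import subprocess
  import sys
  import time

  # Placeholder: substitute the real experiment set-up script run by the jobs.
  SETUP_COMMAND = ["/bin/bash", "-c", "source /path/to/experiment_setup.sh"]
  THRESHOLD_SECONDS = 300  # arbitrary cut-off used to flag a slow set-up

  start = time.time()
  return_code = subprocess.call(SETUP_COMMAND)
  elapsed = time.time() - start

  print("set-up exited with code %d after %.1f seconds" % (return_code, elapsed))
  if elapsed > THRESHOLD_SECONDS:
      sys.exit("WARNING: set-up took longer than %d seconds" % THRESHOLD_SECONDS)

Run repeatedly, this gives a simple distribution of set-up times that can be compared across worker node generations or against the batch start-rate problem noted above.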
Ongoing Disk Server Issues
  • GDSS594 (GenTape) remains unavailable; it will re-run acceptance testing before being considered for return to service.
Notable Changes made this last week
  • On Friday (22nd Feb) a minor change was made to the FTS configuration for some channels (mainly from UK Tier2s to us) in response to a low level of failures caused by a short timeout (an illustrative check follows this list).
  • During the last week a number of tape servers have been upgraded to Castor 2.1.13-9.
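The reasoning behind the FTS channel change above, that a short timeout was producing a low level of spurious failures, can be illustrated with a simple check over transfer records. The sketch below uses made-up (duration, succeeded) pairs purely for illustration; in practice the equivalent information would come from the FTS transfer logs. Failures whose durations cluster at or just above the configured timeout point to the timeout, rather than the transfer itself, as the cause.

  # Illustrative only: made-up transfer records as (duration_seconds, succeeded).
  OLD_TIMEOUT = 180  # hypothetical old channel timeout, in seconds

  transfers = [
      (42.0, True), (95.0, False), (121.7, True), (160.3, True),
      (178.5, True), (180.1, False), (180.4, False), (181.2, False),
  ]

  failures = [duration for duration, ok in transfers if not ok]
  near_timeout = [d for d in failures if OLD_TIMEOUT <= d <= OLD_TIMEOUT + 10]

  print("%d failures out of %d transfers" % (len(failures), len(transfers)))
  print("%d of the failures ended within 10s of the %ds timeout"
        % (len(near_timeout), OLD_TIMEOUT))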
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Tuesday 12th March: Outage for replacement of core network switch (C300).
  • A program of updating the disk controller firmware in the 2011 Clustervision batch of disk servers is being undertaken.
  • The number of nodes behind the SL6 trial batch queue will be increased (by around 450 job slots) by adding in new CPU nodes.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Replace central switch (C300). (Anticipated for a Tuesday during March). This will:
      • Improve the stack 13 uplink.
      • Change the network trunking as part of the investigation into (and possible fix for) the asymmetric data rates.
    • Update the core Tier1 network and change the connection to the site and the OPN, including:
      • Install a new routing layer for the Tier1.
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
    • Addition of caching DNSs into the Tier1 network (a lookup-latency sketch follows this listing).
  • Infrastructure:
    • Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
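On the caching DNS item under Networking above: the aim of local caching resolvers is to keep name-lookup latency low inside the Tier1 network and to reduce dependence on the site resolvers. A minimal sketch of the kind of before/after measurement involved is given below; it uses only the Python standard library, the hostnames are placeholders, and it simply times getaddrinfo() calls against whatever resolver the host is currently configured to use.

  import socket
  import time

  # Placeholder hostnames; substitute names that Tier1 services actually resolve.
  HOSTNAMES = ["host-a.example.org", "host-b.example.org"]

  for name in HOSTNAMES:
      for attempt in (1, 2):
          start = time.time()
          try:
              socket.getaddrinfo(name, None)
              status = "ok"
          except socket.gaierror as err:
              status = "failed (%s)" % err
          elapsed_ms = (time.time() - start) * 1000.0
          print("%-25s attempt %d: %s in %.1f ms" % (name, attempt, status, elapsed_ms))

Running this once against the current resolvers and again after pointing the host at a local caching DNS should show the repeat lookups returning much faster from the cache.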


Entries in the GOC DB starting between 20th and 27th February 2013.

There were no unscheduled entries in the GOC DB for last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole site | SCHEDULED | WARNING | 26/02/2013 07:30 | 26/02/2013 08:30 | 1 hour | At Risk around two short (few-minute) breaks in external connectivity to the RAL Tier1. Will drain FTS for an hour beforehand as a precaution.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
91687 | Amber | Less Urgent | In Progress | 2013-02-21 | 2013-02-21 | epic | Support for epic.vo.gridpp.ac.uk VO on WMS
91658 | Amber | Less Urgent | In Progress | 2013-02-20 | 2013-02-22 | | LFC webdav support
91146 | Red | Urgent | In Progress | 2013-02-04 | 2013-02-12 | Atlas | RAL input bandwidth issues
91029 | Red | Very Urgent | On Hold | 2013-01-30 | 2013-02-27 | Atlas | FTS problem in querying jobs
90528 | Red | Less Urgent | Waiting Reply | 2013-01-17 | 2013-02-19 | SNO+ | WMS not assigning jobs to Sheffield
90151 | Red | Less Urgent | Waiting Reply | 2013-01-08 | 2013-02-27 | NEISS | Support for NEISS VO on WMS
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-01-16 | | Correlated packet-loss on perfSONAR host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
20/02/13 | 100 | 100 | 99.2 | 100 | 100 | Single user timeout, failure to put a file. Investigations show that the problem was within Castor, although it is not understood in detail.
21/02/13 | 100 | 100 | 97.4 | 100 | 100 | A few failures of the SRM test. Investigations suggest that a couple of them were due to the test itself.
22/02/13 | 100 | 100 | 97.3 | 100 | 100 | A few failures of the SRM 'Put' test.
23/02/13 | 100 | 100 | 93.8 | 100 | 100 | Problem with the Atlas Castor instance fixed by the on-call.
24/02/13 | 100 | 100 | 100 | 100 | 100 |
25/02/13 | 100 | 100 | 100 | 100 | 100 |
26/02/13 | 100 | 100 | 96.3 | 95.9 | 95.8 | Failures of SRM tests triggered by the scheduled network intervention.