RAL Tier1 Operations Report for 13th February 2013

Review of Issues during the week 6th to 13th February 2013.

A low-level SRM problem caused Atlas to put RAL offline for brief periods.
Maintenance of the R89 machine room air conditioning was completed on 07/02/2013.

Resolved Disk Server Issues

Current operational status and issues

There have been intermittent problems over the past week with the start rate for batch jobs. This is being investigated.
There is a GGUS ticket for a problem seen by the FTS, which is caused by a fault within the Castor SRM.
The batch server process sometimes consumes excessive memory, which is normally triggered by a network/communication problem with worker nodes. A test for this is in place.
High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3 PB of storage).
Investigations are ongoing into asymmetric bandwidth to/from other sites. We are seeing some poor outbound rates - a problem which disappears when we are not loading the network.
The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service.)
A system has been set up for participation in xrootd federated access tests for Atlas.
A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place. It is currently being tested by Atlas.

Ongoing Disk Server Issues

gdss594 (GenTape) suffered a double drive failure last night (12/02/2013). Fabric are currently investigating.

Notable Changes made this last week

Declared in the GOC DB

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
Castor:
- Upgrade to version 2.1.13
Networking:
- Replace central switch (C300). (Tentative date 5th March, but Atlas would like it earlier.) This will:
  - Improve the stack 13 uplink.
  - Change the network trunking as part of investigation (& possible fix) into asymmetric data rates.
- Update core Tier1 network and change connection to site and OPN including:
  - Core networking has informed us that they need to reconfigure a core switch on 26/02/2013 between 07:30 and 08:30.
  - Install new Routing layer for Tier1
  - Change the way the Tier1 connects to the RAL network.
  - These changes will lead to the removal of the UKLight Router.
- Addition of caching DNSs into the Tier1 network.
Grid Services:
- Removal of AFS clients from Worker Nodes.
Infrastructure:
- Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.

Entries in GOC DB starting between 6th and 13th February 2013.

None

Open GGUS Tickets (Snapshot at time of meeting)


GGUS ID	Level	Urgency	State	Creation	Last Update	VO	Subject
91251	Red	top priority	In Progress	2013-02-07	2013-02-07	lhcb	CEs don't seem to be running jobs
91146	Red	Urgent	In Progress	2013-02-04	2013-02-12	Atlas	RAL input bandwith issues
91029	Red	Very Urgent	In Progress	2013-01-30	2013-02-11	Atlas	FTS problem in queryin jobs
90528	Red	Less Urgent	In Progress	2013-01-17	2013-02-04	SNO+	WMS not assiging jobs to sheffield
90151	Red	Less Urgent	Waiting Reply	2013-01-08	2013-02-04	NEISS	Support for NEISS VO on WMS
86152	Red	Less Urgent	On Hold	2012-09-17	2013-01-16		correlated packet-loss on perfsonar host

Availability Report


Day	OPS	Alice	Atlas	CMS	LHCb	Comment
06/02/13	100	100	100	100	100
07/02/13	100	100	100	100	100
08/02/13	100	100	100	100	100
09/02/13	100	100	99.2	100	100	User timeout, failure to put a file and subsequent failure to delete it.
10/02/13	100	100	100	100	100
11/02/13	100	100	100	100	100
12/02/13	100	100	100	100	100
