Tier1 Operations Report 2013-03-06


RAL Tier1 Operations Report for 6th March 2013

Review of Issues during the week 27th February to 6th March 2013.
  • A quiet week operationally (although note the current issues section below). An emergency reboot of one of the site routers late on Tuesday afternoon (5th March) did not cause any operational problems.
Resolved Disk Server Issues
  • GDSS648 (LHCbDst) failed in the early hours of Sunday morning (3rd March). A faulty network card was replaced and the system returned to production around midday on Monday (4th March).
Current operational status and issues
  • This morning (Wednesday 6th March) intermittent network connectivity problems were seen; these are being investigated.
  • There have been intermittent problems over the past few weeks with the start rate for batch jobs. A script has been introduced to check regularly for this and take action to minimise its effects (a sketch of such a check appears after this list).
  • We are investigating a higher job failure rate for LHCb and Atlas, which appears to be caused by job set-up taking a long time. One option being investigated is running jobs re-niced (a second sketch after this list illustrates re-nicing).
  • High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3PB of storage). We anticipate resolving this during the intervention on 12th March.
  • Investigations are ongoing into asymmetric bandwidth to/from other sites. We are seeing some poor outbound rates - a problem which disappears when we are not loading the network. (Scheduled intervention on 12th March will progress this.)
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • We are participating in xrootd federated access tests for Atlas.
  • A test batch queue, with five SL6/EMI-2 worker nodes and its own CE, is in place.
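The start-rate check mentioned above is not described in detail in this report; the following is a minimal sketch of the idea only, assuming a Torque-style batch system whose plain "qstat" output can be parsed. The check interval, the threshold logic and the remedial action are all placeholders, not the production values.

    #!/usr/bin/env python
    # Hypothetical sketch of a batch start-rate watchdog; NOT the
    # production script. Assumes Torque-style "qstat" output in which
    # the job state is the second-to-last column of each data line.
    import subprocess
    import time

    CHECK_INTERVAL = 300  # seconds between checks (placeholder value)

    def job_counts():
        """Return (running, queued) job counts parsed from qstat output."""
        out = subprocess.check_output(["qstat"]).decode()
        running = queued = 0
        for line in out.splitlines():
            fields = line.split()
            if len(fields) >= 6:
                state = fields[-2]
                if state == "R":
                    running += 1
                elif state == "Q":
                    queued += 1
        return running, queued

    def take_remedial_action():
        """Placeholder for whatever the real script does, e.g.
        prodding or restarting the batch scheduler."""
        print("batch start rate appears stalled - intervening")

    prev_running, _ = job_counts()
    while True:
        time.sleep(CHECK_INTERVAL)
        running, queued = job_counts()
        # Crude proxy for the start rate: if jobs are queued but the
        # running count has not grown since the last check, assume
        # job starts have stalled.
        if queued > 0 and running <= prev_running:
            take_remedial_action()
        prev_running = running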
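Re-nicing, as mentioned in the LHCb/Atlas item, lowers the CPU scheduling priority of a job's processes so that slow set-up work competes less aggressively for the node. A minimal illustration (Python 3.3+ for os.setpriority; the niceness value of 10 is arbitrary, not a value taken from this report):

    import os

    def renice(pid, niceness=10):
        """Lower the scheduling priority of a process. 0 is the default
        niceness and 19 the lowest priority; 10 here is illustrative."""
        os.setpriority(os.PRIO_PROCESS, pid, niceness)

    # Example: re-nice the current process and show the result.
    renice(os.getpid())
    print(os.getpriority(os.PRIO_PROCESS, os.getpid()))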
Ongoing Disk Server Issues
  • GDSS594 (GenTape) remains unavailable; it will re-run acceptance testing before being considered for return to service.
Notable Changes made this last week
  • All remaining tape servers have now been upgraded to Castor 2.1.13-9.
  • The number of nodes behind the SL6 trial batch queue has been increased, with a few hundred job slots now available.
  • Disk controller firmware updates in the 2011 Clustervision batch of disk servers (ongoing).
Declared in the GOC DB
  • Site outage on Tuesday 12th March for replacement of core network switch (C300).
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • A program of updating the disk controller firmware in the 2011 Clustervision batch of disk servers is ongoing.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Replace the central switch (C300). (Now declared in the GOC DB for Tuesday 12th March.) This will:
      • Improve the stack 13 uplink.
      • Change the network trunking as part of the investigation into (and possible fix for) the asymmetric data rates.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
    • Addition of caching DNSs into the Tier1 network (a quick resolver-latency check is sketched after this list).
  • Grid Services
    • Upgrade of Site-BDII & WMS from EMI-1 to EMI-2 by end of March.
    • Upgrade of other EMI-1 components (APEL, UI) under investigation.
  • Infrastructure:
    • Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
    • Electrical safety check (will require some downtime).
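The caching-DNS item above is motivated by lookup latency: a resolver inside the Tier1 network can answer repeated queries from its cache without the query leaving the site. A rough way to observe the effect, once such a resolver is configured in /etc/resolv.conf, is to time repeated lookups of the same name (the hostname below is purely illustrative):

    import socket
    import time

    # With a caching resolver configured, the second and later lookups
    # of the same name are typically answered from the resolver's cache
    # and so return noticeably faster than the first.
    name = "lcgwms01.gridpp.rl.ac.uk"  # illustrative hostname
    for attempt in range(3):
        start = time.time()
        socket.gethostbyname(name)
        elapsed_ms = (time.time() - start) * 1000.0
        print("lookup %d took %.1f ms" % (attempt + 1, elapsed_ms))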


Entries in the GOC DB starting between 27th February and 6th March 2013.

There were no entries in the GOC DB starting in this period.

Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
91974 | Green | Urgent | In Progress | 2013-03-04 | 2013-03-04 | NAGIOS | *eu.egi.sec.EMI-1* failed on lcgwms01.gridpp.rl.ac.uk@RAL-LCG2
91687 | Red | Less Urgent | Waiting Reply | 2013-02-21 | 2013-03-06 | epic | Support for epic.vo.gridpp.ac.uk VO on WMS
91658 | Red | Less Urgent | In Progress | 2013-02-20 | 2013-02-22 | - | LFC webdav support
91146 | Red | Urgent | In Progress | 2013-02-04 | 2013-03-05 | Atlas | RAL input bandwith issues
91029 | Red | Very Urgent | On Hold | 2013-01-30 | 2013-02-27 | Atlas | FTS problem in queryin jobs
90528 | Red | Less Urgent | In Progress | 2013-01-17 | 2013-02-19 | SNO+ | WMS not assiging jobs to sheffield
86152 | Red | Less Urgent | In Progress | 2012-09-17 | 2013-03-06 | - | correlated packet-loss on perfsonar host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
27/02/13 | 100 | 100 | 97.5 | 95.9 | 100 | Atlas: a few SRM test timeouts; CMS: single SRM test timeout.
28/02/13 | 100 | -100 | 100 | 100 | 100 | Problem with ALICE monitoring.
01/03/13 | 100 | -100 | 100 | 95.9 | 100 | Problem with ALICE monitoring; CMS: single SRM test timeout.
02/03/13 | 100 | -100 | 100 | 100 | 100 | Problem with ALICE monitoring.
03/03/13 | 100 | -100 | 100 | 100 | 100 | Problem with ALICE monitoring.
04/03/13 | 100 | -100 | 100 | 100 | 100 | Problem with ALICE monitoring.
05/03/13 | 100 | 100 | 100 | 95.8 | 100 | CMS: single SRM test failure coincident with the network router reboot.