Tier1 Operations Report 2013-03-13


RAL Tier1 Operations Report for 13th March 2013

Review of Issues during the week 6th to 13th March 2013.
  • On Wednesday & Thursday (6/7 March) there were problems on the RAL network. One "Outage" plus two "Warnings" were declared in the GOC DB. The problems caused intermittent breaks in the Tier1's connectivity - mainly to the outside world but also to the rest of RAL. The core networking team found and fixed these problems on the Thursday afternoon.
  • On Monday (11th) there was a problem with tape migration traced to a single corrupt file on a disk server. The data loss has been reported to T2K.
  • The planned network intervention yesterday (Tues 12th March) overran significantly. A total outage of around 12 hours resulted (almost double that planned). Furthermore, following problems during the work, the network uplink to the UKLight router is now running on a single 10Gbit link rather than a pair of such links.
Resolved Disk Server Issues
  • GDSS594 (GenTape), which failed a few weeks ago, has now been retired from service and will be used for spares.
Current operational status and issues
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • There have been intermittent problems over the past few weeks with the start rate for batch jobs. A script has been introduced to regularly check for this and take action to minimise its effects (a sketch of this kind of check is shown after this list).
  • We are investigating a higher job failure rate for LHCb and Atlas. This appears to be caused by job set-ups taking a long time. Yesterday (12th Mar) we made a change to run jobs re-niced, although initial results suggest this has not fixed the problem.
  • Investigations are ongoing into asymmetric bandwidth to/from other sites. We are seeing some poor outbound rates - a problem which disappears when we are not loading the network. We are awaiting confirmation of the effect of yesterday's changes (12th Mar).
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • We are participating in xrootd federated access tests for Atlas.
  • A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
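
The start-rate check referred to above is not reproduced here; the following is a minimal sketch of the kind of check such a script could perform, assuming a Torque-style batch system where `qstat` reports a job state column. The threshold, state-file location and corrective action are illustrative assumptions, not the production values.

```python
#!/usr/bin/env python
"""Periodic check of the batch job start rate (illustrative sketch only).

Assumes a Torque-style batch system where `qstat` lists jobs with a
state column ('R' = running). The threshold, state-file path and the
corrective action are assumptions, not the values used in production.
"""
import subprocess

STATE_FILE = "/var/tmp/running_jobs.prev"   # assumed location
MIN_NEW_STARTS = 10                          # assumed per-interval threshold


def running_job_ids():
    """Return the set of job IDs currently in the running ('R') state."""
    out = subprocess.check_output(["qstat"]).decode()
    ids = set()
    for line in out.splitlines()[2:]:        # skip the two header lines
        fields = line.split()
        if len(fields) >= 5 and fields[4] == "R":
            ids.add(fields[0])
    return ids


def main():
    current = running_job_ids()
    try:
        with open(STATE_FILE) as f:
            previous = set(f.read().split())
    except IOError:
        previous = set()

    newly_started = current - previous
    if previous and len(newly_started) < MIN_NEW_STARTS:
        # Placeholder for the corrective action taken by the real script,
        # e.g. alerting operators or prodding the scheduler.
        print("WARNING: only %d jobs started since last check" % len(newly_started))

    with open(STATE_FILE, "w") as f:
        f.write("\n".join(sorted(current)))


if __name__ == "__main__":
    main()
```
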
Ongoing Disk Server Issues
  • GDSS519 (GenTape) was put into draining mode following the discovery of a single corrupt file. Once all remaining files had been migrated it was taken out of production to be checked out (a checksum-check sketch follows below).
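
As an illustration of how a corrupt file such as the one found on GDSS519 can be detected, the sketch below compares a file's adler32 checksum against the value expected from the storage catalogue. The file path, the expected checksum and the way it would be obtained from Castor are placeholders, not the actual tooling used.

```python
#!/usr/bin/env python
"""Illustrative sketch: compare a file's adler32 checksum against the
value recorded in the storage catalogue to detect on-disk corruption.
The path and expected checksum are placeholders, not Castor tooling."""
import zlib


def adler32_of(path, blocksize=1024 * 1024):
    """Compute the adler32 checksum of a file, reading it in blocks."""
    value = 1  # adler32 seed
    with open(path, "rb") as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            value = zlib.adler32(block, value)
    return value & 0xFFFFFFFF


def is_corrupt(path, expected_hex):
    """Return True if the on-disk checksum differs from the catalogue value."""
    return adler32_of(path) != int(expected_hex, 16)


if __name__ == "__main__":
    # Example usage with placeholder values.
    print(is_corrupt("/exportstage/somefile", "0a1b2c3d"))
```
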
Notable Changes made this last week
  • The core network switch in the Tier1 Network has been replaced (Tuesday 12th March) providing more ports for network expansion.
  • During the network change (Tuesday 12th March) the uplink to one of the network stacks (stack 13), serving the SL09 disk servers (~3PB of storage), was doubled in capacity (from 2*10Gbit to 4*10Gbit) to resolve a bottleneck.
  • Batch queue parameters were modified to run jobs on the worker nodes re-niced (an illustrative sketch follows after this list).
  • The RAL site BDIIs have been upgraded to EMI-2.
  • New EMI-2 WMS nodes (lcgwms04, lcgwms05, lcgwms06) have been added into production; the old EMI-1 ones (lcgwms01, 02, 03) will be drained and retired shortly (by the end of March at the latest).
  • The Castor client software has been upgraded to version 2.1.13 on one batch of worker nodes.
  • Updating of the disk controller firmware on the Clustervision '11 batch of disk servers is ongoing.
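
The re-nicing change above was made through the batch system configuration; purely as an illustration of the idea, the sketch below runs a job payload at reduced CPU priority via a wrapper. The niceness value and the wrapper approach are assumptions, not the configuration actually deployed.

```python
#!/usr/bin/env python
"""Illustrative job wrapper: run the job payload re-niced (at lower CPU
priority). The niceness value is an assumption; the real change was made
through the batch system configuration rather than a wrapper like this."""
import os
import subprocess
import sys

NICENESS = 10  # assumed niceness increment


def run_reniced(command):
    """Run `command` with increased niceness so system processes on the
    worker node are favoured over the job payload."""
    def lower_priority():
        os.nice(NICENESS)
    return subprocess.call(command, preexec_fn=lower_priority)


if __name__ == "__main__":
    # Usage: wrapper.py <payload command> [args...]
    sys.exit(run_reniced(sys.argv[1:]))
```
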
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • A change to the certificate used by the MyProxy server will be introduced on Monday (18th Mar).
  • A program of updating the disk controller firmware in the 2011 Clustervision batch of disk servers is ongoing.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Single link to the UKLight Router to be restored as a paired (2*10Gbit) link.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Upgrade of other EMI-1 components (APEL, UI) under investigation.
  • Infrastructure:
    • Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
    • Electrical safety check (will require some downtime).


Entries in the GOC DB starting between 6th and 13th March 2013.

There were five unscheduled entries in the GOC DB for last week. Three of these (one "Outage", two "Warnings") were for the RAL networking problems on Wednesday/Thursday 6/7 March. The other two were unscheduled extensions to a planned downtime to restructure the Tier1 network on 12th March.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole Site | UNSCHEDULED | OUTAGE | 12/03/2013 19:00 | 12/03/2013 21:00 | 2 hours | The main work on our network is over however it is taking a little time to restore services. Unfortunately it is therefore necessary to make a small further extension to our downtime.
Whole Site | UNSCHEDULED | OUTAGE | 12/03/2013 15:30 | 12/03/2013 19:00 | 3 hours and 30 minutes | Extending Outage as some problems encountered during the intervention to reconfigure the core of the Tier1's network.
Whole Site | SCHEDULED | OUTAGE | 12/03/2013 08:45 | 12/03/2013 15:30 | 6 hours and 45 minutes | Reconfiguration of core network within the RAL Tier1. Storage (Castor) services will be stopped. LFC stopped. FTS and Batch drained of active transfers/jobs. Other services (e.g. BDII) may see some short breaks in connectivity.
Whole Site | UNSCHEDULED | WARNING | 07/03/2013 10:00 | 07/03/2013 16:30 | 6 hours and 30 minutes | Some network issues ongoing and under investigation.
Whole Site | UNSCHEDULED | WARNING | 06/03/2013 15:00 | 07/03/2013 10:00 | 19 hours | At risk while recovering from network outage.
Whole Site | UNSCHEDULED | OUTAGE | 06/03/2013 09:30 | 06/03/2013 15:00 | 5 hours and 30 minutes | Network outage at the RAL tier1
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
92459 | Green | Less Urgent | In Progress | 2013-03-12 | 2013-03-13 | EPIC | LFC support for epic.vo.gridpp.ac.uk VO
92266 | Amber | Less Urgent | In Progress | 2013-03-06 | 2013-03-08 | | Certificate for RAL myproxy server
91974 | Red | Urgent | In Progress | 2013-03-04 | 2013-03-04 | | NAGIOS *eu.egi.sec.EMI-1* failed on lcgwms01.gridpp.rl.ac.uk@RAL-LCG2
91687 | Red | Less Urgent | In Progress | 2013-02-21 | 2013-03-06 | epic | Support for epic.vo.gridpp.ac.uk VO on WMS
91658 | Red | Less Urgent | In Progress | 2013-02-20 | 2013-02-22 | | LFC webdav support
91146 | Red | Urgent | In Progress | 2013-02-04 | 2013-03-05 | Atlas | RAL input bandwith issues
91029 | Red | Very Urgent | On Hold | 2013-01-30 | 2013-02-27 | Atlas | FTS problem in queryin jobs
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-03-06 | | correlated packet-loss on perfsonar host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
06/03/13 | 97.3 | 100 | 99.2 | 84.5 | 91.8 | Network problems affecting RAL
07/03/13 | 100 | 100 | 77.3 | 83.4 | 83.4 | Network problems affecting RAL
08/03/13 | 100 | 100 | 100 | 100 | 100 |
09/03/13 | 100 | 100 | 100 | 100 | 100 |
10/03/13 | 100 | 100 | 100 | 95.9 | 100 | Single SRM test failure - user timeout.
11/03/13 | 100 | 100 | 100 | 100 | 100 |
12/03/13 | 44.2 | 25.2 | 43.4 | 43.4 | 41.7 | Planned network update (C300 replacement) which overran