Tier1 Operations Report 2013-05-15


RAL Tier1 Operations Report for 15th May 2013

Review of Issues during the week 8th to 15th May 2013.
  • There have been two occasions during the last week when the OPN link to CERN has failed over to the backup route (Wednesday 8 May and around midnight on Saturday/Sunday 11/12 May). In each case the link switched back to the primary route after two to three hours, and there was no operational impact.
  • A normally routine swap of a failed fan in a disk array took down the standby Castor databases for a while last Friday (10 May). This did not affect operations.
  • There has been a high rate of outbound traffic saturating the uplink (currently 10Gbit) for the last couple of days. Investigations show this is predominantly Atlas traffic to many different sites.
Resolved Disk Server Issues
  • None.
Current operational status and issues
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • The problem of LHCb and Atlas jobs failing due to long job set-up times remains, and investigations continue.
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • We are participating in xrootd federated access tests for Atlas.
  • A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
  • There is an outstanding problem (and GGUS ticket) affecting the certificate on the MyProxy server.
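The MyProxy certificate issue is tracked in GGUS ticket 92266 (listed below). As a quick aid when following it up, a minimal sketch for checking the subject and validity dates of a host certificate is given here; it assumes the Python 'cryptography' package and the conventional /etc/grid-security/hostcert.pem location, both of which are illustrative rather than a record of the actual configuration.

  # Minimal sketch: print the subject and validity dates of a host certificate.
  # The /etc/grid-security/hostcert.pem path is the conventional location and
  # is used here only as an illustration.
  import datetime
  from cryptography import x509   # pip install cryptography

  with open("/etc/grid-security/hostcert.pem", "rb") as f:
      cert = x509.load_pem_x509_certificate(f.read())

  print("Subject:   ", cert.subject.rfc4514_string())
  print("Not before:", cert.not_valid_before)
  print("Not after: ", cert.not_valid_after)
  if cert.not_valid_after < datetime.datetime.utcnow():
      print("The certificate has expired.")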
Ongoing Disk Server Issues
  • None.
Notable Changes made this last week
  • On Wednesday (8 May) the Castor primary and standby databases were swapped over and Oracle Data Guard was re-established between them. (A verification sketch follows this list.)
  • On Thursday (9 May) seven new disk servers (630TB) were added to AtlasDataDisk.
  • A further six disk servers have been added to LHCbDst today (Wed 15 May).
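For reference, a minimal sketch of how the database roles can be confirmed after a Data Guard switchover such as the one on 8 May is given below. It uses the cx_Oracle module to query v$database on each side; the connection strings and credentials are placeholders, not the actual Castor database details.

  # Minimal sketch: report the Data Guard role of each database after a switchover.
  # The DSNs and the credentials are placeholders (hypothetical), not real values.
  import cx_Oracle

  PLACEHOLDER_DSNS = ("castor-primary.example.ac.uk/castor",
                      "castor-standby.example.ac.uk/castor")

  for dsn in PLACEHOLDER_DSNS:
      conn = cx_Oracle.connect("sys", "CHANGE_ME", dsn, mode=cx_Oracle.SYSDBA)
      cur = conn.cursor()
      cur.execute("SELECT database_role, switchover_status FROM v$database")
      role, status = cur.fetchone()
      print(dsn, "->", role, "/", status)
      conn.close()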
Declared in the GOC DB
  • None.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Re-establishing the paired (2*10Gbit) link to the UKLight router. (Aiming to do this in the next few weeks.)
  • Tuesday 21st May - Planned networking intervention at RAL.
  • The blocking issue regarding the Castor 2.1.13 upgrade has been resolved and the scheduling of this upgrade will proceed. (The non-Tier1 'Facilities' Castor instance has already been successfully upgraded.)

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update the core Tier1 network and change the connection to the site and OPN, including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Grid Services
    • Testing of alternative batch systems (SLURM, Condor).
    • An upgrade of the one remaining EMI-1 component (the UI) is being planned.
  • Fabric
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
  • Infrastructure:
    • Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
    • Electrical safety check. This will require significant downtime (most likely 2 days), during which the above infrastructure issues will also be addressed.
Entries in GOC DB starting between 8th and 15th May 2013.

There was one unscheduled outage during the last week, for lcgce12 (the CE for the test SL6 queue).

Service Scheduled? Outage/At Risk Start End Duration Reason
lcgce12.gridpp.rl.ac.uk UNSCHEDULED OUTAGE 10/05/2013 16:00 10/05/2013 16:51 51 minutes HW failure
All Castor & Batch (CEs) SCHEDULED OUTAGE 08/05/2013 10:00 08/05/2013 12:00 2 hours Stop of Castor storage system while primary and standby databases are switched over. During the stop no batch jobs will be started. Batch work already running may be paused (depending on the VO).


Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
94049 Green Urgent In Progress 2013-05-14 2013-05-15 OPS NAGIOS *eu.egi.sec.Argus-EMI-1* failed on lcgargus01.gridpp.rl.ac.uk@RAL-LCG2
93870 Red Less Urgent In Progress 2013-05-06 2013-05-07 CMS T1_UK_RAL squid upgrade
93149 Red Less Urgent On Hold 2013-04-05 2013-04-13 Atlas RAL-LCG2: jobs failing with " cmtside command was timed out"
92266 Red Less Urgent Waiting for Reply 2013-03-06 2013-04-16 Certificate for RAL myproxy server
91658 Red Less Urgent On Hold 2013-02-20 2013-04-03 LFC webdav support
86152 Red Less Urgent On Hold 2012-09-17 2013-03-19 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
08/05/13 91.7 84.7 91.7 84.7 84.7 Scheduled Castor outage for the switch of the primary / standby databases.
09/05/13 100 100 100 100 100
10/05/13 100 100 100 100 100
11/05/13 100 100 98.2 100 100 SRM Put test failed with zero number of replicas.
12/05/13 100 100 100 100 100
13/05/13 100 94.0 100 100 100 Tests failed during restart of pbs_server (batch server).
14/05/13 100 100 100 100 100
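(For context on how these figures relate to the outages above: the scheduled two-hour Castor stop on 08/05/13 leaves 22 available hours out of 24, i.e. 22/24 ≈ 91.7%, which matches the OPS figure for that day.)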