RAL Tier1 Operations Report for 10th April 2013
The Post Mortem review of the failure of disk server GDSS594 (GenTape) in February that led to the loss of 68 T2K files has been completed. This can be seen at:
https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20130219_Disk_Server_Failure_File_Loss
Review of Issues during the week 3rd to 10th April 2013.
- On Tuesday morning, 9th April, a planned intervention on the site networking ran into problems, and the RAL site was disconnected from the rest of the world for around 100 minutes. The intervention had previously been announced as a scheduled 'Warning' in the GOC DB and the FTS had been drained. Internal Tier1 services carried on OK during the external break.
- Two files were declared lost to Atlas following the failure of disk server GDSS454.
Resolved Disk Server Issues
- GDSS454 (AtlasDataDisk D1T0) failed with a read-only file system on Sunday 7th April. Following checks it was returned to service on Monday (8th). Two files that were being written at the time of the failure were declared lost to Atlas.
Current operational status and issues
- The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
- There have been intermittent problems over the past few weeks with the start rate for batch jobs. A script has been introduced to check for this regularly and take action to minimise its effects.
- The problem of LHCb and Atlas jobs failing due to long job set-up times remains; investigations continue.
- The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
- We are participating in xrootd federated access tests for Atlas.
- A test batch queue, with five SL6/EMI-2 worker nodes and its own CE, is in place.
- There is an outstanding problem (and GGUS ticket) affecting the certificate on the MyProxy server.
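The batch start-rate check mentioned above could be sketched roughly as follows. This is a minimal illustration only, not the actual RAL script: the function names, threshold, and window length are all assumptions.

```python
# Sketch of a batch job start-rate check (hypothetical; not the actual RAL script).
# Counts jobs started within a recent window and flags when the rate is too low,
# which would be the trigger for corrective action on the batch system.
import time

START_RATE_THRESHOLD = 10   # assumed: minimum job starts per check window
CHECK_WINDOW_SECONDS = 600  # assumed: look back over the last 10 minutes

def starts_in_window(start_times, now, window=CHECK_WINDOW_SECONDS):
    """Count job start timestamps (epoch seconds) falling inside the window."""
    return sum(1 for t in start_times if now - window <= t <= now)

def needs_action(start_times, now):
    """True when too few jobs have started recently, i.e. the start rate
    has stalled and the script should take its corrective action."""
    return starts_in_window(start_times, now) < START_RATE_THRESHOLD

# Example: only 3 starts in the last 10 minutes -> action needed.
now = time.time()
recent = [now - 60, now - 120, now - 300]
print(needs_action(recent, now))  # True
```

In practice such a check would run from cron and query the batch system (e.g. Torque/Maui job logs) for the start timestamps; the corrective action itself is site-specific and not shown here.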
Ongoing Disk Server Issues
Notable Changes made this last week
- Updating of disk controller firmware on the Clustervision '11 batch of disk servers is ongoing (the LHCb servers were done this week).
- Kernel/errata updates and removal of AFS software (as opposed to just disabling) being done across worker nodes.
- New disk servers deployed in production (540TB to AtlasDataDisk; 720TB to CMSDisk).
- One of the two batches of new worker nodes (the one from OCF) has been deployed into production.
- This evening (Wed 10th April, 18:00 - 23:59 BST) there is emergency maintenance affecting both the main and backup links to CERN. The site is declared as 'Warning'.
- Tomorrow (Thursday 11th April) Outage of LFC and FTS services (10:00 - 12:00). The Oracle database behind these services uses two disk arrays. One of the arrays is reporting errors and the database will be reconfigured (rebalanced) to move the data off the faulty array. FTS transfers will be drained before the outage.
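The drain-then-intervene pattern used for the FTS outage above can be sketched as follows. The helper names are illustrative assumptions; a real drain is performed through the FTS admin tools, with this kind of loop only confirming that in-flight transfers have completed before work starts.

```python
# Sketch of waiting for an FTS drain to complete before an outage
# (illustrative only; the count_active callable stands in for whatever
# query reports the number of in-flight transfers).
import time

def wait_for_drain(count_active, timeout=3600, poll_interval=60, sleep=time.sleep):
    """Poll count_active() until no transfers remain or the deadline passes.

    count_active : callable returning the number of in-flight transfers
    Returns True if the service drained within the timeout, else False.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if count_active() == 0:
            return True
        sleep(poll_interval)
    return count_active() == 0

# Example with a fake counter that reports 2, then 1, then 0 transfers:
remaining = [2, 1, 0]
drained = wait_for_drain(lambda: remaining.pop(0) if remaining else 0,
                         timeout=10, poll_interval=0, sleep=lambda s: None)
print(drained)  # True
```

Injecting the sleep function keeps the loop testable without real delays; the default of one hour matches the drain period quoted in the GOC DB entries below.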
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- One of the disk arrays hosting the LFC/FTS/3D databases has given some errors. An intervention to move the 'somnus' (LFC & FTS) data off this array is planned for tomorrow. A further intervention will be required on the array itself which will affect the Atlas 3D service.
- A program of updating the disk controller firmware in the 2011 Clustervision batch of disk servers is ongoing. (Alice disk servers in this batch remain to be done).
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Upgrade to version 2.1.13
- Networking:
- Single link to UKLight Router to be restored as paired (2*10Gbit) link.
- Update core Tier1 network and change connection to site and OPN including:
- Install new Routing layer for Tier1
- Change the way the Tier1 connects to the RAL network.
- These changes will lead to the removal of the UKLight Router.
- Addition of caching DNSs into the Tier1 network.
- Grid Services
- Upgrade of other EMI-1 components (APEL, UI) under investigation.
- Infrastructure:
- Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
- Electrical safety check (will require some downtime).
Entries in GOC DB starting between 3rd and 10th April 2013.
There were two unscheduled entries in the GOC DB for the last week: one outage (for the problematic network intervention) and one warning (for emergency maintenance on the CERN OPN links).
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole Site | SCHEDULED | WARNING | 10/04/2013 18:00 | 11/04/2013 00:00 | 6 hours | Emergency maintenance has been announced for both the main and backup OPN links RAL - CERN.
Whole Site | UNSCHEDULED | OUTAGE | 09/04/2013 07:45 | 09/04/2013 09:25 | 1 hour 40 minutes | Problem during planned network intervention broke connectivity to the site. (Retrospective addition to the GOC DB; the intervention was originally declared as a Warning.)
Whole Site | SCHEDULED | WARNING | 09/04/2013 07:30 | 09/04/2013 08:30 | 1 hour | At Risk around two short (few-minute) breaks in external connectivity to the RAL Tier1 required for a network upgrade. The FTS will be drained for an hour beforehand as a precaution.
Whole Site | UNSCHEDULED | WARNING | 03/04/2013 18:00 | 04/04/2013 00:00 | 6 hours | Emergency maintenance has been announced for both the main and backup OPN links RAL - CERN. No outage is expected during this maintenance; services are considered at risk only.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
93149 | Green | Less Urgent | On Hold | 2013-04-05 | 2013-04-08 | Atlas | RAL-LCG2: jobs failing with "cmtside command was timed out"
93136 | Yellow | Less Urgent | In Progress | 2013-04-05 | 2013-04-05 | EPIC | Problems downloading job output using RAL WMS (epic VO)
92266 | Red | Less Urgent | In Progress | 2013-03-06 | 2013-04-09 | | Certificate for RAL myproxy server
91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-04-03 | | LFC webdav support
91029 | Red | Very Urgent | On Hold | 2013-01-30 | 2013-02-27 | Atlas | FTS problem in querying jobs
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-03-19 | | correlated packet-loss on perfsonar host
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
20/03/13 | 100 | 100 | 100 | 100 | 100 |
03/04/13 | 100 | 100 | 100 | 100 | 99.3 | Job cancelled/purged.
04/04/13 | 100 | 100 | 99.2 | 95.9 | 100 | Atlas: single SRM test failure "User timeout". CMS: single SRM test failure "User timeout".
05/04/13 | 100 | 100 | 100 | 100 | 100 |
06/04/13 | 100 | 100 | 100 | 99.4 | 100 | Single SRM test failure "User timeout" at very end of day (main effect is on the 7th).
07/04/13 | 100 | 100 | 100 | 92.5 | 100 | Single SRM test failure "User timeout" at very end of day (main effect is on the 7th).
08/04/13 | 100 | 100 | 99.1 | 87.7 | 100 | Atlas: 1 * "could not open connection to srm-atlas.gridpp.rl.ac.uk". CMS: total of three SRM test failures; 1 * "could not open connection to srm-cms.gridpp.rl.ac.uk", 2 * "User timeout".
09/04/13 | 91.7 | 100 | 92.7 | 90.4 | 93.4 | Problem during planned central networking intervention disconnected site for around 100 minutes.