RAL Tier1 Operations Report for 19th December 2012

Review of Issues during the week

12th to 19th December 2012

On Tuesday (18th December) there was a failure of one of the site routers that took the entire Tier1 off-air at 06:45. The router was fixed and its configuration verified around 3 hours later. Following this there was a period of verifying various systems and connections within the Tier1, and the outage was ended in the GOC DB at 10:45. Some problems were reported with the batch system after this, and these were finally resolved around 15:00.

Resolved Disk Server Issues

GDSS443 (AtlasDataDisk - D1T0) failed with a read-only filesystem on Thursday 13th Dec. It was returned to production the next day. One disk was found to be faulty.
GDSS449 (AtlasDataDisk - D1T0) failed with a read-only filesystem on Sunday 16th Dec. It was returned to production the next day.

Current operational status and issues

We have seen an increasing rate of failures on one of the '08 batches of disk servers. A program of upgrading the disk controller firmware in this batch is under way.
The batch server process sometimes consumes excessive memory, something which is normally triggered by a network/communication problem with worker nodes. A test for this (with a re-starter) is in place.
On 12th/13th June the first stage of switching ready for the work on the main site power supply took place. One half of the new switchboard has been refurbished and was brought into service on 17 September. The work on the second half is over-running slightly, with an estimated completion date of 13th January. (The original date was 18th Dec.)
High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3 PB of storage).
Investigations are ongoing into asymmetric bandwidth to/from other sites. In particular, we are seeing some poor outbound rates, a problem which disappears when we are not loading the network.

Ongoing Disk Server Issues

GDSS447 (AtlasDataDisk - D1T0) failed with a read-only filesystem last night and is under investigation.

Notable Changes made this last week

On Monday (17th Dec) the Castor Information provider was upgraded to fix an issue where one of LHCb's paths was showing as undefined.
The post-mortem report for the power incident on 20th November has been prepared and is available at: RAL_Tier1_Incident_20121120_UPS_Over_Voltage

Declared in the GOC DB

None

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
Castor:
- Upgrade to version 2.1.13
Networking:
- Network trunking change as part of investigation (& possible fix) into asymmetric data rates.
- Improve the stack 13 uplink
- Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network.
- Update Spine layer for Tier1 network.
- Replacement of UKLight Router.
- Addition of caching DNSs into the Tier1 network.
Grid Services:
- Checking VO usage of, and requirement for, AFS clients from Worker Nodes.
Infrastructure:
- Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.

Entries in GOC DB starting between 12th and 19th December 2012

There were four unscheduled outages in the GOC DB for this period. Three were for the problem with the Atlas SRMs last week (Wed 12th Dec). The other was the site outage caused by the network router failure yesterday morning (18th Dec).

Service	Scheduled?	Outage/At Risk	Start	End	Duration	Reason
Whole site	UNSCHEDULED	OUTAGE	18/12/2012 06:45	18/12/2012 10:45	4 hours	Hardware failure in core site network has taken RAL Tier1 off-air.
srm-atlas.gridpp.rl.ac.uk	UNSCHEDULED	OUTAGE	12/12/2012 13:30	12/12/2012 14:57	1 hour and 27 minutes	Ongoing problem with Atlas SRM being investigated.
srm-atlas.gridpp.rl.ac.uk	UNSCHEDULED	OUTAGE	12/12/2012 11:45	12/12/2012 13:30	1 hour and 45 minutes	Ongoing problems with Atlas SRM.
srm-atlas.gridpp.rl.ac.uk	UNSCHEDULED	OUTAGE	12/12/2012 10:30	12/12/2012 11:45	1 hour and 15 minutes	There are problems with the Atlas SRM database.
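
As a cross-check on the durations listed in the table above, the following is a minimal Python sketch that recomputes each outage length from its start and end times and sums them per service. The function names (duration, total_by_service) and the OUTAGES list are illustrative assumptions, not part of any GOC DB tooling; the timestamps are copied by hand from the table.

```python
from datetime import datetime, timedelta

# Start/end times copied by hand from the GOC DB table (DD/MM/YYYY HH:MM).
OUTAGES = [
    ("Whole site", "18/12/2012 06:45", "18/12/2012 10:45"),
    ("srm-atlas.gridpp.rl.ac.uk", "12/12/2012 13:30", "12/12/2012 14:57"),
    ("srm-atlas.gridpp.rl.ac.uk", "12/12/2012 11:45", "12/12/2012 13:30"),
    ("srm-atlas.gridpp.rl.ac.uk", "12/12/2012 10:30", "12/12/2012 11:45"),
]

FMT = "%d/%m/%Y %H:%M"


def duration(start, end):
    """Return the length of one outage as a timedelta."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


def total_by_service(outages):
    """Sum outage durations per service name (hypothetical helper)."""
    totals = {}
    for service, start, end in outages:
        totals[service] = totals.get(service, timedelta()) + duration(start, end)
    return totals


if __name__ == "__main__":
    for service, total in total_by_service(OUTAGES).items():
        print(service, total)
```

The three Atlas SRM entries are contiguous (10:30 to 14:57 on 12th Dec), so their durations add up to 4 hours and 27 minutes; the whole-site outage is 4 hours.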

Open GGUS Tickets (Snapshot at time of meeting)


GGUS ID	Level	Urgency	State	Creation	Last Update	VO	Subject
89733	Red	Urgent	In Progress	2012-12-17	2012-12-18		RAL bdii giving out incorrect information
86152	Red	Less Urgent	On Hold	2012-09-17	2012-10-31		correlated packet-loss on perfsonar host

Availability Report


Day	OPS	Alice	Atlas	CMS	LHCb	Comment
12/12/12	100	100	80.6	100	100	Problems with Atlas SRM.
13/12/12	100	98.6	100	100	100	Timeout for the job exceeded.
14/12/12	100	100	100	100	100
15/12/12	100	100	100	100	100
16/12/12	100	100	100	95.9	100	Single SRM test failure "user timeout".
17/12/12	100	100	99.2	100	100	Single error while deleting test file.
18/12/12	71.2	76.0	63.7	64.7	87.5	Site Network problem (Router A failure) followed by some CE problems.
