RAL Tier1 Operations Report for 31st October 2012

Review of Issues during the week 24th to 31st October 2012

During the afternoon of Tuesday 23rd Oct. one of the LHCb Castor headnodes showed signs of an impending significant hardware fault and was replaced with a hot spare before it failed. After the vendor fixed the hardware, the original system was swapped back in this morning (Wed 31st Oct.).
Around 04:00 on the morning of Thursday 25th Oct. a fault was reported on the Alice VO box (lcgvo-alice). This was fixed when staff arrived at work the next morning.
Some problems with the site firewall caused short breaks in connectivity through this route on both Monday and Tuesday mornings for around 10 to 15 minutes each time. The cause of this has been understood.
The primary OPN link to CERN failed and we automatically switched to the backup around 09:15 Tuesday morning (30th). The cause was a fibre break during road works in France. The problem was fixed and we reverted to the primary link around 19:00 the same day.

Resolved Disk Server Issues

None.

Current operational status and issues

On 12th/13th June the first stage of switching in preparation for the work on the main site power supply took place. The work on the two transformers is expected to take until 18th December and involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating with the other half. The work is running to schedule. One half of the new switchboard has been refurbished and was brought into service on 17 September.
High load observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3 PB of storage). The aim is to resolve this at the same time as the network outage on 13th November.
Investigations are ongoing (e.g. using perfSONAR) into asymmetric bandwidth to/from other sites; in particular, we are seeing some poor outbound rates.
A fault has been found in the card that connects the Tier1 to one of the main RAL routers ("Router A"); the card requires replacement (scheduled for 13th November).
The new EMI CREAM CEs are bedding in. Some intermittent SUM test failures are being followed up. Checks are being made for any remaining jobs that still arrive via the old gLite CEs.

Ongoing Disk Server Issues

None

Notable Changes made this last week

WMS02 updated to EMI v3.3.8. This completes the software updates of the three WMSs.
The routing of network packets back from North American Tier1s (BNL, Fermilab, TRIUMF) has been corrected to use the OPN rather than other production networks.
30th Oct - Castor GEN instance upgraded to version 2.1.12-10. This completes the Castor 2.1.12 upgrade.
Hyperthreading continues to run on one batch of worker nodes and will be rolled out on all suitable worker nodes once the CE changes have bedded in.
As stated before: CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
A test queue ("gridTest") is available with (currently) four worker nodes running EMI-2/SL5. In addition, a further ten nodes (one from each hardware generation/batch) installed with EMI-2/SL5 are running as part of the normal batch system.
Test instance of FTS version 3 available. The non-LHC VOs that use the existing service have been enabled on it and we are looking for one of the VOs to test it.

Declared in the GOC DB

None

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced.

13th November: Intervention on network router card. Aim to use this time to also improve the stack 13 uplink and possibly carry out further tests to find the cause of the poor outbound data rates.
20th November: Intervention required on the "Essential Power Board" and transformers. (An "At Risk").

Listing by category:

Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
Castor:
- None
Networking:
- Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network.
- Update Spine layer for Tier1 network.
- Replacement of UKLight Router.
- Addition of caching DNSs into the Tier1 network.
Grid Services:
- Enabling overcommit on WNs to make use of hyperthreading (will be implemented after the CE upgrades are complete).
- Migration to EMI software for worker nodes.
Infrastructure:
- Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers. (Scheduled for 20th November).
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.

Entries in GOC DB starting between 24th and 31st October 2012

There was one unscheduled entry (a warning) in the GOC DB for this period, declared when one of the LHCb Castor headnodes showed hardware errors shortly after the LHCb Castor upgrade.

Service	Scheduled?	Outage/At Risk	Start	End	Duration	Reason
srm-lhcb.gridpp.rl.ac.uk	SCHEDULED	WARNING	31/10/2012 11:00	31/10/2012 12:00	1 hour	Swapping one of the Castor headnodes back following repair after hardware failure.
castor GEN Instance (srm-alice, srm-biomed, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k)	SCHEDULED	OUTAGE	30/10/2012 08:00	30/10/2012 11:10	3 hours and 10 minutes	Upgrade of Castor GEN instance to Version 2.1.12-10.
srm-lhcb.gridpp.rl.ac.uk	UNSCHEDULED	WARNING	23/10/2012 16:30	24/10/2012 12:30	20 hours	At risk due to hardware fault on Castor headnode. Services are being moved to alternative hardware.
New EMI CREAM CEs: (lcgce01, lcgce02, lcgce04, lcgce10, lcgce11)	SCHEDULED	WARNING	23/10/2012 10:00	24/10/2012 12:00	1 day, 2 hours	post EMI-2 CREAM migration
Old gLite CREAM CEs (lcgce03, lcgce05, lcgce07, lcgce08, lcgce09)	SCHEDULED	OUTAGE	23/10/2012 09:00	30/11/2012 12:00	38 days, 4 hours	Replacement with EMI-2 CREAM nodes
lcgwms02.gridpp.rl.ac.uk	SCHEDULED	OUTAGE	21/10/2012 10:00	25/10/2012 10:20	4 days, 20 minutes	EMI WMS upgrade to v3.3.8
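
The Duration column can be cross-checked against the Start and End columns. The following is a minimal sketch, not part of the GOC DB tooling: the short labels and the helper names `duration_minutes` and `describe` are illustrative assumptions. It treats the timestamps as plain wall-clock values and ignores time zone changes, so it only uses rows that do not span the end of British Summer Time on 28th October 2012.

```python
from datetime import datetime

# Timestamp layout used in the table above (DD/MM/YYYY HH:MM).
FMT = "%d/%m/%Y %H:%M"

# Start/end pairs copied from the table; the labels are shorthand, not GOC DB names.
ENTRIES = {
    "srm-lhcb headnode swap": ("31/10/2012 11:00", "31/10/2012 12:00"),
    "Castor GEN upgrade": ("30/10/2012 08:00", "30/10/2012 11:10"),
    "srm-lhcb at risk": ("23/10/2012 16:30", "24/10/2012 12:30"),
    "lcgwms02 upgrade": ("21/10/2012 10:00", "25/10/2012 10:20"),
}


def duration_minutes(start, end):
    """Return the wall-clock difference between two timestamps in minutes."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


def describe(minutes):
    """Format a number of minutes as days, hours and minutes."""
    days, rem = divmod(minutes, 24 * 60)
    hours, mins = divmod(rem, 60)
    parts = []
    for value, unit in ((days, "day"), (hours, "hour"), (mins, "minute")):
        if value:
            parts.append(f"{value} {unit}{'' if value == 1 else 's'}")
    return ", ".join(parts) or "0 minutes"


if __name__ == "__main__":
    for label, (start, end) in ENTRIES.items():
        print(f"{label}: {describe(duration_minutes(start, end))}")
```

An entry that crosses the clock change would differ by one hour depending on whether wall-clock or elapsed time is counted, so a fuller check would need time-zone-aware timestamps.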

Open GGUS Tickets


GGUS ID	Level	Urgency	State	Creation	Last Update	VO	Subject
86690	Red	Urgent	In Progress	2012-10-03	2012-10-31	T2K	JPKEKCRC02 missing from FTS ganglia metrics
86152	Red	Less Urgent	In Progress	2012-09-17	2012-10-30		correlated packet-loss on perfsonar host
68853	Red	Less Urgent	In Progress	2011-03-22	2012-10-30	N/A	Retirement of SL4 and 32bit DPM Head nodes and Servers

Availability Report


Day	OPS	Alice	Atlas	CMS	LHCb	Comment
24/10/12	100	0	0	100	0	New EMI CEs not in tests for all VOs.
25/10/12	100	52.4	29.1	100	46.6	New EMI CEs appeared in tests during this day.
26/10/12	100	100	96.8	95.8	100	CMS - Single failure of SRM test. Atlas - appears spurious.
27/10/12	100	100	100	100	100
28/10/12	100	100	100	99.0	100	Single failure of SRM Put: User timeout over
29/10/12	100	100	100	100	100
30/10/12	100	100	98.6	100	100	Failures on both monitored CEs. (No compatible resources returned by BDII.)
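
For a quick per-VO summary of the week, the daily figures above can be averaged. This is a minimal sketch that assumes a simple unweighted mean over the seven days, which is not necessarily how official availability figures are aggregated; the function name `weekly_mean` is illustrative. As the comments in the table note, the low values on the 24th and 25th reflect the new EMI CEs not yet being in the tests for all VOs.

```python
# Daily availability figures (percent) copied from the table above.
VOS = ("OPS", "Alice", "Atlas", "CMS", "LHCb")
DAILY = {
    "24/10/12": (100, 0, 0, 100, 0),
    "25/10/12": (100, 52.4, 29.1, 100, 46.6),
    "26/10/12": (100, 100, 96.8, 95.8, 100),
    "27/10/12": (100, 100, 100, 100, 100),
    "28/10/12": (100, 100, 100, 99.0, 100),
    "29/10/12": (100, 100, 100, 100, 100),
    "30/10/12": (100, 100, 98.6, 100, 100),
}


def weekly_mean(daily, vos):
    """Return the unweighted mean availability per VO, rounded to one decimal."""
    totals = [0.0] * len(vos)
    for row in daily.values():
        for i, value in enumerate(row):
            totals[i] += value
    return {vo: round(total / len(daily), 1) for vo, total in zip(vos, totals)}


if __name__ == "__main__":
    for vo, mean in weekly_mean(DAILY, VOS).items():
        print(f"{vo}: {mean}")
```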
