Tier1 Operations Report 2012-10-24

From GridPP Wiki

RAL Tier1 Operations Report for 24th October 2012

Review of Issues during the week 17th to 24th October 2012
  • During planned maintenance the OPN link to CERN failed over to the backup route from around 07:30 until 17:30 on Saturday 20th October.
  • While the LHCb disk servers were being rebooted during the Castor instance upgrade, one of the disk servers re-installed itself as another disk server. No data was lost, but the server was out of production until later that afternoon; a further fault was then found and fixed the following morning.
  • During the afternoon of Tuesday 23rd Oct. one of the LHCb Castor headnodes showed a significant hardware fault and was replaced.
  • The FTS service failed (with a known bug) early yesterday evening (23rd Oct). The monitoring test for this condition failed to detect the problem, and the service was down for most VOs until around 09:00 this morning (24th).
Resolved Disk Server Issues
  • GDSS454 (AtlasDataDisk - D1T0) failed on 16th Oct. It was returned to production during the afternoon of 17th October. As reported at the last meeting one file was declared lost from this server.
  • GDSS639 (GENScratchDisk - D0T0) failed on Saturday morning (20th Oct). It was returned to production on Monday afternoon (22nd Oct) after faulty memory had been replaced.
  • GDSS213 (AtlasScratchDisk - D1T0) failed on Sunday afternoon (21st Oct). It was returned to production on Monday afternoon (22nd Oct).
  • GDSS535 (LHCbDst - D1T0) The system was re-installed as another node when rebooted during the LHCb Castor upgrade on Tuesday 23rd Oct. It was returned to production later that afternoon. However, a further problem was found on this server which was fixed during the following morning (24th).
Current operational status and issues
  • At the moment we are failing the VO SUM tests for the CEs for a number of VOs. This is because those tests have not yet been updated to target the new EMI CEs.
  • On 12th/13th June the first stage of switching in preparation for the work on the main site power supply took place. The work on the two transformers is expected to last until 18th December: one half of the resilient supply is powered off for three months while it is overhauled, and the process is then repeated with the other half. The work is running to schedule. One half of the new switchboard has been refurbished and was brought into service on 17th September.
  • High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3PB of storage). The Fabric team is working to improve the uplink.
  • Investigations using perfSONAR are ongoing into asymmetric routing of data over (but not back over) the OPN. A routing problem with CNAF has been resolved. The problem also appears with the North American Tier1 sites and is being followed up.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • WMS01 updated to EMI v3.3.8
  • On 19th Oct an update to the Castor information provider removed some unnecessary references to gLite and fixed a problem with tape usage reporting.
  • 23rd Oct - LHCb Castor instance upgraded to version 2.1.12-10.
  • 23rd October: gLite CREAM CEs replaced with EMI CREAM CEs.
  • Hyperthreading continues to run on one batch of worker nodes ahead of its rollout to all suitable worker nodes.
  • As stated before: CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
  • A test queue ("gridTest") is available with (currently) four worker nodes running EMI-2/SL5. In addition, a further ten nodes (one from each hardware generation/batch) installed with EMI-2/SL5 are running as part of the normal batch system.
  • A test instance of FTS version 3 is available. The non-LHC VOs that use the existing service have been enabled on it, and we are looking for one of these VOs to test it.
Declared in the GOC DB
  • Ongoing WMS02 update to EMI v3.3.8
  • Tuesday 30th October: Upgrade of GEN Castor instance to Version 2.1.12-10.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • 20th November: Intervention required on the "Essential Power Board" and transformers. (An "At Risk").

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.12. (As detailed above).
  • Networking:
    • Install new routing layer for Tier1 and update the way the Tier1 connects to the RAL network. (Planned to be co-located with the replacement of the UKLight router.)
    • Update spine layer of the Tier1 network.
    • Replacement of UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • CEs are being upgraded to the EMI version now.
    • A rolling upgrade of the WMSs to the EMI version is underway.
    • Enabling overcommit on WNs to make use of hyperthreading (will be implemented after the CE upgrades are complete).

Updates of Grid Services as appropriate. (Services are now on EMI/UMD versions unless there is a specific reason not to be.)

  • Infrastructure:
    • Intervention required on the "Essential Power Board".
    • Remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.


Entries in GOC DB starting between 17th and 24th October 2012

There are two unscheduled outages in the GOC DB for this period. One is for the failure of one of the LHCb Castor headnodes, the other is for the new EMI CREAM CEs (not in production at that time).

Service Scheduled? Outage/At Risk Start End Duration Reason
srm-lhcb UNSCHEDULED WARNING 23/10/2012 16:30 24/10/2012 12:30 20 hours At risk due to hardware fault on Castor headnode. Services are being moved to alternative hardware.
lcgce01, lcgce02, lcgce04, lcgce10, lcgce11 SCHEDULED WARNING 23/10/2012 10:00 24/10/2012 12:00 1 day, 2 hours post EMI-2 CREAM migration
lcgce03, lcgce05, lcgce07, lcgce08, lcgce09 SCHEDULED OUTAGE 23/10/2012 09:00 30/11/2012 12:00 38 days, 4 hours replacement with EMI-2 CREAM nodes
srm-lhcb SCHEDULED OUTAGE 23/10/2012 08:00 23/10/2012 10:50 2 hours and 50 minutes Upgrade of LHCb Castor instance to Version 2.1.12-10
lcgwms02 SCHEDULED OUTAGE 21/10/2012 10:00 26/10/2012 13:00 5 days, 3 hours EMI WMS upgrade to v3.3.8
lcgce01, lcgce02, lcgce04, lcgce10, lcgce11 UNSCHEDULED OUTAGE 19/10/2012 15:00 23/10/2012 10:00 3 days, 19 hours migration to EMI-2 CREAM
lcgwms01 SCHEDULED OUTAGE 19/10/2012 13:00 22/10/2012 15:00 3 days, 2 hours EMI WMS upgrade to v3.3.8
lcgwms01 SCHEDULED OUTAGE 17/10/2012 15:00 19/10/2012 13:00 1 day, 22 hours EMI WMS update to v3.3.8
lcgwms01 SCHEDULED OUTAGE 12/10/2012 10:00 17/10/2012 15:00 5 days, 5 hours EMI WMS update to v3.3.8
Open GGUS Tickets
GGUS ID Level Urgency State Creation Last Update VO Subject
86705 Red Less Urgent In Progress 2012-10-03 2012-10-23 SNO+ RAL jobs returning errors
86690 Red Urgent In Progress 2012-10-03 2012-10-22 T2K JPKEKCRC02 missing from FTS ganglia metrics
86152 Red Less Urgent In Progress 2012-09-17 2012-10-22 correlated packet-loss on perfsonar host
68853 Red Less Urgent In Progress 2011-03-22 2012-10-23 N/A Retirement of SL4 and 32bit DPM Head nodes and Servers
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
17/10/12 96.0 100 100 100 100 CE07 had a problem (according to tests). This coincided with a block of missing data.
18/10/12 100 100 100 100 100
19/10/12 100 100 100 100 100
20/10/12 100 100 99.1 100 100 Single failure of SRM Put at 07:46 ("zero number of replicas").
21/10/12 100 100 98.2 100 100 Failures of SRM Get at 02:05 & 02:19 ("could not open connection to srm-atlas.gridpp.rl.ac.uk")
22/10/12 100 100 100 100 100
23/10/12 92.6 33.3 33.3 82.0 29.2 Mainly the effect of replacing the gLite CREAM CEs with EMI CREAM CEs. Some effect on LHCb from the Castor upgrade.