Tier1 Operations Report 2012-11-14

RAL Tier1 Operations Report for 14th November 2012

Review of Issues during the fortnight 31st October to 14th November 2012
  • On Sunday 4th Nov there was a problem with the database behind the Atlas SRM that led to an outage of the Atlas SRM for around six hours during the afternoon.
  • On Sunday morning (4th Nov), at around 05:00, the OPN link to CERN flipped such that routing became asymmetric, with packets travelling one way over the primary link and the other way over the backup. This was fixed on Monday (5th): initially both directions were moved onto the backup link, then early in the afternoon the problem was fully resolved and traffic reverted to the primary link in both directions.
  • On Saturday a problem was reported with slow data export rates for Atlas from the Tier0 to RAL. The underlying cause was not found, although the problem was resolved by Monday. It is notable that this overlapped with periods of high Atlas data rates on other links, as well as with the OPN issue and the Sunday SRM database problem referred to above.
  • On Wednesday 7th November, at around 11:30, there was a power outage that affected the RAL site and for which the backup power via the diesel generator did not work. Core services (TopBDII, FTS) were returned to service by the end of that afternoon (although a subsequent problem with the FTS service meant it was down overnight). All services (including Castor & Batch) were back by around 14:00 the next day. A Post Mortem report is being prepared for this incident.
  • Batch services were affected on Saturday (10th Nov) owing to a problem updating the CERN CRLs.
  • There was an outage of the Atlas SRM on Sunday (11th Nov) caused by a problem with the Atlas SRM database.
  • Over the weekend (10/11 Nov) some Castor disk servers did not have their time correctly synchronised, which caused some Castor access failures. A short clock-check sketch follows this list.
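
The time-synchronisation problem above is the kind of fault a routine clock check on the disk servers can flag. Below is a minimal sketch (in Python) of such a check: it sends a single SNTP query and reports the offset of the local clock. The NTP server name and the one-second alarm threshold are illustrative assumptions, not the actual Tier1 configuration or monitoring.

  # Minimal clock-offset check: send one SNTP query (RFC 4330) and compare the
  # server's transmit time against the local clock. Server name and threshold
  # are placeholders, not RAL configuration.
  import socket
  import struct
  import time

  NTP_SERVER = "ntp1.example.org"   # assumption: replace with the local site NTP server
  NTP_EPOCH_OFFSET = 2208988800     # seconds between 1900-01-01 (NTP) and 1970-01-01 (Unix)

  def ntp_offset(server, timeout=5.0):
      """Return the approximate offset (seconds) of the local clock against 'server'."""
      packet = b"\x1b" + 47 * b"\0"             # LI=0, VN=3, Mode=3 (client request)
      with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
          sock.settimeout(timeout)
          t_send = time.time()
          sock.sendto(packet, (server, 123))
          data, _ = sock.recvfrom(512)
          t_recv = time.time()
      # Transmit Timestamp is the last 8 bytes of the 48-byte reply (seconds + fraction).
      secs, frac = struct.unpack("!II", data[40:48])
      server_time = secs - NTP_EPOCH_OFFSET + frac / 2**32
      # Compare against the midpoint of the request/response round trip.
      return server_time - (t_send + t_recv) / 2

  if __name__ == "__main__":
      offset = ntp_offset(NTP_SERVER)
      status = "OK" if abs(offset) < 1.0 else "WARNING: clock not synchronised"
      print(f"offset vs {NTP_SERVER}: {offset:+.3f}s  {status}")
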
Resolved Disk Server Issues
  • GDSS565 (AtlasDataDisk - D1T0) crashed on the morning of Thursday 1st Nov. It was restarted and checked out, being returned to service the following morning (2nd).
  • GDSS436 (AtlasDataDisk - D1T0) failed with a read-only file system in the early hours of Friday (2nd Nov). It was returned to service on Saturday morning (3rd). One file was found corrupted and reported to Atlas as unrecoverable.
  • GDSS443 (AtlasDataDisk - D1T0) also failed with a read-only file system early on Friday (2nd Nov) and was also returned to service on Saturday morning (3rd). Two files were found corrupted and reported to Atlas as unrecoverable.
  • GDSS462 (AliceTape - D0T1) failed with a read-only file system on Monday evening (5th Nov). It was returned to service on Wednesday morning (7th Nov).
  • GDSS420 (AliceTape - D0T1) reported a read-only file system during the afternoon of Tuesday 6th November. After RAID verification (and a delay owing to the power cut) it was returned to production on Friday 9th Nov.
  • GDSS206, GDSS229, GDSS272, GDSS273 (all AtlasScratchDisk - D1T0). These machines had problems following the power outage. They were returned to service on Friday, 9th Nov.
  • GDSS647 (LHCbDst - D1T0) reported an inaccessible disk partition on Friday 9th Nov. A disk was replaced and the system was returned to production a couple of hours later.
  • GDSS437 (AtlasDataDisk - D1T0) reported a read-only file system in the early hours of Saturday (10th Nov). It was returned to production later that day; however, it then reported several checksum errors and was taken out of production for additional checks on Tuesday (13th), before being returned to production again this morning (14th).
Current operational status and issues
  • Should there be another power outage, the backup power via the diesel generator will not work. An investigation, hopefully a fix, and a re-test are scheduled for Tuesday 20th Nov.
  • On 12th/13th June the first stage of switching, in preparation for the work on the main site power supply, took place. The work on the two transformers is expected to take until 18th December and involves powering off one half of the resilient supply for three months while it is overhauled, then repeating with the other half. The work is running to schedule. One half of the new switchboard has been refurbished and was brought into service on 17 September.
  • High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3 PB of storage).
  • Investigations are ongoing into asymmetric bandwidth to/from other sites; in particular we are seeing some poor outbound rates. The opportunity has been taken to make further measurements while the network has been quiet after the power outage and the scheduled network intervention. An illustrative throughput-test sketch follows this list.
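
For the asymmetric-bandwidth investigation mentioned in the last item, dedicated tools such as iperf are the usual way to measure memory-to-memory rates between two hosts. As a self-contained illustration only, the Python sketch below performs a simple TCP throughput test; running it once in each direction exposes any asymmetry between inbound and outbound rates. The port and transfer size are placeholder assumptions, and this is not the tooling actually used for the Tier1 measurements.

  # Illustrative memory-to-memory throughput test between two hosts.
  # Run "python throughput.py server" on the far end, then
  # "python throughput.py client HOST" on the near end; swap roles to test
  # the other direction. Port and transfer size are arbitrary placeholders.
  import socket
  import sys
  import time

  PORT = 5201                      # assumption: any free TCP port
  CHUNK = 1024 * 1024              # 1 MiB per send
  TOTAL_BYTES = 512 * 1024 * 1024  # 512 MiB per test

  def serve():
      """Accept one connection, receive data and report the achieved rate."""
      with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
          srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
          srv.bind(("", PORT))
          srv.listen(1)
          conn, addr = srv.accept()
          with conn:
              received = 0
              start = time.time()
              while True:
                  data = conn.recv(CHUNK)
                  if not data:
                      break
                  received += len(data)
              elapsed = time.time() - start
              print(f"{addr[0]}: {received / elapsed / 1e6:.1f} MB/s inbound")

  def send(host):
      """Send TOTAL_BYTES of zeros to 'host' and report the achieved rate."""
      payload = bytes(CHUNK)
      start = time.time()
      with socket.create_connection((host, PORT)) as conn:
          sent = 0
          while sent < TOTAL_BYTES:
              conn.sendall(payload)
              sent += CHUNK
      elapsed = time.time() - start
      print(f"to {host}: {sent / elapsed / 1e6:.1f} MB/s outbound")

  if __name__ == "__main__":
      if sys.argv[1] == "server":
          serve()
      else:
          send(sys.argv[2])
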
Ongoing Disk Server Issues
  • None
Notable Changes made this last fortnight
  • On Thursday (1st Nov) a change was made to the EMI CREAM CEs to increase the number of FTP connections. This resolved a problem of intermittent SUM test failures.
  • On Thursday (1st Nov) a patch was applied to the FTS service that should cure the intermittent failures of the FTS system seen in recent weeks.
  • On Tuesday (6th Nov) a start was made on rolling out the use of hyperthreading on the worker nodes. SL09 machines are now running 10 jobs each (up from 8); Dell 11 machines (the original test batch) are now running 20 jobs each (up from 18). See the sketch after this list.
  • On Tuesday morning, 13th Nov, Castor & batch services were suspended around a network interruption while a board was changed in a network router. During the Castor stop the opportunity was taken to enable some statistics gathering on the Atlas SRM database and to make a change intended to resolve the database problems behind this service.
  • On Tuesday morning, 13th Nov, there was a minor upgrade to the CIP (Castor Information Provider) to fix a problem of accounting for nearline storage (affects LHCb, T2K & SNO+).
  • CVMFS continues to be available for testing by non-LHC VOs (including "stratum 0" facilities).
  • Test instance of FTS version 3 continues to be available.
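
As a rough illustration of the hyperthreading overcommit noted above, the sketch below assumes a Torque-style batch system in which job slots per worker node are set via "np=" entries in the server's nodes file. The hostnames and inventory handling are hypothetical; only the slot counts (SL09: 10, Dell 11: 20) come from this report.

  # Minimal sketch, assuming a Torque-style batch system where job slots per worker
  # node are set with "np=" entries in the server's nodes file. The inventory and
  # hostnames below are purely illustrative; the slot counts are those quoted above.
  SLOTS_BY_GENERATION = {
      "sl09": 10,    # up from 8 with hyperthreading overcommit
      "dell11": 20,  # up from 18 with hyperthreading overcommit
  }

  def nodes_file_lines(worker_nodes):
      """Map {hostname: hardware generation} to 'hostname np=N' nodes-file lines."""
      return [
          f"{host} np={SLOTS_BY_GENERATION[generation]}"
          for host, generation in sorted(worker_nodes.items())
      ]

  if __name__ == "__main__":
      # Hypothetical worker-node inventory, for illustration only.
      inventory = {
          "wn0001.example.rl.ac.uk": "sl09",
          "wn0002.example.rl.ac.uk": "dell11",
      }
      print("\n".join(nodes_file_lines(inventory)))
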
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • 20th November: Intervention on the "Essential Power Board" and investigation into the panel that controls the diesel generator cut-in, followed by a UPS load test.
  • Continued roll-out of the use of hyperthreading on the worker nodes.
  • Plans are advanced for a migration of worker nodes to EMI-2/SL5 and this will start soon.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • None
  • Networking:
    • Network trunking change as part of investigation (& possible fix) into asymmetric data rates.
    • Improve the stack 13 uplink.
    • Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network.
    • Update Spine layer for Tier1 network.
    • Replacement of UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Enabling overcommit on WNs to make use of hyperthreading (will be implemented after the CE upgrades are complete).
    • Migration to EMI software for worker nodes.
  • Infrastructure:
    • Intervention required on the "Essential Power Board" (scheduled for 20th November) & Remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.


Entries in GOC DB starting between 31st October and 14th November 2012

There were three unscheduled outages in the GOC DB for this period: one when there were problems with the Atlas SRM database (followed by an unscheduled warning period), and two as a result of the RAL site power cut.

Service Scheduled? Outage/At Risk Start End Duration Reason
All CEs (batch) and Castor SCHEDULED OUTAGE 13/11/2012 08:30 13/11/2012 09:30 1 hour Storage (Castor) and batch paused while network router card replaced.
lcglb01.gridpp.rl.ac.uk, SCHEDULED OUTAGE 10/11/2012 12:00 30/11/2012 14:00 20 days, 2 hours host retirement
lcglb02.gridpp.rl.ac.uk, SCHEDULED OUTAGE 10/11/2012 12:00 30/11/2012 14:00 20 days, 2 hours host retirement
All CEs (batch) and Castor UNSCHEDULED OUTAGE 08/11/2012 12:00 08/11/2012 14:00 2 hours Storage (Castor) and batch services still down following yesterday's Power Outage.
Whole site UNSCHEDULED OUTAGE 07/11/2012 11:15 08/11/2012 12:00 1 day, 45 minutes Power cut at RAL
srm-atlas UNSCHEDULED WARNING 04/11/2012 18:28 05/11/2012 12:00 17 hours and 32 minutes At-risk on ATLAS SRM following the problems on Oracle DB
srm-atlas UNSCHEDULED OUTAGE 04/11/2012 12:00 04/11/2012 18:29 6 hours and 29 minutes Outage while we investigate problems on the Oracle DB behind Atlas SRM
srm-lhcb SCHEDULED WARNING 31/10/2012 11:00 31/10/2012 12:00 1 hour Swapping one of the Castor headnodes back following repair after hardware failure.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
86690 Red Urgent In Progress 2012-10-03 2012-11-06 T2K JPKEKCRC02 missing from FTS ganglia metrics
86152 Red Less Urgent On Hold 2012-09-17 2012-10-31 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
31/10/12 100 100 100 100 100
01/11/12 97.0 100 100 95.9 100 OPS - Monitoring problem in Regional Nagios; CMS - timeout.
02/11/12 100 100 95.6 87.7 91.7 Mainly "could not open connection to srm-cms..." errors that correlate with failures at other sites.
03/11/12 100 100 93.8 95.9 91.8 Almost all "could not open connection to srm-cms..." errors that correlate with failures at other sites.
04/11/12 100 100 75.6 100 100 Problem with Atlas SRM database.
05/11/12 100 100 95.8 95.9 100 All "could not open connection to srm-cms..." errors that correlate with failures at other sites.
06/11/12 100 100 100 100 100
07/11/12 60.6 95.0 57.4 57.8 55.6 Site-wide power cut.
08/11/12 39.9 47.9 39.2 34.4 46.2 Site-wide power cut.
09/11/12 100 100 100 100 100
10/11/12 85.2 94.3 100 62.7 100 Mainly CE test failures following problem updating CRLs.
11/11/12 100 100 84.0 79.4 100 Atlas: Problem with SRM Database; CMS: "user timeouts" in Castor.
12/11/12 100 100 89.1 95.9 100 Atlas: Config error stopped CRL update; CMS: "user timeout" in Castor.
13/11/12 95.8 97.5 95.8 91.4 97.5 Scheduled outage for network router board swap.