Tier1 Operations Report 2012-02-08

From GridPP Wiki

RAL Tier1 Operations Report for 8th February 2012

Review of Issues during the week 1st to 8th February 2012.

  • During the last week, following the SRM update, the Atlas SRM was repeatedly crashing. It was being restarted automatically, but there were periods when one of the SRMs was not responding. On Thursday (2nd Feb) a more aggressive re-starter was put in place; on Friday (3rd) the cause was identified (related to the method used to update the gridmap files). An effective temporary workaround has been put in place until the final fix is rolled out.
  • On Tuesday afternoon, 7th Feb, there was a network problem that lasted around an hour and particularly affected off-site connectivity to the Tier1. This was traced to a broadcast storm triggered by some equipment added elsewhere to the network.
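The "more aggressive re-starter" mentioned above could take roughly the following shape. This is only an illustrative sketch: the daemon name, the probe, and the restart action are assumptions, not the actual RAL SRM configuration.

```shell
#!/bin/sh
# Hypothetical SRM watchdog sketch. The service name below is an
# assumption; a real deployment would probe the SRM endpoint itself.
SERVICE="srmdaemon"   # assumed daemon name, not the real one

# decide_action: given a count of live daemon processes, print the
# action the watchdog should take ("restart" or "ok")
decide_action() {
    if [ "$1" -eq 0 ]; then
        echo "restart"
    else
        echo "ok"
    fi
}

# Count matching processes; treat a failed probe as zero live daemons.
live=$(pgrep -cf "$SERVICE" 2>/dev/null || echo 0)

if [ "$(decide_action "$live")" = "restart" ]; then
    echo "watchdog: $SERVICE appears down; would restart it"
    # service "$SERVICE" restart   # real action, left out of this sketch
fi
```

Run from cron every minute or so, such a loop keeps the window of unresponsiveness short while the underlying bug (here, the gridmap file update method) is diagnosed.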

Resolved Disk Server Issues

  • None.

Current operational status and issues.

  • The performance issue with the hardware being used to temporarily house the Castor databases has been largely alleviated by optimisations carried out by the database team. Furthermore, the cause of the Atlas SRM crashes has been understood and fixed. Whilst the performance of this disk array is lower than would be liked, this is a temporary arrangement and work is progressing on moving the Castor databases to their final hardware.
  • Since around Thursday (2nd Feb) LHCb have had pilot job submission problems caused by old jobs 'stuck' in the CEs. (Subject of GGUS ticket #78873 - see below)
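Clearing 'stuck' jobs of this kind usually means selecting job IDs past some age threshold and purging them from the CE. The sketch below shows only the selection step; the job-list format, cutoff, and endpoint are illustrative assumptions, and the actual purge (via the gLite CREAM CLI's job-purge command) is mentioned but not scripted here.

```shell
#!/bin/sh
# Illustrative cleanup sketch for aged CE jobs. All names and the
# input format ("<jobid> <age-in-days>") are assumptions.
CUTOFF_DAYS=7   # assumed cutoff for calling a job 'stuck'

# select_old_jobs: read lines of "<jobid> <age-in-days>" on stdin and
# print the IDs of jobs at or beyond the cutoff
select_old_jobs() {
    awk -v cutoff="$CUTOFF_DAYS" '$2 >= cutoff { print $1 }'
}

# Example with a fabricated listing; a real run would build this list
# from the CE's job-status output.
printf '%s\n' \
    "https://ce.example.org:8443/CREAM123 12" \
    "https://ce.example.org:8443/CREAM456 2" | select_old_jobs
# Each selected ID would then be handed to the CREAM job-purge command.
```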

Ongoing Disk Server Issues

  • None

Notable Changes made this last week

  • Castor Information Provider (CIP) updated on Monday (6th Feb.)
  • Network intervention (on C300 switch and network in UPS room) as well as re-positioning six racks of disk servers. (Ongoing today Wed 8th Feb.)
  • First phase of the move to the new (UMD) batch server, with non-LHC VOs being moved across. (Ongoing today Wed 8th Feb.)

Forthcoming Work & Interventions

  • Following the Castor Nameserver Upgrade to version 2.1.11 on Tuesday 14th February, it is planned to update the Castor stagers on:
    • Monday 20 Feb CMS
    • Wednesday 22 Feb ATLAS
    • Monday 27 Feb LHCb
    • Wednesday 29 Feb Gen

Declared in the GOC DB

  • Re-installation of WMS03 for upgrade to UMD distribution (ongoing)
  • MyProxy service being switched to a new machine, effectively upgrading from gLite 3.1 to UMD versions. (At-Risk on Thursday 9th Feb.)
  • Upgrade of the Castor Nameserver to version 2.1.11 on Tuesday 14th February. This will take place at the same time as Atlas move their LFC from RAL to CERN. The Atlas TAGS system will also be moved during this time and the batch drain will be used to migrate LHC batch work to the new (UMD) batch server.

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. We are carrying out a significant amount of work during the current LHC stop.

  • Infrastructure:
    • Move part of the cooling system onto the UPS supply. (Should not require service interruption.) (Planned for Tuesday 28th February).
  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch Castor and LFC/FTS/3D to new Database Infrastructure (started)
  • Castor:
    • Castor 2.1.11 upgrade (includes replacement of the hardware running the Castor head nodes).
    • Move to use Oracle 11g.
  • Networking:
    • Changes required to extend range of addresses that route over the OPN.
    • Install new Routing & Spine layers.
  • Fabric:
    • BIOS/firmware updates and other re-configurations (adding IPMI cards, etc.)
    • Network changes, including:
      • Changes to accommodate new networking equipment.
  • Grid Services:
    • Updates of Grid Services (including WMS, FTS, LFC front ends) to EMI/UMD versions.

Entries in GOC DB starting between 1st and 8th February 2012.

There was one unscheduled outage during this period, for problems with the Atlas Castor instance on Monday (30th Jan.)

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole site | SCHEDULED | OUTAGE | 08/02/2012 09:00 | 08/02/2012 16:00 | 7 hours | Outage for intervention on core network within the RAL Tier1.
All CEs (All batch) | SCHEDULED | OUTAGE | 07/02/2012 21:00 | 08/02/2012 09:00 | 12 hours | Drain of batch system ahead of intervention on core network within the RAL Tier1.
lcgwms03 | SCHEDULED | OUTAGE | 07/02/2012 15:00 | 10/02/2012 15:00 | 3 days | System unavailable - EMI installation
All Castor | SCHEDULED | WARNING | 06/02/2012 11:00 | 06/02/2012 14:00 | 3 hours | Update to the Castor Information Provider (CIP). This will not affect the Castor service directly, but will change some of the information reported to the BDIIs.
srm-lhcb | SCHEDULED | OUTAGE | 01/02/2012 10:00 | 01/02/2012 12:00 | 2 hours | SRM Upgrade to version 2.11.

Open GGUS Tickets

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
78873 | Amber | Urgent | In Progress | 2012-02-02 | 2012-02-07 | LHCb | Redundant jobs at CREAM CEs at RAL
77026 | Red | Less Urgent | Waiting Reply | 2011-12-05 | 2012-02-03 | | BDII
74353 | Red | Very Urgent | In Progress | 2011-09-16 | 2012-02-01 | Pheno | Proxy not renewing properly from WMS
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2011-12-15 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)