Tier1 Operations Report 2012-01-11

From GridPP Wiki

RAL Tier1 Operations Report for 11th January 2012

Review of Issues during the week 4th to 11th January 2012.

  • On Thursday 5th Jan there was a planned intervention on the Castor databases, along with some rack moves, which meant the APEL & FTS services were unavailable for some time during the day. During the rack moves it was necessary to power down the SRM nodes (amongst others). One of the Atlas SRM nodes failed to reboot and was removed from the DNS alias for srm-atlas.
  • On Friday 6th January there was a problem with the Atlas SRM. A reconfigured system was added to the srm-atlas alias around lunchtime but, owing to a misunderstanding, the appropriate firewall holes had not been opened for it. This was resolved during the afternoon.
  • On Tuesday 10th January there was a planned update to two of the DNS servers at RAL. These two servers were used by the Tier1, but in preparation we had reconfigured DNS on most of our systems to stop using them. However, this change had an unexpected side effect that appeared during the morning: reverse DNS lookups for Tier1 systems failed from outside RAL. It took some time to recognise the cause of the problems we were seeing (e.g. some, but not all, SAM tests failed). The problem itself was fixed late on Tuesday evening, and the RAL Tier1 downtime was ended on Wednesday morning.
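The kind of check that exposes this class of problem (forward resolution of a round-robin alias succeeding while reverse lookups of the underlying addresses fail) can be sketched in a few lines of Python. The srm-atlas hostname is taken from the report above; the helper functions are illustrative only and not part of any RAL tooling:

```python
import socket

def forward_addresses(name):
    """Return the sorted IPv4 addresses behind a DNS name or round-robin alias."""
    try:
        infos = socket.getaddrinfo(name, None, socket.AF_INET)
    except socket.gaierror:
        # Name does not resolve (or no resolver available).
        return []
    return sorted({info[4][0] for info in infos})

def reverse_lookup(addr):
    """Return the PTR hostname for an IP address, or None if reverse DNS fails."""
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(addr)
        return hostname
    except (socket.herror, socket.gaierror):
        return None

if __name__ == "__main__":
    # Check each address behind the alias; a None here is the symptom
    # seen from outside RAL on 10th January.
    for addr in forward_addresses("srm-atlas.gridpp.rl.ac.uk"):
        print(addr, "->", reverse_lookup(addr) or "REVERSE LOOKUP FAILED")
```

Run from a host outside the site, a script like this would have flagged the failing PTR records even while the forward lookups (and hence many SAM tests) continued to pass.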

Resolved Disk Server Issues

  • None

Current operational status and issues.

  • None.

Ongoing Disk Server Issues

  • None.

Notable Changes made this last week

  • Migration of the Castor databases to new hardware (Thursday 5th Jan). This is the first step in a series of changes to move all the Oracle databases to new hardware.
  • Errata & kernel updates applied to CEs & worker nodes. (Thursday 5th Jan)
  • Relocation of some racks of older equipment to make space in the computer room for new cooling arrangements. (Thursday 5th Jan)
  • lcgui01 updated to the UMD version of the UI. (Monday 9th January.)
  • Update of final pair of DNS servers at RAL to new hardware. (Tuesday 10th January).

Forthcoming Work & Interventions

  • Monday/Tuesday 16/17 January: Atlas is down centrally. We will be running various tests, but our services will remain up.
  • During week 16-20 Jan: SRM updates to version 2.11.

Declared in the GOC DB

  • Monday 16th January (12:00 - 14:00) APEL update to UMD distribution.

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. There is a significant amount of work that will need to be done during the LHC stop at the start of 2012.

  • Infrastructure:
    • Move part of the cooling system onto the UPS supply. (Should not require service interruption.)
  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch Castor and LFC/FTS/3D to new Database Infrastructure (started)
  • Castor:
    • Castor 2.1.11 upgrade.
    • Replace hardware running Castor Head Nodes.
    • Move to use Oracle 11g.
    • Update to the Castor Information Provider (CIP).
  • Networking:
    • Changes required to extend range of addresses that route over the OPN.
    • Re-configure networking in UPS room.
    • Install new Routing & Spine layers.
    • Final updates to the RAL DNS infrastructure (two DNS servers still to replace)
  • Fabric:
    • BIOS/firmware updates; equipment moves in the machine room (consolidation of equipment in racks; some rack moves); other re-configurations (adding IPMI cards, etc.)
  • Grid Services:
    • Updates of Grid Services (including LB, APEL, batch server) to UMD versions (mainly in new year).
  • VO:
    • Migrate Atlas LFC to CERN.

Entries in GOC DB starting between 4th and 11th January 2012.

There were two unscheduled outages during this period. These were both relating to the DNS (reverse lookup) problem of Tuesday 10th January.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole site | UNSCHEDULED | OUTAGE | 10/01/2012 17:00 | 11/01/2012 08:40 | 15 hours and 40 minutes | Putting whole site in downtime while we investigate network/DNS problems. Extended until 12:00, 11th January 2012.
Whole site | UNSCHEDULED | OUTAGE | 10/01/2012 12:00 | 10/01/2012 17:00 | 5 hours | Putting whole site in downtime while we investigate network/DNS problems.
lcgui01.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 09/01/2012 10:00 | 09/01/2012 13:00 | 3 hours | Outage to upgrade the UI middleware.
All Castor (srm-endpoints), lcgapel0676, lcgftm, lcgfts | SCHEDULED | OUTAGE | 05/01/2012 08:30 | 05/01/2012 16:00 | 7 hours and 30 minutes | Castor unavailable during migration of its Oracle databases to new hardware. Some re-organisation of racks in the computer room at the same time affects APEL and FTS.
All CEs | SCHEDULED | OUTAGE | 04/01/2012 20:00 | 05/01/2012 16:00 | 20 hours | Batch unavailable (with drain beforehand) during intervention on the Castor system.

Open GGUS Tickets

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
78063 | Green | Urgent | In Progress | 2012-01-11 | 2012-01-11 | CMS | Unable to create interface with rfcp-RAL protocol: Protocol rfcp-RAL not supported or unknown
77026 | Red | Less Urgent | In Progress | 2011-12-05 | 2011-12-15 | | BDII
74353 | Red | Very Urgent | In Progress | 2011-09-16 | 2011-12-07 | Pheno | Proxy not renewing properly from WMS
68853 | Red | Less Urgent | On Hold | 2011-03-22 | 2011-12-15 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)
68077 | Red | Less Urgent | In Progress | 2011-02-28 | 2011-09-20 | | Mandatory WLCG InstalledOnlineCapacity not published
64995 | Red | Less Urgent | In Progress | 2010-12-03 | 2011-09-20 | | No GlueSACapability defined for WLCG Storage Areas