RAL Tier1 Operations Report for 4th January 2012

The report covers the Christmas period during which Tier1 operations continued smoothly.

Review of Issues during the two weeks 21st December 2011 to 4th January 2012.

During the afternoon of Wednesday 21st December, between 13:09 and 13:45, there was a break in the network link between the Site Access Router (SAR) and the UKLight Router. This affected data transfers that do not go over the OPN.
The same link was down again from 01:30 to 08:45 on the morning of Thursday 22nd December. Again, data transfers that do not go over the OPN were affected. This was retrospectively added to the GOC DB as an unscheduled warning.
On Wednesday/Thursday 28/29th December regular checks detected a high rate of failures for accesses to AliceDisk. Remedial action was taken: the number of xrootd job slots was increased from 50 to 100 to accommodate all the requests, which had been timing out.
We have reported two lost files to LHCb in the first couple of days back after the holiday. These were separate incidents. One was picked up by the checksum checker (and followed the failure of GDSS463), the other was picked up following a failing FTS transfer and was recorded as having a size of zero in the Castor database.
A problem with CVMFS reported by Atlas, which was present over the holiday period, was traced to a replication failure at CERN.
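The two lost LHCb files described above were caught by per-file checks: one by a checksum mismatch and one by a recorded size of zero. The following is a minimal sketch of such a check, not the Castor checksum checker itself. It assumes Adler-32 checksums, and the function names `adler32_of_file` and `check_file` are hypothetical, introduced only for illustration.

```python
import os
import zlib


def adler32_of_file(path, chunk_size=1 << 20):
    """Return the Adler-32 checksum of a file as 8 hex digits."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def check_file(path, expected_size, expected_adler32):
    """Return a list of problems found; an empty list means the file looks intact."""
    problems = []
    actual_size = os.path.getsize(path)
    if actual_size == 0:
        problems.append("file has zero size")
    if actual_size != expected_size:
        problems.append(f"size {actual_size} differs from expected {expected_size}")
    actual_checksum = adler32_of_file(path)
    if actual_checksum != expected_adler32.lower():
        problems.append(
            f"checksum {actual_checksum} differs from expected {expected_adler32}"
        )
    return problems
```

A file would be reported as lost or corrupt whenever `check_file` returns a non-empty list; the expected size and checksum would come from whatever catalogue holds them.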

Resolved Disk Server Issues

GDSS332 (LHCbDst - D1T0), which was reported as having failed during the morning of Wednesday 21st December, was returned to production during the afternoon of that day.
GDSS307 (CMSWanIn - D0T1) failed with a read-only file system on the evening of 26th December. It was returned to production the following morning following an intervention on site.
GDSS463 (LHCbDst - D1T0) failed with a read-only file system on the afternoon of 31st December. It was returned to production around lunchtime the following day.

Current operational status and issues.

None.

Ongoing Disk Server Issues

None.

Notable Changes made this last fortnight

None.

Forthcoming Work & Interventions

Tuesday 10th January: Update of final pair of DNS servers at RAL to new hardware.

Declared in the GOC DB

Thursday 5th January: First step of the migration of the Oracle database infrastructure. This is the move of the Castor databases and will cause an outage of Castor for some hours (with a batch drain beforehand). The opportunity will also be taken to move some racks of older nodes (required to create space for new deliveries) and to apply some patches to worker nodes.
Monday 9th January: Lcgui01 will be updated to the UMD version of the UI.

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. A significant amount of work will need to be done during the LHC stop at the start of 2012.

Infrastructure:
- Move part of the cooling system onto the UPS supply. (Should not require service interruption.)
Databases:
- Regular Oracle "PSU" patches are pending.
- Switch Castor and LFC/FTS/3D to new Database Infrastructure.
Castor:
- Castor 2.1.11 upgrade.
- SRM 2.11 upgrade
- Replace hardware running Castor Head Nodes.
- Move to use Oracle 11g.
- Update to the Castor Information Provider (CIP).
Networking:
- Changes required to extend range of addresses that route over the OPN.
- Re-configure networking in UPS room.
- Install new Routing & Spine layers.
- Final updates to the RAL DNS infrastructure (two DNS servers still to replace)
Fabric:
- BIOS/firmware updates; equipment moves in the machine room (consolidation of equipment in racks and some rack moves); other re-configurations (adding IPMI cards, etc.).
Grid Services:
- Updates of Grid Services (including LB, APEL, batch server) to UMD versions (mainly in new year).
VO:
- Address permissions problem regarding Atlas User access to all Atlas data.
- Migrate Atlas LFC to CERN.

Entries in GOC DB starting between 21st December 2011 and 4th January 2012.

There was one unscheduled entry during this period: a warning covering the failure of the network link between the SAR and the UKLight router.

Service	Scheduled?	Outage/At Risk	Start	End	Duration	Reason
All CEs (all batch)	SCHEDULED	OUTAGE	04/01/2012 20:00	05/01/2012 16:00	20 hours	Batch unavailable (with drain beforehand) during intervention on Castor system.
All Castor storage.	UNSCHEDULED	WARNING	22/12/2011 01:30	22/12/2011 08:45	7 hours and 15 minutes	Service degradation at RAL for all SRMs. A network problem caused some file transfers to fail at RAL.
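
The Duration column above follows directly from the Start and End timestamps. As a quick cross-check, the arithmetic can be sketched as below. This is only an illustration: the helper name `downtime_duration` is hypothetical, and the timestamp format is assumed to be day/month/year as it appears in the table.

```python
from datetime import datetime

# Assumed timestamp layout: day/month/year hour:minute, as in the table above.
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start, end):
    """Return the elapsed time between two table timestamps as text."""
    delta = datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    if minutes:
        return f"{hours} hours and {minutes} minutes"
    return f"{hours} hours"


# Scheduled batch outage during the Castor intervention.
print(downtime_duration("04/01/2012 20:00", "05/01/2012 16:00"))  # 20 hours
# Unscheduled warning for the network link failure.
print(downtime_duration("22/12/2011 01:30", "22/12/2011 08:45"))  # 7 hours and 15 minutes
```

Both results agree with the durations recorded in the table.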

Open GGUS Tickets


GGUS ID	Level	Urgency	State	Creation	Last Update	VO	Subject
77739	Red	Less urgent	In Progress	2011-12-25	2011-12-25	CMS	[sr #125424] T1_UK_RAL Job Robot error
77528	Red	Less urgent	In Progress	2011-12-16	2011-12-16	H1	hone jobs cannot be submitted through lcgwms03.gridpp.rl.ac.uk wms-server
77026	Red	Less Urgent	On Hold	2011-12-05	2011-12-15		BDII
74353	Red	very urgent	In Progress	2011-09-16	2011-12-07	Pheno	Proxy not renewing properly from WMS
68853	Red	less urgent	On hold	2011-03-22	2011-12-15		Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)
68077	Red	less urgent	in progress	2011-02-28	2011-09-20		Mandatory WLCG InstalledOnlineCapacity not published
64995	Red	less urgent	in progress	2010-12-03	2011-09-20		No GlueSACapability defined for WLCG Storage Areas
