Tier1 Operations Report 2012-03-07

RAL Tier1 Operations Report for 7th March 2012

Review of Issues during the week 29th February to 7th March 2012.

  • The Castor database saw very heavy load from Atlas during the early hours of Saturday morning (3rd March), which appears to have been linked to a particular workload.
  • On Saturday (3rd March) there was a failure of the network link between the UKlight and SAR routers (this affects transfers to/from Tier2s) that lasted a bit under an hour around lunchtime. A fibre transceiver in this link was replaced on Tuesday morning (6th March) while the FTS was down for a scheduled intervention.

Resolved Disk Server Issues

  • GDSS513 (LHCbDst - D1T0) was removed from production on Wednesday 29th February following multiple drive failures. It was returned to service the following morning (1st March).

Current operational status and issues.

  • There is a known issue with the Atlas SRMs, which is being investigated. A patched version that provides a workaround and stops the SRMs crashing has been rolled out. The remaining impact of this problem is minimal.

Ongoing Disk Server Issues

  • None.

Notable Changes made this last week

  • Thursday (1st March): One of the two batches of new worker nodes was moved into production.
  • Tuesday (6th March): The Castor databases were moved to the final hardware configuration for the main database, with Oracle Data Guard enabled to synchronize updates to the backup database (see the sketch after this list).
  • Tuesday (6th March): The FTS was updated to version 2.2.8 (still using Oracle 10).
  • Tuesday (6th March): A network routing change was applied to extend the range of addresses that route over the OPN.
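
As an illustration only (not taken from the report, and not the team's own tooling), the sketch below shows the kind of check that confirms a Data Guard standby is receiving redo from the primary: it queries the standard Oracle views for the database role and the status of each redo transport destination. The monitoring account, password and DSN are placeholders.

    # Minimal sketch, assuming a monitoring account exists on the primary Castor
    # database; the user, password and DSN below are placeholders, not real values.
    import cx_Oracle

    conn = cx_Oracle.connect("monitor_user", "monitor_password", "primary-db-host:1521/CASTOR")
    cur = conn.cursor()

    # Confirm the role and protection mode of the database we are connected to.
    cur.execute("SELECT database_role, protection_mode FROM v$database")
    print(cur.fetchone())

    # List each active redo transport destination; the standby should report
    # status VALID with no error recorded.
    cur.execute("SELECT dest_name, status, error FROM v$archive_dest_status WHERE status <> 'INACTIVE'")
    for dest_name, status, error in cur:
        print(dest_name, status, error or "no error")

    cur.close()
    conn.close()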

Forthcoming Work & Interventions

  • The Tier1 internal mail server ("Pat") will be replaced in the next couple of weeks.

Declared in the GOC DB

  • None

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. We are carrying out a significant amount of work during the current LHC stop.

  • Databases:
    • Regular Oracle "PSU" patches are pending for SOMNUS (LFC & FTS).
    • Switch LFC/FTS/3D to new Database Infrastructure.
    • Update LFC/FTS databases to Oracle 11.
  • Castor:
    • Update the Castor Information Provider (CIP) (needs to be re-scheduled).
    • Move to use Oracle 11g (requires a minor Castor update).
  • Networking:
    • Install new Routing & Spine layers for Tier1 network.
    • Main RAL network updates - early summer.
  • Fabric:
    • BIOS/firmware updates and other re-configurations (adding IPMI cards, etc.).
  • Grid Services:
    • Updates of Grid Services (including WMS, MyProxy, LFC front ends) to EMI/UMD versions.

Entries in GOC DB starting between 29th February and 7th March 2012.

There were no unscheduled outages during this period.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor & CEs (batch) | SCHEDULED | OUTAGE | 06/03/2012 10:00 | 06/03/2012 13:30 | 3 hours and 30 minutes | Castor outage during migration of the Castor Oracle databases to new hardware.
lcgfts.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 06/03/2012 08:00 | 06/03/2012 13:05 | 5 hours and 5 minutes | Upgrade to FTS 2.2.8. This includes starting with a fresh database, so all channels are drained and any transfers waiting in the ready queue will be lost.
Castor GEN instance: srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k | SCHEDULED | OUTAGE | 29/02/2012 08:00 | 29/02/2012 12:00 | 4 hours | Update of the GEN Castor instance to version 2.1.11-8.

Open GGUS Tickets

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
79867 | Green | Less Urgent | In Progress | 2012-03-04 | 2012-03-07 | SNO+ | snoplus.snolab.ca LFC
79428 | Red | Less Urgent | Waiting Reply | 2012-02-21 | 2012-03-07 | SNO+ | glite-wms-job aborted
74353 | Red | Very Urgent | In Progress | 2011-09-16 | 2012-03-02 | Pheno | Proxy not renewing properly from WMS
68853 | Red | Less Urgent | On Hold | 2011-03-22 | 2012-03-02 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)