Tier1 Operations Report 2012-02-01

From GridPP Wiki

RAL Tier1 Operations Report for 1st February 2012

Review of Issues during the week 25th January to 1st February 2012.

  • On Thursday (26th Jan) it was noted that there had been problems running Alice batch jobs for the last day or so. The solution was to disable CVMFS for Alice, as it was making some old versions of software available to the worker nodes.
  • Over the weekend we were failing to run LHCb batch jobs (and failing SAM tests) owing to a configuration error introduced when moving LHCb away from the high-memory (6GB) queue.
  • On Monday (30th Jan) there were problems with the Atlas Castor instance. The update of the Atlas SRM was successful; however, by coincidence the Atlas Castor stager database suffered a fragmentation problem. This badly affected performance and a four-hour outage was declared for the Atlas SRM.

Resolved Disk Server Issues

  • On Sunday (29th Jan) GDSS336 (GEN tape - D0T1) failed with a read-only file system. This system was already in the process of being withdrawn from service.
  • On Monday (30th Jan) GDSS445 (AtlasDataDisk - D1T0) was out of service for a little under two hours while memory was replaced.

Current operational status and issues.

  • We are seeing some problems with the Atlas Castor instance. The SRM update has fixed a problem of database deadlocks; however, this has exposed a performance issue with the hardware temporarily housing the Castor databases.

Ongoing Disk Server Issues

  • None

Notable Changes made this last week

  • GEN SRM upgraded to Version 2.11 (Thursday 26th Jan.)
  • Atlas SRM upgraded to Version 2.11 (Monday 30th Jan.)
  • Move of T10KB tape servers (only used by CMS) (Tuesday 31st Jan.)
  • Upgrade of LHCb databases (3D & LFC) to Oracle 11. (Tuesday 31st Jan.)
  • LHCb SRM upgraded to Version 2.11 (Wednesday 1st Feb.)

Forthcoming Work & Interventions

  • Network intervention to move the main C300 switch & re-configure networking in UPS room. Site Outage on Wednesday 8th Feb.
  • Atlas plan to move their LFC from RAL to CERN on Tuesday 14th February.
  • Castor 2.1.11 upgrade. This will start with the Nameserver update on Tuesday 14th February. (Will require Castor down.)
    • This will be followed by outages of the individual Castor instances as they are upgraded during the following week or so. (Proposed dates: CMS - Thu 16th Feb; Atlas - Mon 20th Feb; LHCb & GEN - Wed 22nd Feb.)
  • Updates to main batch server (Will require Farm drain) and MyProxy to be scheduled.
  • We are investigating offering a shared disk pool to all non-LHC VOs.

Declared in the GOC DB

  • Re-installation of WMS03 for upgrade to UMD distribution (ongoing)

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. We are carrying out a significant amount of work during the current LHC stop.

  • Infrastructure:
    • Move part of the cooling system onto the UPS supply. (Should not require service interruption.) (Planned for Tuesday 28th February).
  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch Castor and LFC/FTS/3D to new Database Infrastructure (started)
  • Castor:
    • Castor 2.1.11 upgrade (includes Replacement of hardware running Castor Head Nodes.)
    • Move to use Oracle 11g.
  • Networking:
    • Changes required to extend range of addresses that route over the OPN.
    • Install new Routing & Spine layers.
  • Fabric:
    • BIOS/firmware updates; equipment moves in the machine room (consolidation of equipment in racks; some rack moves); other re-configurations (adding IPMI cards, etc.)
    • Network changes, including:
      • Changes to accommodate new networking equipment.
  • Grid Services:
    • Updates of Grid Services (including WMS, batch server, myProxy, FTS, LFC front ends) to EMI/UMD versions.

Entries in GOC DB starting between 25th January and 1st February 2012.

There was one unscheduled outage during this period, declared for the problems with the Atlas Castor instance on Monday (30th Jan.)

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-lhcb.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 01/02/2012 10:00 | 01/02/2012 12:00 | 2 hours | SRM upgrade to version 2.11.
lhcb-lfc.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 31/01/2012 09:00 | 31/01/2012 15:30 | 6 hours and 30 minutes | Update of Oracle software used by LHCb LFC and 3D databases.
lcgce09.gridpp.rl.ac.uk, srm-atlas.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 30/01/2012 16:45 | 30/01/2012 20:45 | 4 hours | Outage while we investigate problems on the ATLAS SRM.
lcgwms03.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 30/01/2012 10:00 | 07/02/2012 12:00 | 8 days, 2 hours | System unavailable for drain then re-installation of operating system.
srm-atlas.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 30/01/2012 10:00 | 30/01/2012 12:00 | 2 hours | Update of Atlas SRM to version 2.11.
Castor GEN instance (srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k) | SCHEDULED | OUTAGE | 26/01/2012 10:00 | 26/01/2012 11:25 | 1 hour and 25 minutes | Update to SRM 2.11 for Castor GEN instance.

Open GGUS Tickets

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
77026 | Red | less urgent | Waiting Reply | 2011-12-05 | 2012-01-16 | | BDII
74353 | Red | very urgent | In Progress | 2011-09-16 | 2012-01-31 | Pheno | Proxy not renewing properly from WMS
68853 | Red | less urgent | On hold | 2011-03-22 | 2011-12-15 | | Retirement of SL4 and 32bit DPM Head Nodes and Servers (Holding Ticket for Tier2s)
68077 | Red | less urgent | In Progress (status reset after automatic ticket closing in error) | 2011-02-28 | 2012-01-26 | | Mandatory WLCG InstalledOnlineCapacity not published
64995 | Red | less urgent | In Progress (status reset after automatic ticket closing in error) | 2010-12-03 | 2012-01-26 | | No GlueSACapability defined for WLCG Storage Areas