RAL Tier1 Operations Report for 27th June 2012
Review of Issues during the fortnight 13th to 27th June 2012
- The update of the Oracle database behind the FTS and (non-LHC VO's) LFC on Wednesday 13th June ran into problems. The FTS service was restored during the afternoon but the FTS database had to be re-created which caused the queue of transfers to be lost. The LFC database was in the end successfully moved to an Oracle 11 database but it was not available again until the morning of Friday 15th June. This issue has been the subject of a post mortem - the report can be seen at https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20120613_Oracle11_Update_Failure
- On Friday (15th June) there was a hardware failure on one of the Castor headnodes for the GEN instance. The functionality was moved across to one of the other headnodes during the afternoon. A short outage was declared for the Castor GEN instance.
- There was a problem with the routing of inbound packets from the OPN to RAL: from 13th to 21st June these were routed over the production network, although outbound packets were routed correctly over the OPN link. This was triggered by a change made to complete the extension of the address range used for our disk servers. It caused a problem with file transfers between RAL and FZK only; FZK put a workaround in place for this particular problem on the 18th.
- There were problems with CMS SUM tests and CMS FTS transfers over the weekend of 23/24 June. These were caused by two problems - one resolved by restarting the SRMs, the other by changing a poor execution plan in the database.
- There was a spate of SUM test failures for the CEs in the morning of Tuesday 26th June. These were caused by the recurrence of a problem seen before between the CEs and the batch server, and were fixed by restarting torque/maui on the batch server.
Resolved Disk Server Issues
Current operational status and issues
- On Tuesday 12th June the first stage of electrical switching in preparation for the work on the main site power supply took place. This initial switching was expected to be completed on 13th June. The work on the two transformers is expected to take until 18th December.
- There is still a problem with the reporting of disk capacity to be followed up.
Ongoing Disk Server Issues
- GDSS607 (LHCbDst - D1T0) has been drained and, following further problems, is still undergoing re-acceptance testing after its earlier failures. Comment from Kash: the machine appears to be finally working well (after a firmware update and replacement hardware), has passed six days of acceptance testing and will be handed back tomorrow.
Notable Changes made this last fortnight
- Castor 2.1.11-9 update completed (Wed 13th June).
- The Database behind the FTS and non-LHC VO's LFC ("Somnus") has been updated to Oracle 11. (13-15 June)
- The RAL site access router was replaced as part of a significant upgrade on Tues 19th June.
- Castor Oracle 11 update. (Wed 27th June).
- Moved the FTS database back to the 'Somnus' RAC (Wed 27th June).
- Drain and re-install of WMS02 for EMI installation WMSv3.3.5 (Thu 21 - Wed 27 June).
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- The FTS Agents are being progressively moved to virtual machines.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Upgrade to version 2.1.12.
- Networking:
- Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network. (Plan to co-locate with replacement of UKLight network).
- Update Spine layer for Tier1 network.
- Replacement of UKLight Router.
- Addition of caching DNSs into the Tier1 network.
- Grid Services:
- Updates of Grid Services as appropriate. (Services are now on EMI/UMD versions unless there is a specific reason not to be.)
- Infrastructure:
- The electricity supply company plans to work on the main site power supply for six months, commencing 18th June. This involves powering off one half of the resilient supply for three months while it is overhauled, then repeating the process with the other half.
Entries in GOC DB starting between 13th and 27th June 2012
There were four unscheduled outages during the fortnight: two were extensions to the outage of the (non-LHC-VO's) LFC during the Oracle 11 upgrade, one was the failure of a Castor (GEN) headnode, and the fourth was an extension to the downtime for the major site networking upgrade.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor (All SRM endpoints) and CEs | SCHEDULED | OUTAGE | 27/06/2012 08:45 | 27/06/2012 14:30 | 5 hours and 45 minutes | Storage (Castor) and Batch (CEs) unavailable. Oracle database behind Castor being moved to Oracle 11.
lcgfts.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 27/06/2012 07:45 | 27/06/2012 11:00 | 3 hours and 15 minutes | Service drained then unavailable while back-end (Oracle) database moved back to correct Oracle RAC.
lcgwms02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 21/06/2012 12:00 | 27/06/2012 13:00 | 6 days, 1 hour | EMI installation WMSv3.3.5.
Whole site | UNSCHEDULED | OUTAGE | 19/06/2012 11:00 | 19/06/2012 13:00 | 2 hours | Intervention on site network connection continuing.
Whole site | SCHEDULED | OUTAGE | 19/06/2012 08:00 | 19/06/2012 11:00 | 3 hours | Site unavailable while network access routers upgraded.
Castor GEN instance (srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k) | UNSCHEDULED | OUTAGE | 15/06/2012 15:30 | 15/06/2012 16:35 | 1 hour and 5 minutes | Hardware failure has caused a Castor outage. The faulty hardware is being replaced.
lfc.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 14/06/2012 17:00 | 15/06/2012 09:12 | 16 hours and 12 minutes | Further extension of outage of LFC used by ILC, T2K, MICE, MINOS, SNO following problems with Oracle 11 update.
lcgwms01.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 14/06/2012 12:00 | 19/06/2012 09:30 | 4 days, 21 hours and 30 minutes | Database maintenance and service re-configuration.
lfc.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 13/06/2012 17:00 | 14/06/2012 17:00 | 24 hours | Extension of outage of LFC used by ILC, T2K, MICE, MINOS, SNO following problems with Oracle 11 update.
All Castor (All SRM endpoints) and CEs | SCHEDULED | OUTAGE | 13/06/2012 08:00 | 13/06/2012 11:55 | 3 hours and 55 minutes | Castor and Batch (CEs) unavailable during Castor update to version 2.1.11-9.
lcgfts.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 13/06/2012 08:00 | 13/06/2012 17:00 | 9 hours | FTS and LFC services down for update of back-end database to Oracle 11.
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
83578 | Green | Urgent | Waiting Reply | 2012-06-26 | 2012-06-26 | MICE | Tape space on Castor for MICE reconstructed data
83564 | Green | Less Urgent | In Progress | 2012-06-25 | 2012-06-26 | MICE | Software area for MICE data reconstruction
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-06-25 | N/A | Retirement of SL4 and 32-bit DPM head nodes and servers