RAL Tier1 Operations Report for 27th June 2012
Review of Issues during the fortnight 13th to 27th June 2012
- The update of the Oracle database behind the FTS and (non-LHC VO's) LFC on Wednesday 13th June ran into problems. The FTS service was restored during the afternoon but the FTS database had to be re-created which caused the queue of transfers to be lost. The LFC database was in the end successfully moved to an Oracle 11 database but it was not available again until the morning of Friday 15th June. This issue has been the subject of a post mortem - the report can be seen at https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20120613_Oracle11_Update_Failure
- On Friday (15th June) there was a hardware failure on one of the Castor headnodes for the GEN instance. The functionality was moved across to one of the other headnodes during the afternoon. A short outage was declared for the Castor GEN instance.
- There was a problem with the routing of inbound packets from the OPN to RAL: from 13th to 21st June these were routed over the production network, although outbound packets were routed correctly over the OPN link. This was triggered by a change made to complete the extension of the address range used for our disk servers. It caused a problem with file transfers between RAL and FZK only; FZK put a workaround in place for this particular problem on the 18th.
- There were problems with CMS SUM tests and CMS FTS transfers over the weekend of 23/24 June. These were caused by two problems - one resolved by restarting the SRMs, the other by changing a poor execution plan in the database.
- There was a spate of SUM test failures for the CEs in the morning of Tuesday 26th June. These were caused by the recurrence of a problem seen before between the CEs and the batch server, and were fixed by restarting torque/maui on the batch server.
Resolved Disk Server Issues
Current operational status and issues
- On Tuesday 12th June the first stage of electrical switching in preparation for the work on the main site power supply took place. This initial switching was expected to be completed on 13th June. The work on the two transformers is expected to take until 18th December.
- There is still a problem with the reporting of disk capacity to be followed up.
Ongoing Disk Server Issues
- GDSS607 (LHCbDst - D1T0) has been drained and, following further problems, is still undergoing re-acceptance testing after its earlier failures. Comment from Kash: the machine appears to be finally working well (after a firmware update and replacement hardware), has passed six days of acceptance testing and will be handed back tomorrow.
Notable Changes made this last fortnight
- Castor 2.1.11-9 update completed (Wed 13th June).
- The Database behind the FTS and non-LHC VO's LFC ("Somnus") has been updated to Oracle 11. (13-15 June)
- The RAL site access router was replaced as part of a significant upgrade on Tues 19th June.
- Castor Oracle 11 update. (Wed 27th June).
- Moved the FTS database back to the 'Somnus' RAC (Wed 27th June).
- Drain and re-install of WMS02 for EMI installation WMSv3.3.5 (Thu 21 - Wed 27 June).
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- The FTS Agents are being progressively moved to virtual machines.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Upgrade to version 2.1.12.
- Networking:
- Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network. (Plan to co-locate with replacement of UKLight network).
- Update Spine layer for Tier1 network.
- Replacement of UKLight Router.
- Addition of caching DNSs into the Tier1 network.
- Grid Services:
- Updates of Grid Services as appropriate. (Services are now on EMI/UMD versions unless there is a specific reason not to be.)
- Infrastructure:
- The electricity supply company plans to work on the main site power supply for six months, commencing 18th June. This involves powering off one half of the resilient supply for three months while it is overhauled, then repeating the process with the other half.
Entries in GOC DB starting between 13th and 27th June 2012
There were four unscheduled outages during the fortnight: two were extensions to the outage of the (non-LHC-VO's) LFC during the Oracle 11 upgrade, one was the failure of a Castor (GEN) headnode, and the fourth was an extension to the downtime for the major site networking upgrade.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor (All SRM endpoints) and CEs | SCHEDULED | OUTAGE | 27/06/2012 08:45 | 27/06/2012 14:30 | 5 hours and 45 minutes | Storage (Castor) and Batch (CEs) unavailable. Oracle database behind Castor being moved to Oracle 11.
lcgfts.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 27/06/2012 07:45 | 27/06/2012 11:00 | 3 hours and 15 minutes | Service drained then unavailable while back-end (Oracle) database moved back to correct Oracle RAC.
lcgwms02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 21/06/2012 12:00 | 27/06/2012 13:00 | 6 days, 1 hour | EMI installation WMSv3.3.5.
Whole site | UNSCHEDULED | OUTAGE | 19/06/2012 11:00 | 19/06/2012 13:00 | 2 hours | Intervention on site network connection continuing.
Whole site | SCHEDULED | OUTAGE | 19/06/2012 08:00 | 19/06/2012 11:00 | 3 hours | Site unavailable while network access routers upgraded.
Castor GEN instance (srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k) | UNSCHEDULED | OUTAGE | 15/06/2012 15:30 | 15/06/2012 16:35 | 1 hour and 5 minutes | Hardware failure has caused a Castor outage. The faulty hardware is being replaced.
lfc.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 14/06/2012 17:00 | 15/06/2012 09:12 | 16 hours and 12 minutes | Further extension of outage of LFC used by ILC, T2K, MICE, MINOS, SNO following problems with Oracle 11 update.
lcgwms01.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 14/06/2012 12:00 | 19/06/2012 09:30 | 4 days, 21 hours and 30 minutes | Database maintenance and service re-configuration.
lfc.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 13/06/2012 17:00 | 14/06/2012 17:00 | 24 hours | Extension of outage of LFC used by ILC, T2K, MICE, MINOS, SNO following problems with Oracle 11 update.
All Castor (All SRM endpoints) and CEs | SCHEDULED | OUTAGE | 13/06/2012 08:00 | 13/06/2012 11:55 | 3 hours and 55 minutes | Castor and Batch (CEs) unavailable during Castor update to version 2.1.11-9.
lcgfts.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 13/06/2012 08:00 | 13/06/2012 17:00 | 9 hours | FTS and LFC services down for update of back-end database to Oracle 11.
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
83578 | Green | Urgent | Waiting Reply | 2012-06-26 | 2012-06-26 | MICE | Tape space on Castor for MICE reconstructed data
83564 | Green | Less Urgent | In Progress | 2012-06-25 | 2012-06-26 | MICE | Software area for MICE data reconstruction
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-06-25 | N/A | Retirement of SL4 and 32-bit DPM head nodes and servers