Tier1 Operations Report 2012-06-27

RAL Tier1 Operations Report for 27th June 2012

Review of Issues during the fortnight 13th to 27th June 2012
  • The update of the Oracle database behind the FTS and the non-LHC VOs' LFC on Wednesday 13th June ran into problems. The FTS service was restored during the afternoon, but the FTS database had to be re-created, which caused the queue of transfers to be lost. The LFC database was in the end successfully moved to an Oracle 11 database, but it was not available again until the morning of Friday 15th June. This issue has been the subject of a post mortem; the report can be seen at https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20120613_Oracle11_Update_Failure
  • On Friday (15th June) there was a hardware failure on one of the Castor headnodes for the GEN instance. The functionality was moved across to one of the other headnodes during the afternoon. A short outage was declared for the Castor GEN instance.
  • There was a problem with the routing of packets from the OPN to RAL: from 13th to 21st June inbound packets were routed over the production network, although outbound packets were routed correctly over the OPN link. This was triggered by a change made to complete the extension of the address range used for our disk servers. The mis-routing caused a problem with file transfers between RAL and FZK only. (This particular problem was worked around, by FZK, on the 18th.)
  • There were problems with CMS SUM tests and CMS FTS transfers over the weekend of 23/24 June. These were caused by two separate problems: one was resolved by restarting the SRMs, the other by changing a poor execution plan in the database (a sketch of this kind of plan fix is shown after this list).
  • There was a spate of SUM test failures for the CEs in the morning of Tuesday 26th June. These were caused by the recurrence of a problem seen between the CEs and the batch server, and were fixed by restarting torque/maui on the batch server (see the restart sketch below).
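
The report above does not record exactly how the poor execution plan behind the CMS problems was changed. Purely as a minimal sketch, one common approach on Oracle 11 is to load a known-good plan from the cursor cache into a SQL plan baseline; the connection details, sql_id and plan_hash_value below are hypothetical placeholders, not values from this incident.

  # Minimal sketch, assuming cx_Oracle and DBA privileges on the affected
  # Oracle 11 database. All identifiers below are hypothetical placeholders.
  import cx_Oracle

  conn = cx_Oracle.connect("dba_user", "password", "db-host/service_name")
  cur = conn.cursor()

  # Load the known-good plan for the problem statement from the cursor cache
  # into a SQL plan baseline, so the optimiser stops picking the poor plan.
  loaded = cur.callfunc(
      "DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE",
      int,
      keywordParameters={
          "sql_id": "abcd1234efghi",     # hypothetical sql_id
          "plan_hash_value": 123456789,  # hypothetical good plan
      },
  )
  print("SQL plan baselines loaded:", loaded)
  conn.close()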
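The torque/maui restart on the batch server is sketched below in outline only; the init-script service names and the follow-up check are assumptions rather than the exact commands used.

  # Minimal sketch, assuming root access on the batch server and init-style
  # services named "pbs_server" (Torque) and "maui" (both names assumed).
  import subprocess

  def restart(service):
      """Restart an init-style service, raising if the restart fails."""
      subprocess.check_call(["service", service, "restart"])

  if __name__ == "__main__":
      for svc in ("pbs_server", "maui"):
          restart(svc)
      # Simple check that the batch server answers scheduler queries again.
      subprocess.check_call(["qstat", "-B"])
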
Resolved Disk Server Issues
  • None
Current operational status and issues
  • On Tuesday 12th June the first stage of switching in preparation for the work on the main site power supply took place; this initial switching was due to be completed the following day (13th June). The work on the two transformers is expected to take until 18th December.
  • There is still a problem with the reporting of disk capacity; this is being followed up.
Ongoing Disk Server Issues
  • GDSS607 (LHCbDst - D1T0) has been drained and, following further problems, is still undergoing re-acceptance testing after the earlier failures. Comment from Kash: the machine finally appears to be working well (after a firmware update plus replacement hardware), has passed 6 days of acceptance testing and will be handed back tomorrow.
Notable Changes made this last fortnight
  • Castor 2.1.11-9 update completed (Wed 13th June).
  • The database behind the FTS and the non-LHC VOs' LFC ("Somnus") has been updated to Oracle 11 (13-15 June).
  • The RAL site access router was replaced as part of a significant upgrade on Tues 19th June.
Declared in the GOC DB
  • Castor Oracle 11 update. (Wed 27th June).
  • Move the FTS database back to the 'Somnus' RAC (Wed 27th June).
  • Drain and re-install of WMS02 for EMI installation WMSv3.3.5 (Thu 21 - Wed 27 June).
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • The FTS Agents are being progressively moved to virtual machines.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.12.
  • Networking:
    • Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network. (Plan to co-locate with replacement of UKLight network).
    • Update Spine layer for Tier1 network.
    • Replacement of UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Updates of Grid Services as appropriate. (Services are now on EMI/UMD versions unless there is a specific reason not to.)
  • Infrastructure:
    • The electricity supply company plan to work on the main site power supply for 6 months, commencing 18th June. This involves powering off one half of the resilient supply for 3 months while it is overhauled, and then repeating with the other half.


Entries in GOC DB starting between 13th and 27th June 2012

There were four unscheduled outages during the last fortnight: two were extensions to the outage of the (non-LHC VOs') LFC during the Oracle 11 upgrade, one was the failure of a Castor (GEN) headnode, and the fourth was an extension to the downtime for the major site networking upgrade.

Service Scheduled? Outage/At Risk Start End Duration Reason
All Castor (All SRM endpoints) and CEs. SCHEDULED OUTAGE 27/06/2012 08:45 27/06/2012 14:30 5 hours and 45 minutes Storage (Castor) and Batch (CEs) unavailable. Oracle database behind Castor being moved to Oracle 11.
lcgfts.gridpp.rl.ac.uk SCHEDULED OUTAGE 27/06/2012 07:45 27/06/2012 11:00 3 hours and 15 minutes Service drained then unavailable while back end (Oracle) database moved back to correct Oracle RAC.
lcgwms02.gridpp.rl.ac.uk SCHEDULED OUTAGE 21/06/2012 12:00 27/06/2012 13:00 6 days, 1 hour EMI installation WMSv3.3.5
Whole Site UNSCHEDULED OUTAGE 19/06/2012 11:00 19/06/2012 13:00 2 hours Intervention on site network connection continuing.
Whole site SCHEDULED OUTAGE 19/06/2012 08:00 19/06/2012 11:00 3 hours Site unavailable while network access routers upgraded.
Castor GEN instance (srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k) UNSCHEDULED OUTAGE 15/06/2012 15:30 15/06/2012 16:35 1 hour and 5 minutes Hardware failure has caused a Castor outage. The faulty hardware is being replaced.
lfc.gridpp.rl.ac.uk UNSCHEDULED OUTAGE 14/06/2012 17:00 15/06/2012 09:12 16 hours and 12 minutes Further extension of outage of LFC used by ILC, T2K, MICE, MINOS, SNO following problems with Oracle 11 update.
lcgwms01.gridpp.rl.ac.uk SCHEDULED OUTAGE 14/06/2012 12:00 19/06/2012 09:30 4 days, 21 hours and 30 minutes Database maintenance and service re-configuration.
lfc.gridpp.rl.ac.uk UNSCHEDULED OUTAGE 13/06/2012 17:00 14/06/2012 17:00 24 hours Extension of outage of LFC used by ILC, T2K, MICE, MINOS, SNO following problems with Oracle 11 update.
All Castor (All SRM endpoints) and CEs. SCHEDULED OUTAGE 13/06/2012 08:00 13/06/2012 11:55 3 hours and 55 minutes Castor and Batch (CEs) unavailable during Castor update to version 2.1.11-9
lcgfts.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk SCHEDULED OUTAGE 13/06/2012 08:00 13/06/2012 17:00 9 hours FTS and LFC services down for update of back-end database to Oracle 11.
Open GGUS Tickets
GGUS ID Level Urgency State Creation Last Update VO Subject
83578 Green Urgent Waiting Reply 2012-06-26 2012-06-26 MICE Tape space on Castor for mice reconstructed data
83564 Green Less Urgent In Progress 2012-06-25 2012-06-26 MICE Software area for MICE data reconstruction
68853 Red Less Urgent On hold 2011-03-22 2012-06-25 N/A Retirement of SL4 and 32bit DPM Head nodes and Servers