Tier1 Operations Report 2009-10-14

From GridPP Wiki
Revision as of 10:29, 14 October 2009 by Gareth smith (Talk | contribs)

RAL Tier1 Operations Report for 14th October 2009.

This is a review of issues since the last meeting on 7th October.

Current operational status and issues.

  • There were major problems with the hardware underneath both the Castor Oracle databases and the Oracle databases that support the LFC, FTS and 3D services. All these services were unavailable for at least part of the week. The problems were caused by multiple failures in the disk systems that host the Oracle databases behind these services. We are currently running with the databases hosted on alternative hardware while the fault is investigated. Our current conclusion is that the fault is environmental, and work so far points to an electrical cause.
    • LFC and FTS were unavailable from lunchtime Tuesday (6th) to late afternoon Wednesday (7th).
    • Castor was unavailable from Sunday 4th to the end of Friday afternoon (9th).
    • 3D databases (including lhcb-lfc) were unavailable from lunchtime Tuesday (6th) to early afternoon Monday (12th).
  • The restore of the Castor databases has introduced a problem. It appears that the Castor databases were restored to a point early on the 24th September, and all files added to Castor between that date and the failure on the 4th October may be lost.
  • The patched version of the SRM (2.8-1) has been installed for srm-atlas. Installation for the other SRMs is still awaited (delayed by the database hardware problems).
  • Swine ‘Flu. As previously reported: We continue to track this and ensure preparations are in place should a significant portion of our staff be away or have to work offsite.

Review of Issues during week 7th to 14th October.

  • The main point here was the outages referred to in the 'Current operational status' section above.
  • There was a problem late yesterday afternoon (13th October) and last night. At around 16:00 the Tier1 started failing SRM SAM tests. Investigations showed high CPU usage by some processes in the Oracle RAC. This was traced to a high load causing an out-of-memory condition, which stopped Oracle and then resulted in a node reboot. This was followed during the evening and night by problems on the CEs and batch system as well as srm-atlas. A DNS problem had occurred, and evidence suggests this was the underlying cause.

Advance warning:

There are no scheduled outages declared in the GOC DB. However, we will need to reschedule the installation of the updated SRM for the CMS, LHCb and GEN Castor instances.

Table showing entries in GOC DB starting between 7th and 14th October.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-atlas, lcgce06 | UNSCHEDULED | OUTAGE | 13/10/2009 19:47 | 13/10/2009 23:05 | 3 hours and 18 minutes | Castor Atlas down due to local networking problems.
lcgce07 | UNSCHEDULED | AT_RISK | 13/10/2009 10:00 | 13/10/2009 12:00 | 2 hours | At risk to swap a broken hard disk
lhcb-lfc, lugh, ogma | UNSCHEDULED | OUTAGE | 09/10/2009 18:00 | 12/10/2009 14:45 | 2 days, 20 hours and 45 minutes | Extending downtime as the work to restore the services is still ongoing
All Castor & CEs | UNSCHEDULED | OUTAGE | 09/10/2009 12:00 | 09/10/2009 17:00 | 5 hours | Work is progressing restoring the Castor databases with the plan of restarting services tomorrow. The completion of this process, including a verification of Castor systems, is now estimated to be completed by the end of the afternoon.
lhcb-lfc, lugh, ogma | UNSCHEDULED | OUTAGE | 08/10/2009 16:00 | 09/10/2009 18:00 | 1 day, 2 hours | Extending downtime as the work to restore the services is still ongoing
All Castor & CEs | UNSCHEDULED | OUTAGE | 08/10/2009 12:00 | 09/10/2009 12:00 | 24 hours | Following hardware issues with the systems that host the Oracle databases behind Castor we are migrating the data to alternative hardware. Some issues have been encountered with restoring the databases ahead of the migration. We are therefore extending the Castor downtime.
lhcb-lfc | UNSCHEDULED | OUTAGE | 07/10/2009 17:30 | 08/10/2009 17:00 | 23 hours and 30 minutes | The LHCb 3D database (including lhcb-lfc) is being moved to an alternative system while investigations are ongoing into the hardware failures that have been encountered.
All Castor & CEs | UNSCHEDULED | OUTAGE | 07/10/2009 14:00 | 08/10/2009 12:00 | 22 hours | Following the failure of the disk systems that host the Oracle databases behind Castor we are having to restore the databases from backup. We will run on alternative hardware temporarily while the current hardware problems are understood and the system re-certified.
lcgftm, lcgfts, lfc-atlas, lfc, lhcb-lfc | UNSCHEDULED | OUTAGE | 07/10/2009 14:00 | 07/10/2009 16:00 | 2 hours | The restoration of the databases behind the LFC and FTS to alternative hardware is taking place but a little behind the schedule previously announced. This is a 2 hour extension to the outages for these services.
lugh, ogma | UNSCHEDULED | OUTAGE | 07/10/2009 14:00 | 08/10/2009 17:00 | 1 day, 3 hours | The hardware that hosts the 3D databases has become too unstable. These databases are being migrated to alternative hardware before resuming the service.
lcgftm, lcgfts, lfc-atlas, lfc, lhcb-lfc, lugh, ogma | UNSCHEDULED | OUTAGE | 06/10/2009 18:00 | 07/10/2009 16:00 | 22 hours | Following the failure of the disk systems that host the Oracle databases behind these services we are having to restore some of the databases from backup and will migrate to alternative hardware while the underlying problems are resolved.
lcgftm, lcgfts, lfc-atlas, lfc, lhcb-lfc, lugh, ogma | UNSCHEDULED | OUTAGE | 06/10/2009 14:03 | 06/10/2009 18:00 | 3 hours and 57 minutes | Outage while the problem on the Oracle systems behind the LFC and FTS systems is being investigated.
All Castor & CEs | UNSCHEDULED | OUTAGE | 06/10/2009 14:00 | 07/10/2009 14:00 | 24 hours | Outage to investigate the ongoing problems with the hardware behind the Castor Oracle database.
ftm, lcgfts, lfc-atlas, lfc, lhcb-lfc | UNSCHEDULED | OUTAGE | 06/10/2009 12:25 | 06/10/2009 14:30 | 2 hours and 5 minutes | We have just had a problem on the Oracle systems behind the LFC and FTS systems. Being looked at.
lcgce06, lcgce08, srm-atlas, srm-lhcb | SCHEDULED | OUTAGE | 06/10/2009 10:00 | 06/10/2009 12:00 | 2 hours | Outage for reconfiguration of the Oracle RAC behind the Atlas and LHCb Castor instances. This is to remove a faulty node within the RAC.
lcgfts.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 06/10/2009 09:00 | 06/10/2009 12:00 | 3 hours | At Risk while channels for RAL for Atlas and LHCb are drained ahead of the Castor intervention. Other channels will be unaffected.
All Castor & CEs | UNSCHEDULED | OUTAGE | 05/10/2009 14:00 | 06/10/2009 14:00 | 24 hours | Ongoing problems with the hardware that underlies the Oracle databases behind Castor are being investigated. The cause for multiple failures is not yet understood and we are announcing an extended downtime as these investigations continue.
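
The Duration column in the table above can be cross-checked from the Start and End columns, and back-to-back entries for the same service can be combined into one continuous outage window. The following is a minimal sketch only, not a GOC DB tool: the helper names (parse, duration, merge) are hypothetical, and the timestamp layout is assumed to be day/month/year as shown in the table. The intervals are copied from the "All Castor & CEs" rows.

```python
from datetime import datetime, timedelta

FMT = "%d/%m/%Y %H:%M"  # assumed layout, matching the Start/End columns


def parse(ts):
    """Turn a 'dd/mm/yyyy HH:MM' string into a datetime."""
    return datetime.strptime(ts, FMT)


def duration(start, end):
    """Length of a single downtime entry."""
    return parse(end) - parse(start)


def merge(intervals):
    """Merge overlapping or back-to-back (start, end) pairs."""
    merged = []
    for start, end in sorted((parse(s), parse(e)) for s, e in intervals):
        if merged and start <= merged[-1][1]:
            # Continues the previous window: extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


# "All Castor & CEs" rows, in the order they appear in the table.
castor = [
    ("09/10/2009 12:00", "09/10/2009 17:00"),
    ("08/10/2009 12:00", "09/10/2009 12:00"),
    ("07/10/2009 14:00", "08/10/2009 12:00"),
    ("06/10/2009 14:00", "07/10/2009 14:00"),
    ("05/10/2009 14:00", "06/10/2009 14:00"),
]

# srm-atlas / lcgce06 entry on 13th October.
print(duration("13/10/2009 19:47", "13/10/2009 23:05"))  # 3:18:00

spans = merge(castor)
total = sum((end - start for start, end in spans), timedelta())
print(len(spans), total)  # 1 4 days, 3:00:00
```

The five Castor entries join into a single declared window from 05/10/2009 14:00 to 09/10/2009 17:00. This covers only the downtimes listed in the table, so it does not include the period from Sunday 4th mentioned in the status section.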