Tier1 Operations Report 2009-10-07

RAL Tier1 Operations Report for 7th October 2009.

This is a review of issues since the last meeting on 30th September.

Current operational status and issues.

  • Major problems with the hardware underneath both the Castor Oracle databases and the databases used to support the LFC, FTS and 3D services. All of these services are currently unavailable. This is a drastic extension of a problem that started on 10th September but had been contained to just the mirrored set of disks under the Castor databases. On Sunday 4th October the primary disk subsystem for the Castor databases failed; as the mirrored set was not in place owing to the ongoing failures, this halted the Castor services completely. Then, around midday on Tuesday 6th October, both the disk subsystems (primary and mirror) hosting the LFC, FTS and 3D databases failed within a few minutes of each other. We are currently restoring most of these databases to alternative hardware so as to bring services back while we investigate the problems with the newer disk systems. (A simple endpoint reachability probe is sketched after this list.)
  • Problem seen for Atlas where reads from a particular service class are triggering Disk-to-Disk copies before delivering the data. This has resulted in a significantly increased number of timeouts.
  • Problem for LHCb with contention for access to files in a particular service class.
  • One of the nodes in the Oracle RAC that supports Atlas and LHCb has a problem and keeps rebooting. This node forms part of the voting system within the RAC and it will be necessary to have a downtime to re-configure a different node in its place.
  • The patched version of the SRM (2.8-1) has been installed for srm-atlas. Installation on the other SRMs is awaited (delayed by the database hardware problems).
  • Swine ’Flu. As previously reported: we continue to track this and ensure preparations are in place should a significant portion of our staff be away or have to work offsite.
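
With several front-end services down at once, a quick way to track recovery is to probe each public endpoint for TCP reachability. The sketch below is illustrative only: apart from srm-atlas and lugh, which are named in this report, the hostnames and ports are assumptions rather than the actual RAL service configuration.

    import socket

    # Hypothetical endpoint list: hostnames and ports here are illustrative
    # assumptions, not the recorded RAL configuration.
    ENDPOINTS = {
        "srm-atlas.gridpp.rl.ac.uk": 8443,  # SRM front end (assumed port)
        "lfc.gridpp.rl.ac.uk": 5010,        # LFC (assumed host and port)
        "lcgfts.gridpp.rl.ac.uk": 8443,     # FTS web service (assumed)
    }

    def is_reachable(host, port, timeout=5.0):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    for host, port in ENDPOINTS.items():
        state = "UP" if is_reachable(host, port) else "DOWN"
        print(f"{host}:{port} {state}")

A probe like this only confirms that something is listening; it says nothing about whether the Oracle back end behind the service is healthy.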

Review of issues during the week 30th September to 7th October.

  • Some problems seen after the Castor CIP upgrade. These were finally resolved shortly before the meeting on the 30th. Noted here for completeness.
  • Thursday 1st October: Problem for LHCb with contention for access to files in a particular service class. Resolved by Castor team.
  • Friday 2nd October: BaBar disk server gdss165 suffered two consecutive disk failures. A disk failed in this server on 1st October; the rebuild did not start automatically, but after a manual intervention the array rebuilt OK. There was another disk failure in the same system on the morning of the 2nd. The array was again rebuilt, but the server was taken out of production while this took place. The system was put back into production by the end of Friday afternoon.
  • There was a batch problem over the weekend with a log file on the PBS server exceeding its maximum size. This was resolved following a callout. A more definitive fix has been to upgrade to a 64-bit version of the PBS server; this has now been done, taking advantage of the downtime caused by the other problems. (A sketch of a log-size watchdog that could pre-empt this kind of failure follows this list.)
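
The likely failure mode here is a daemon hitting the 2 GiB file-size ceiling of a 32-bit build, which would explain why moving to a 64-bit PBS server is the definitive fix; that reading is an inference, not stated in the callout record. A minimal watchdog sketch along those lines is below. The log path is a hypothetical example, not the actual location on the RAL PBS server.

    import gzip
    import os
    import shutil

    LOG_PATH = "/var/spool/pbs/server_logs/current"  # hypothetical path
    LIMIT = 2 * 1024 ** 3          # 2 GiB ceiling for a 32-bit build
    THRESHOLD = int(LIMIT * 0.9)   # act well before the hard limit

    def rotate_if_large(path, threshold=THRESHOLD):
        """Compress the log and truncate it in place once it nears the limit."""
        if os.path.getsize(path) < threshold:
            return False
        with open(path, "rb") as src, gzip.open(path + ".1.gz", "wb") as dst:
            shutil.copyfileobj(src, dst)
        # Truncate in place rather than unlinking, so a daemon appending to
        # the same open file descriptor carries on writing from offset zero.
        with open(path, "r+b") as handle:
            handle.truncate(0)
        return True

    if __name__ == "__main__":
        if rotate_if_large(LOG_PATH):
            print(f"rotated {LOG_PATH}")

Run from cron, this would keep the log below the limit; in practice the 64-bit upgrade removes the ceiling altogether.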

Advance warning:

There are no scheduled outages declared in the GOC DB. However, we will need to reschedule the installation of the updated SRM for the CMS, LHCb and GEN Castor instances.

Table showing entries in GOC DB starting between 30th September and 7th October.

Service Scheduled? Outage/At Risk Start End Duration Reason
All Castor and CEs UNSCHEDULED OUTAGE 07/10/2009 14:00 08/10/2009 12:00 22 hours Following the failure of the disk systems that host the Oracle databases behind Castor we are having to restore the databases from backup. We will run on alternative hardware temporarily while the current hardware problems are understood and the system re-certified.
FTS, FTM, LFCs, 3D (lugh, ogma). UNSCHEDULED OUTAGE 06/10/2009 18:00 07/10/2009 14:00 20 hours Following the failure of the disk systems that host the Oracle databases behind these services we are having to restore some of the databases from backup and will migrate to alternative hardware while the underlying problems are resolved.
FTS, FTM, LFCs, 3D (lugh, ogma). UNSCHEDULED OUTAGE 06/10/2009 14:03 06/10/2009 18:00 3 hours and 57 minutes Outage while the problem on the Oracle systems behind the LFC and FTS systems is being investigated.
All Castor and CEs UNSCHEDULED OUTAGE 06/10/2009 14:00 07/10/2009 14:00 24 hours Outage to investigate the ongoing problems with the hardware behind the Castor Oracle database.
FTS, FTM, LFCs. UNSCHEDULED OUTAGE 06/10/2009 12:25 06/10/2009 14:30 2 hours and 5 minutes We have just had a problem on the Oracle systems behind the LFC and FTS services. This is being looked at.
srm-atlas, srm-lhcb, CE06, CE08 SCHEDULED OUTAGE 06/10/2009 10:00 06/10/2009 12:00 2 hours Outage for reconfiguration of the Oracle RAC behind the Atlas and LHCb Castor instances. This is to remove a faulty node within the RAC.
FTS SCHEDULED AT_RISK 06/10/2009 09:00 06/10/2009 12:00 3 hours At Risk while channels to RAL for Atlas and LHCb are drained ahead of the Castor intervention. Other channels will be unaffected.
All Castor and CEs UNSCHEDULED OUTAGE 05/10/2009 14:00 06/10/2009 14:00 24 hours Ongoing problems with the hardware that underlies the Oracle databases behind Castor are being investigated. The cause of the multiple failures is not yet understood and we are announcing an extended downtime as these investigations continue.
Castor CMS, LHCb and GEN instances plus CEs. SCHEDULED AT_RISK 05/10/2009 10:00 05/10/2009 10:30 30 minutes At Risk to upgrade the Castor SRMs.
lugh.gridpp.rl.ac.uk SCHEDULED OUTAGE 05/10/2009 09:00 05/10/2009 12:00 3 hours LHCb 3D database (LUGH) migration to 64-bit system.
lcgce06, lcgce07, lcgce08 UNSCHEDULED OUTAGE 04/10/2009 15:47 05/10/2009 15:47 24 hours Castor Databases down
lcgce01 UNSCHEDULED OUTAGE 04/10/2009 15:47 05/10/2009 15:47 24 hours Castor Databases down
All Castor and CEs UNSCHEDULED OUTAGE 04/10/2009 14:16 05/10/2009 14:00 23 hours and 44 minutes Castor down due to unknown problems with Oracle DBs. This is under investigation right now.
srm-atlas, CE06, CE08 UNSCHEDULED AT_RISK 01/10/2009 13:30 01/10/2009 14:00 30 minutes At Risk while the Castor SRM (Atlas instance) is upgraded to the latest version.
LFCs, FTS, Atlas 3D (ogma) SCHEDULED AT_RISK 01/10/2009 13:30 01/10/2009 16:30 3 hours At Risk during updating of back end systems (Oracle ASM patches and OS kernel updates).
Whole Site UNSCHEDULED AT_RISK 01/10/2009 09:00 01/10/2009 12:00 3 hours At Risk while (and after) network link to OPN doubled to increase bandwidth.
All CEs UNSCHEDULED AT_RISK 29/09/2009 16:01 30/09/2009 14:00 21 hours and 59 minutes CEs At Risk as we are failing OPS VO SAM tests, but we believe the CEs are working OK for other VOs. These failures follow modifications to the Castor Information Provider, which publishes storage information, and are under investigation.
All Castor, CEs and FTS. UNSCHEDULED AT_RISK 29/09/2009 14:48 29/09/2009 16:00 1 hour and 12 minutes We are seeing some problems on OPS VO SAM tests following the upgrade of the Castor Information Provider that publishes information on castor. Declared an At Risk while this is investigated.
All Castor SCHEDULED AT_RISK 29/09/2009 12:00 29/09/2009 14:00 2 hours Castor at risk while CIP system is upgraded.
lcgwms02 SCHEDULED OUTAGE 22/09/2009 16:00 30/09/2009 10:03 7 days, 18 hours and 3 minutes Outage for making the disks on the host hotswappable. This includes time for draining beforehand and subsequent re-installation.
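
For completeness, the Duration column above follows directly from the Start and End timestamps. A minimal sketch of the calculation, checked against the first row of the table:

    from datetime import datetime

    FMT = "%d/%m/%Y %H:%M"  # timestamp format used in the table above

    def duration(start, end):
        """Return an outage length as (hours, minutes)."""
        delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
        hours, seconds = divmod(int(delta.total_seconds()), 3600)
        return hours, seconds // 60

    # First row: 07/10/2009 14:00 to 08/10/2009 12:00 -> 22 hours
    hours, minutes = duration("07/10/2009 14:00", "08/10/2009 12:00")
    print(f"{hours} hours" + (f" and {minutes} minutes" if minutes else ""))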