Tier1 Operations Report 2011-10-26

RAL Tier1 Operations Report for 26th October 2011

Review of Issues during the week 19th to 26th October 2011.

  • On Wednesday morning (19th) there was a hang of one of the Oracle RAC nodes in the database behind the LFC/FTS & 3D services. Apart from a few minutes' outage on the LFC (during the failover), there was an interruption to the FTS service. Once the database issue had been resolved the FTS front ends and agents required some re-configuration, and an FTS outage was declared for a couple of hours.
  • From the end of Wednesday through to Friday, network tests were run from both the Tier1 and the RAL Tier2 to Imperial College. The load that the RAL Tier2 traffic placed on the RAL network link through the firewall led to some packet loss on this route, compounded by other (non-PP) high usage around and after this time.
  • On Thursday afternoon (20th) the CMS Castor instance was unavailable for an hour or so. It looks like a recurrence of the old Castor "JobManager" hang (not seen for some months).
  • Problems with Castor, or rather with the database infrastructure behind Castor, over the weekend (Sat & Sun 22/23 Oct):
    • At around 4am on Saturday morning three (out of five) nodes in one of the Oracle RACs that host the Castor databases rebooted. There was a few hours' downtime for the Atlas & CMS Castor instances.
    • Later on Saturday, at around 11am, two nodes in the other Oracle RAC crashed (and did not reboot). Castor services carried on seamlessly. However, at around 18:00 that day a third node in that RAC crashed. With only one node remaining, the decision was made to stop the CMS & GEN Castor instances.
    • Overnight Sat/Sun, at around 04:15, another node in the first cluster crashed. We took the remaining (Atlas, LHCb) Castor instances down.
    • Services were restored at around 20:30 on Sunday following a restart of the affected nodes, the commenting out of NFS-mounted disk areas that had become corrupt and were blocking the reboots, and a reconfiguration of the database backups to compensate for this change.
    • Since Sunday we have been gradually (cautiously) opening up the limits on FTS & batch, with both fully restored by lunchtime yesterday (Tuesday 25th).
    • Summary so far: The problems were caused by instabilities in the Oracle database infrastructure behind Castor. The Castor databases are divided across two Oracle RACs, and both RACs suffered nodes crashing and, in some cases, failing to reboot. The failures of nodes to reboot were caused by corrupt areas on a disk array used to stage backups. Investigations into the root cause are ongoing and a Post Mortem is being produced.

Resolved Disk Server Issues

  • On Thursday (20th) a problem was found on three disk servers in CMSFarmRead (D0T1): gdss303, 304 and 305. Files were being garbage collected very quickly after being staged onto these servers. The problem was traced to these servers containing a large amount of dark data, i.e. files that had not been removed from the servers before they were moved into this service class (an illustrative sketch of this kind of check follows below). The servers were drained and removed from production on the Friday (21st), cleaned up, and returned to production yesterday (25th).
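
As background, a minimal illustration of the kind of check involved is sketched below: the files physically present on a server's data partition are compared against a dump of the files the catalogue expects to be there, and anything on disk but not in the catalogue is "dark". The paths, the dump file and its format are assumptions for illustration only, not the procedure actually used on gdss303-305.

    # Sketch: list files present on a disk server but absent from a catalogue dump,
    # i.e. candidate "dark data". Paths and dump format are illustrative assumptions.
    import os

    CATALOGUE_DUMP = "catalogue_files.txt"   # hypothetical: one expected file name per line
    DATA_PARTITION = "/exportstage/data"     # hypothetical data mount point on the server

    def load_catalogue(path):
        """Return the set of file names the catalogue believes live on this server."""
        with open(path) as fh:
            return {line.strip() for line in fh if line.strip()}

    def find_dark_files(root, known):
        """Yield files found on disk that are missing from the catalogue set."""
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                if name not in known:
                    yield os.path.join(dirpath, name)

    if __name__ == "__main__":
        known = load_catalogue(CATALOGUE_DUMP)
        for path in find_dark_files(DATA_PARTITION, known):
            print("dark:", path)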

Current operational status and issues.

  • Atlas report slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). CMS seems to experience this as well (but between RAL and foreign T2s). The pattern of asymmetrical flows appears complex and is being actively investigated. During the last week an update to the TCP settings on a range of our disk servers has been made as part of these investigations.
  • We continue to work with the Perfsonar network test system, including understanding some anomalies that have been seen.

Ongoing Disk Server Issues

  • Thursday 13th Sep: FSPROBE reported a problem on gdss396 (CMSWanIn). This resulted in the loss of a single CMS file (since copied in from elsewhere). The server has crashed under test and this is being followed up.
  • gdss296 (CMSFarmRead) has been out of production since 20th August. This server has also crashed under test and this is being followed up.
  • gdss456 (AtlasDataDisk) failed with a read-only file system on Wednesday 28th September. Following draining, investigations into this system are ongoing.
  • On Thursday (29th Sep) FSPROBE reported a problem on gdss295 (CMSFarmRead). This server has been put into test and investigations are ongoing.
  • We are seeing a high number of 'SMART' errors reported by a particular batch of disk servers. Most of these are spurious and are resolved by an updated version of the disk controller firmware. This update has been applied to the D0T1 disk servers; we will run it for a week or so before rolling it out to the affected D1T0 disk servers (a simple SMART health check is sketched after this list).
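
For context, one quick way to separate drives that genuinely fail their SMART overall health assessment from those raising spurious attribute warnings is to poll smartctl across the drives; a minimal sketch is given below. The device list is an assumption, and since these servers use a hardware disk controller a real check would likely need controller-specific smartctl options, which are omitted here.

    # Sketch: report the overall SMART health line for a set of drives using smartctl.
    # Device names are hypothetical; drives behind a RAID controller need extra options.
    import subprocess

    DEVICES = ["/dev/sda", "/dev/sdb"]  # hypothetical device list

    def smart_health(device):
        """Return smartctl's overall-health line for a device, or its raw output."""
        result = subprocess.run(
            ["smartctl", "-H", device],
            capture_output=True, text=True, check=False,
        )
        for line in result.stdout.splitlines():
            if "overall-health" in line or "Health Status" in line:
                return line.strip()
        return (result.stdout or result.stderr).strip()

    if __name__ == "__main__":
        for dev in DEVICES:
            print(dev, "->", smart_health(dev))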

Notable Changes made this last week

  • Updated disk controller firmware in one batch of servers that are in D0T1 service classes.
  • Update to TCP sysctl settings to tune WAN transfers on approximately half of the servers in the '08 and '09 generations (see the sketch after this list).
  • Drain and maintenance work on WMS02.
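
The TCP sysctl change above is, in outline, the usual WAN tuning of the kernel's socket buffer limits so that TCP can keep a high bandwidth-delay-product path full. A minimal sketch of checking such settings is given below; the target values are illustrative assumptions, not the parameters actually deployed on these disk servers.

    # Sketch: compare a Linux host's current TCP buffer sysctls against illustrative
    # WAN-tuning targets. Target values are assumptions, not the deployed settings.
    TARGETS = {
        "net.core.rmem_max": "16777216",             # max receive socket buffer (bytes)
        "net.core.wmem_max": "16777216",             # max send socket buffer (bytes)
        "net.ipv4.tcp_rmem": "4096 87380 16777216",  # min/default/max TCP receive buffer
        "net.ipv4.tcp_wmem": "4096 65536 16777216",  # min/default/max TCP send buffer
    }

    def current_value(key):
        """Read a sysctl value from /proc/sys (Linux only)."""
        path = "/proc/sys/" + key.replace(".", "/")
        with open(path) as fh:
            return " ".join(fh.read().split())

    if __name__ == "__main__":
        for key, target in TARGETS.items():
            value = current_value(key)
            status = "ok" if value == target else "differs"
            print(f"{key}: current={value} target={target} [{status}]")

Making such a change persistent would normally be done via /etc/sysctl.conf (and sysctl -p) rather than a one-off script.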

Forthcoming Work & Interventions

  • Tuesday 1st November (TBC). Microcode updates for the tape libraries. No tape access from 09:00 to 13:00. (Delayed from Tuesday 18th Oct.)
  • Tuesday 1st November: There will be an intervention on the network link into RAL that will last up to an hour. This should not (in theory) affect our link. We plan to declare a site "Warning" and will drain out the FTS and pause batch work.
  • Proposed from Friday (28th): retiring CE06, our last non-CREAM CE.
  • Re-configuring CREAM CEs so that each VO has access to three of them.
  • Update to CIP to fix over-reporting of tape capacity.

Declared in the GOC DB

  • None

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced:

  • Merge Atlas tape backed diskpools.
  • Update the disk controller firmware on D1T0 nodes in the batch of servers reporting spurious SMART errors.
  • There are also plans to move part of the cooling system onto the UPS supply; this may require a complete outage (including systems on the UPS).
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure. This will only proceed once the problem that caused the cancellation of the first stage of this work last week is understood and fixed.
  • Networking change required to extend range of addresses that route over the OPN.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Replace hardware running Castor Head Nodes (aimed for end of year).

Entries in GOC DB starting between 19th and 26th October 2011.

There were three unscheduled Outages and one unscheduled Warning in this last week.

  • One unscheduled Outage was for the FTS service following the hang of one of the Oracle database nodes behind that service.
  • Two Outages and one Warning were for the Castor database infrastructure problems over the weekend.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor (all SRM end points) | UNSCHEDULED | OUTAGE | 23/10/2011 06:00 | 23/10/2011 21:30 | 15 hours and 30 minutes | Problems with the database infrastructure behind Castor.
Castor CMS & GEN instances | UNSCHEDULED | OUTAGE | 22/10/2011 19:30 | 23/10/2011 14:00 | 18 hours and 30 minutes | Database problems behind the CMS and GEN (Alice and others) Castor instances.
All Castor (all SRM end points) | UNSCHEDULED | WARNING | 22/10/2011 14:00 | 24/10/2011 14:00 | 2 days | At risk following problems with the database infrastructure behind Castor.
lcgfts, lcgftm | UNSCHEDULED | OUTAGE | 19/10/2011 10:30 | 19/10/2011 12:30 | 2 hours | Problem with the database behind the FTS service. Now under investigation.
lcgwms02.gridpp.rl.ac.uk | SCHEDULED | WARNING | 13/10/2011 16:00 | 19/10/2011 12:45 | 5 days, 20 hours and 45 minutes | Drain and MySQL maintenance.