Tier1 Operations Report 2011-11-02

RAL Tier1 Operations Report for 2nd November 2011

Review of Issues during the week 26th October to 2nd November 2011.

  • A Post Mortem (SIR) has been produced for the problems with Castor, or rather the database infrastructure behind Castor, over the weekend (Sat & Sun 22/23 Oct). See:

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20111022_Castor_Outage_RAC_Nodes_Crashing.

  • Over the weekend and into the start of this week there were significant problems with the Atlas Castor instance, and a GGUS alarm ticket was received on Saturday evening. The problems ran from Saturday evening until Sunday lunchtime, and again from Sunday evening to Monday lunchtime. They appear to be related to Oracle performance and its interaction with Castor. A Post Mortem is being prepared for this incident.

Resolved Disk Server Issues

  • On Wednesday (26th) GDSS540 (cmsFarmRead) was taken out of production for its raid controller card to be swapped as it was (erroneously) reporting temperature problems. The system was returned to production the following afternoon.
  • On Saturday morning (29th Oct) GDSS332 (LHCbDst - D1T0) failed. The system was returned to production at the end of Monday afternoon after having a disk replaced and the raid card firmware updated.
  • On Monday morning GDSS480 (AtlasDataDisk) failed with a read only filesystem. A drive had failed and it required a reboot to see the replacement. It was returned to production later that day.
  • gdss296 (CMSFarmRead) had been out of production since 20th August, having crashed whilst undergoing acceptance testing. It was returned to production yesterday (1st November).
  • gdss396 (CMSWanIn) failed with FSPROBE problems on 13th September, resulting in the loss of a single CMS file (since copied in from elsewhere). The server subsequently crashed under test and was eventually returned to production on 1st November.
  • On 29th September gdss295 (CMSFarmRead) failed with FSPROBE errors. It failed again during acceptance tests before the issues were finally resolved. It was returned to production today (2nd November).

Current operational status and issues.

  • We have been failing CMS SAM tests since the withdrawal of our last LCG CE (CE06). This is being fixed by CMS.
  • Atlas report slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetric performance). CMS seems to experience this as well, although between RAL and foreign Tier2s. The pattern of asymmetric flows appears complex and is being actively investigated. During the last week an update to TCP settings across a range of our disk servers was made as part of these investigations; an illustrative sketch of the kind of kernel settings involved appears after this list. (Note: as part of the investigations into the Atlas problems over the weekend this change was backed out. It will be re-applied in due course.)
  • We continue to work with the Perfsonar network test system, including trying to understand some anomalies that have been seen.
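
The report does not record exactly which TCP settings were changed on the disk servers. The snippet below is only an illustrative sketch of the kind of kernel (sysctl) parameters commonly tuned for wide-area transfers over high bandwidth-delay-product links; the values shown are assumptions for illustration, not the settings actually applied (or backed out) at RAL.

  # /etc/sysctl.conf excerpt - illustrative values only, not the RAL configuration
  # Larger maximum socket buffers let TCP keep a long, fast WAN path full.
  net.core.rmem_max = 16777216
  net.core.wmem_max = 16777216
  # Minimum / default / maximum buffer sizes used by TCP receive/send autotuning.
  net.ipv4.tcp_rmem = 4096 87380 16777216
  net.ipv4.tcp_wmem = 4096 65536 16777216

A change of this kind is typically applied with "sysctl -p" and backed out by restoring the previous values in the same way, consistent with the roll-back recorded under Notable Changes below.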

Ongoing Disk Server Issues

  • gdss456 (AtlasDataDisk) failed with a read only file system on Wednesday 28th September. Following draining, investigations into this system are ongoing.
  • We are seeing a high number of 'SMART' errors reported by a particular batch of disk servers. Most of these are spurious and are resolved by an updated version of the disk controller firmware. This update has been applied to the D0T1 disk servers; we will run with it for a week or so before rolling it out to the affected D1T0 disk servers. (An illustrative sketch of how SMART counters are inspected follows this list.)
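
'SMART' here refers to the drives' own self-monitoring counters. As a purely illustrative sketch of how such counters are usually inspected on a Linux server with smartmontools (the exact invocation for these machines, and any controller-specific options, are assumptions rather than details taken from this report):

  # Illustrative only - drives sitting behind a hardware RAID controller generally
  # need a controller-specific '-d' option, which depends on the hardware in use.
  smartctl -H /dev/sda   # overall health self-assessment
  smartctl -A /dev/sda   # vendor attribute table (reallocated sectors, etc.)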

Notable Changes made this last week

  • Rolled back the update to TCP sysctl settings made last week (as part of the investigations into the Atlas Castor problems).
  • Tuesday 1st November: microcode updates were applied to the tape libraries. (Two of the T10KC tape drives were also updated to the latest firmware.)
  • The planned intervention on the network link into RAL took place yesterday morning with no noticeable effect on Tier1 operations. (We had planned to drain the FTS and pause batch work, but this was not done.)
  • CE06, our last non-CREAM CE, is being drained ready for decommissioning.
  • Re-configured CREAM CEs so that each VO has access to three of them.

Forthcoming Work & Interventions

  • Tuesday 15th November. Update to site firewall.
  • Update to CIP to fix over-reporting of tape capacity.

Declared in the GOC DB

  • None

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced:

  • Merge Atlas tape backed diskpools. (This has been started but there is more to do.)
  • Update the disk controller firmware on D1T0 nodes in the batch of servers reporting spurious SMART errors.
  • There are also plans to move part of the cooling system onto the UPS supply that may require a complete outage (including systems on UPS).
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure. This will only proceed once the problem that caused the cancellation of the first stage of this work last week is understood and fixed.
  • Networking change required to extend range of addresses that route over the OPN.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Replace hardware running Castor Head Nodes (aimed for end of year).

Entries in GOC DB starting between 26th October and 2nd November 2011.

There was one unscheduled Outage and two unscheduled Warnings in the last week. All relate to the Atlas Castor problems reported above, for which a Post Mortem is being prepared.

Service Scheduled? Outage/At Risk Start End Duration Reason
Whole site SCHEDULED WARNING 01/11/2011 08:00 01/11/2011 10:00 2 hours At Risk during work on site network link. This work should not directly affect our services, but the connection is At Risk.
srm-atlas.gridpp.rl.ac.uk, UNSCHEDULED WARNING 31/10/2011 17:00 01/11/2011 12:30 19 hours and 30 minutes Warning state while atlas SRM recovers from downtime.
srm-atlas UNSCHEDULED WARNING 31/10/2011 12:30 31/10/2011 17:00 4 hours and 30 minutes Downtime while we investigate problems with atlas castor instance. Changing to warning 31/10/2011 15:26
srm-atlas. UNSCHEDULED OUTAGE 29/10/2011 20:50 31/10/2011 12:30 1 day, 16 hours and 40 minutes We are experiencing problems with the atlas castor instance. Now (11/10/31 16:13) changing to warning. Reverting back to Outage
lcgce06 SCHEDULED OUTAGE 28/10/2011 14:00 14/12/2011 14:00 47 days, 1 hour drain and decommission as lcg-CE