Tier1 Operations Report 2011-11-09

RAL Tier1 Operations Report for 9th November 2011

Review of Issues during the week 2nd to 9th November 2011.

  • The Post Mortem (SIR) has been completed for the problems with Castor, or rather the database infrastructure behind Castor, over the weekend of Sat & Sun 22/23 Oct. See:

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20111022_Castor_Outage_RAC_Nodes_Crashing.

  • The Post Mortem (SIR) is still under preparation for the problems with the Atlas Castor instance a week ago (30 Oct - 1 Nov).
  • We note that some VOs reported a loss of availability following the removal of our last non-CREAM CE, lcgce06; their availability calculations had not correctly allowed for CREAM CEs.
  • On Thursday 3rd Nov, following a configuration problem, /tmp was unavailable on the batch worker nodes for around an hour.
  • On Friday 4th Nov there was a problem with the CMS Castor JobManager, which had stopped working at around 07:45. This was fixed by restarting it at 08:30.
  • Overnight Thursday-Friday (3/4th Nov) one of the nodes in the PLUTO RAC (which hosts the Castor databases for CMS & GEN) failed with a hardware problem. The database services within the PLUTO RAC failed over correctly and there was no operational impact. The node was replaced on Friday and, after running for a few days as a member of the RAC without hosting the database, was fully re-enabled yesterday.
  • This morning (Wed 9th Nov) one of the five top-level BDII nodes (lcgbdii0632) failed. The system has now been restarted and the service is again at full strength.

Resolved Disk Server Issues

  • On Friday 4th Nov CMS reported file transfer problems. These were traced to a problem on a single disk server (GDSS295). Investigation revealed that three disk servers recently returned to production had not been configured correctly following a re-installation: some Castor configuration steps had not been carried out at the time and had to be applied afterwards.

Current operational status and issues.

  • The slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance) continue to be investigated. Improvements to rates to/from the RAL Tier1 have been made and now only two channels (NIKHEF to RAL; Birmingham to RAL) remain below an acceptable threshold.
  • We continue to work with the Perfsonar network test system to understand some anomalies that have been seen.

Ongoing Disk Server Issues

  • Gdss456 (AtlasDataDisk) failed with a read-only file system on Wednesday 28th September. The Fabric Team has completed its work on this server and it is awaiting re-deployment.
  • As reported in recent weeks, we are seeing a high number of 'SMART' errors reported by a particular batch of disk servers. Most of these are spurious and are resolved by an updated version of the disk controller firmware. This update has been successfully applied to the D0T1 disk servers and will be rolled out to the affected D1T0 disk servers over the next week or two.

Notable Changes made this last week

  • The update to the "WAN tuning" (tcp sysctl) settings that was removed last week as part of investigations into Atlas Castor problems have been partially replied to continue actions pursuing the asymmetric file transfer rates.
  • LCGCE06, our last non-CREAM CE, has been drained ready for decommissioning.
  • Monday 7th November: Update to CIP (Castor Information Provider) to fix problem of over-reporting tape capacity.
  • The merger of the Atlas tape-backed disk pools has been completed.
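
For illustration, "WAN tuning" of this kind normally means raising the kernel's TCP buffer limits via sysctl so that transfers over high-latency wide-area links can use a sufficiently large TCP window. The fragment below is a generic sketch only; the actual parameters and values applied at RAL are not recorded in this report.

    # /etc/sysctl.conf fragment - generic example, not the values used at RAL
    # Raise the maximum socket buffer sizes for high bandwidth-delay-product links
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    # TCP receive and send buffers: min, default, max (bytes)
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216
    # Apply the settings without a reboot:  sysctl -p /etc/sysctl.conf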

Forthcoming Work & Interventions

  • Tuesday 15th November: Update to the site firewall. We have been warned of a 30-minute break in connectivity.

Declared in the GOC DB

  • None

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced:

  • Update the disk controller firmware on D1T0 nodes in the batch of servers reporting spurious SMART errors.
  • There are also plans to move part of the cooling system onto the UPS supply; this may require a complete outage (including systems on the UPS).
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure. This will only proceed once the problem that caused the cancellation of the first stage of this work last week is understood and fixed.
  • Networking change required to extend range of addresses that route over the OPN.
  • Address the permissions problem regarding Atlas user access to all Atlas data.
  • Replace hardware running Castor Head Nodes (aimed for end of year).

Entries in GOC DB starting between 2nd and 9th November 2011.

There were no entries in the GOC DB for this last week.

Open GGUS Tickets

GGUS ID | Level  | Urgency     | State             | Creation   | Last Update | VO    | Subject
76023   | Yellow | very urgent | in progress       | 2011-11-05 | 2011-11-08  |       | LB query failed
75395   | Red    | urgent      | waiting for reply | 2011-10-17 | 2011-11-02  | T2K   | WMS 'jumping'
74353   | Red    | very urgent | waiting for reply | 2011-09-16 | 2011-11-07  | Pheno | Proxy not renewing properly from WMS
68853   | Red    | less urgent | on hold           | 2011-03-22 | 2011-11-07  |       | Retirement of SL4 and 32bit DPM Head Nodes and Servers (Holding Ticket for Tier2s)
68077   | Red    | less urgent | in progress       | 2011-02-28 | 2011-11-02  |       | Mandatory WLCG InstalledOnlineCapacity not published
64995   | Red    | less urgent | in progress       | 2010-12-03 | 2011-11-02  |       | No GlueSACapability defined for WLCG Storage Areas