Tier1 Operations Report 2010-07-14


RAL Tier1 Operations Report for 14th July 2010

Review of Issues during the fortnight from 30th June to 14th July 2010.

  • On Friday 2nd July there was a problem on the disk array used to hold certain types of backups of the Castor databases. The array lost a disk, although data integrity was maintained by the RAID system, which rebuilt onto a spare. However, this triggered access problems on the array that prevented some of this backup information from being written. A temporary workaround was put in place for the weekend. On Monday (5th July) the nodes in the Oracle RAC system behind Castor were rebooted sequentially in order to restore access to the disk array. During this operation Castor was shut down and the batch system paused.
  • On Monday 5th July, shortly after the successful completion of the work above, and unrelated to it, a significant problem was encountered. One of the DNS servers used by the Tier1 was not performing correctly and severely hampered operations. All services were declared as down for 2.5 hours in the afternoon until the problematic DNS server was fixed. (Note: There had also been a problem on one of the site network routers on the Sunday, the day before, which was not completely resolved until Tuesday morning. This was not directly a cause of the Tier1 problems, but was a separate incident happening at the same time.)
  • On Thursday 8th July there was a problem with an NFS server, csfnfs58, which caused some worker nodes to go offline.
  • A series of problems were seen with very high loads on the Atlas software server, with callouts on several nights between Wednesday and Saturday (7-10 July). Actions were taken to reduce the rate of Atlas job starts and to modify some NFS mount parameters (an illustrative sketch of such mount options follows this list).
  • On Wednesday afternoon 30th June data loss was reported from a CMS disk server, gdss67. This server had been out of service, as reported at the last meeting. Details of the data loss are given in a post mortem here:
 http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20100630_Disk_Server_Data_Loss_CMS
  • On Tuesday afternoon an Atlas disk server, gdss231 (AtlasGroupDisk), was unavailable for a few hours during a hardware intervention.
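
 As a purely illustrative sketch of the kind of NFS client change mentioned in the Atlas software server item above: the hostname, export path and option values below are assumptions, not the settings actually applied at RAL. Longer attribute caching and a read-only mount reduce the metadata traffic each batch job sends to the software server when many jobs start at once.

    # Illustrative sketch only: the server name, export path and option
    # values are assumptions, not the parameters actually applied.
    # actimeo lengthens the attribute-cache lifetime; a read-only mount
    # avoids unnecessary write/metadata traffic to the software server.
    mount -t nfs -o ro,noatime,actimeo=600,rsize=32768,wsize=32768,tcp \
        swserver.example:/atlas/sw /opt/atlas-sw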

Current operational status and issues.

  • gdss217 (AtlasDataTape) has memory problems. It was removed from service this morning (14th July). There are no un-migrated files on it.
  • gdss207 (AliceTape) reported problems from its disk controller and was taken out of service on Monday 5th July. This server had previously been out of production and was only returned to service on 30th June. It is under investigation.
  • As reported at the last meeting, one power supply (of two) for one of the (two) EMC disk arrays behind the LFC/FTS/3D services was moved to UPS power as a test of whether the electrical noise has been reduced sufficiently. Since then it has reported three power problems. The test continues.
  • On Saturday 10th July transformer TX2 in R89 tripped. This is the same transformer that tripped some months ago, and for which remedial work was undertaken a fortnight ago. At the time of preparation of this report we await further information on this incident.
  • On Thursday 3rd June gdss474 (LHCbMDst) failed with a read-only file system. Work is still ongoing to resolve hardware issues on this system.
  • Dust in the Computer Room, particularly the HPD room: remedial work on the lagging was due to begin yesterday (13th July) and is estimated to take several weeks.
  • Slow file transfers between RAL and SARA are being investigated. Modifications to the 'WAN' network tuning on the disk servers need to be scheduled (an illustrative sketch of such tuning follows this list).
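
 For illustration of what 'WAN' tuning on a disk server typically involves, the sketch below raises the Linux TCP buffer limits so that a single transfer stream can fill a long, high-bandwidth path such as RAL to SARA. The values are assumptions for illustration, not the changes actually planned for the RAL disk servers.

    # Illustrative sketch only: the values are assumptions, not the
    # settings actually deployed on the RAL disk servers.
    # Larger TCP buffers suit a path with a high bandwidth-delay product.
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.wmem_max=16777216
    sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
    # To persist across reboots, add the equivalent lines to /etc/sysctl.conf.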

Declared in the GOC DB

  • Monday - Thursday 19-22 July. Site at Risk for transformer work (TBC).
  • Monday 19th July (9-3) - Outage on tape system for swap to spare controller.
  • Tuesday 20th July (8-2) - Outage on tape system for microcode update on tape robots.

Not in the GOC DB yet - but proposed:

  • Tuesday 20th July. At Risk on Atlas 3D (ogma) for SAN multipath configuration update.
  • Wednesday 21st July. At Risk on LHCb 3D/FTS (lugh) for SAN multipath configuration update.
  • Thursday 22nd July. At Risk on LFC/FTS (somnus) for SAN multipath configuration update (an illustrative sketch of such a configuration follows this list).
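
 For context, the sketch below shows a minimal Linux device-mapper multipath configuration of the sort such an update might touch. The policies and values are assumptions for illustration only, not the configuration used on ogma, lugh or somnus.

    # Illustrative /etc/multipath.conf sketch: the policies and values
    # are assumptions, not the configuration deployed on these systems.
    defaults {
        user_friendly_names yes        # present devices as mpathN aliases
        path_grouping_policy failover  # use one path, fail over on error
        failback             immediate
        no_path_retry        5         # queue I/O briefly if all paths fail
    }
    # A running multipathd can normally re-read the file with:
    #   multipathd -k"reconfigure"
    # and the resulting path state can be inspected with:
    #   multipath -ll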

Advanced warning:

The following items remain to be scheduled:

  • Closure of SL4 batch workers at RAL-LCG2 announced for the start of August.
  • Doubling of network link to network stack for tape robot and Castor head nodes.

Entries in GOC DB starting between 30th June and 14th July 2010.

There were two unscheduled entries in the GOC DB during this period, both related to the DNS problems on Monday 5th July: an Outage followed by an At Risk.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole site | UNSCHEDULED | AT_RISK | 05/07/2010 15:50 | 06/07/2010 10:00 | 18 hours and 10 minutes | Networking problems at the RAL site have been resolved as far as concerns the Tier1. However, we are leaving the site 'At Risk' overnight and for the first part of the morning of 6th July as further work elsewhere on the site network will take place.
Whole site | UNSCHEDULED | OUTAGE | 05/07/2010 13:18 | 05/07/2010 15:49 | 2 hours and 31 minutes | Site in downtime due to site wide networking issue.
All Castor & Batch | SCHEDULED | OUTAGE | 05/07/2010 08:50 | 05/07/2010 12:00 | 3 hours and 10 minutes | Outage of Castor (and pause of batch system) to resolve problem of disk access on nodes in the Oracle RAC behind Castor.
lcgvo0598 | SCHEDULED | OUTAGE | 01/07/2010 10:00 | 21/07/2010 18:00 | 20 days, 8 hours | System being retired from service.