Tier1 Operations Report 2009-10-28

RAL Tier1 Operations Report for 28th October 2009.

This is a review of issues since the last meeting on 21st October.

Current operational status and issues.

  • The main ongoing issue is the investigation into the cause of the problems encountered with the databases behind the Castor service. Overall, Castor itself has worked well this week and all services are up. The databases are hosted on alternative hardware which, while it does not offer the full level of resilience we would have liked, is known to be reliable; the full Castor service is available. Work is also ongoing to check, and where appropriate improve, the resilience of the disk systems now in use for the databases (see the sketch after this list), and a further spare system is being checked out in case one of these 'temporary' systems fails. The Tier1 'disaster management process' continues to track this issue.
  • Air Conditioning Problems: The air conditioning has worked well for many weeks; however, we continue to track the underlying issues that led to the outages a couple of months ago. Plans are in progress to modify the BMS (Building Management System) so that it is more robust (at present a restart of the BMS stops the air conditioning). There are also plans to increase the number of chillers.
  • Condensation water dripping into the tape robot: This also continues to be followed up. Gravity drains were installed in both condenser drains in the first floor atrium. Water detectors have been installed beneath both condensers and connected to alarms.
  • Swine 'Flu: As previously reported, we continue to track this and to ensure preparations are in place should a significant portion of our staff be away or have to work offsite. We note that case numbers are increasing nationally and that a vaccine is now available.
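The disk-resilience checks mentioned in the first bullet lend themselves to simple automated monitoring. The following is a minimal sketch only, and it assumes (without confirmation from this report) that the 'temporary' database disk systems use Linux software RAID exposed via /proc/mdstat; it simply flags any md array running with a missing member.

    # check_md.py -- illustrative only; assumes Linux software RAID (md) is in use
    import re

    MDSTAT = "/proc/mdstat"   # standard kernel status file for md arrays

    def degraded_arrays(path=MDSTAT):
        """Return the names of md arrays whose status line shows a missing member."""
        bad, current = [], None
        for line in open(path):
            m = re.match(r"^(md\d+)\s*:", line)
            if m:
                current = m.group(1)
            # A healthy status line ends like "[2/2] [UU]"; an underscore marks a failed member.
            if current and re.search(r"\[\d+/\d+\]\s+\[[U_]*_[U_]*\]", line):
                bad.append(current)
        return bad

    if __name__ == "__main__":
        failing = degraded_arrays()
        if failing:
            print("DEGRADED:", ", ".join(failing))
        else:
            print("All md arrays have their full complement of members")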

Review of Issues during week 21st to 28th October.

  • There was an unscheduled outage of Castor on Monday afternoon (26th October). Over the weekend there had been a failure of a Fibre Channel port on the Storage Area Network, which blocked access to a disk used for making backups. Once the hardware problem was fixed on Monday morning it was necessary to reboot some nodes to regain access to the disk. None of this affected the services being delivered; however, following the reboots the distribution of the various database processes across the nodes was not optimal, which led to poor performance and then to a severe degradation of service on Monday afternoon. The resolution was to re-assign the database processes to different nodes to re-balance the system (see the sketch after this list).
  • Disk server gdss143.gridpp.rl.ac.uk was unavailable from Friday 23rd to Monday 26th October due to a hardware fault. This machine is part of atlasSimStrip.
  • A new version of the Castor SRM (2.8-2) was rolled out to all Castor instances on Monday and Tuesday this week (26/27 October).
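The re-balancing described in the first bullet above amounts to moving database services back onto their preferred cluster nodes. The sketch below is an illustration of that idea only, assuming an Oracle RAC deployment managed with 10g/11g-style srvctl commands; the database name, service names and instance names are hypothetical, and this is not the actual procedure used by the database team.

    # rebalance_services.py -- illustrative sketch; all names below are hypothetical
    import subprocess

    DB = "castor_db"                       # hypothetical db_unique_name
    PREFERRED = {                          # service -> instance it should normally run on
        "castor_svc1": "castordb1",
        "castor_svc2": "castordb2",
    }

    def running_instances(db, service):
        """Ask clusterware where a service currently runs (assumes it is running somewhere)."""
        out = subprocess.run(
            ["srvctl", "status", "service", "-d", db, "-s", service],
            capture_output=True, text=True, check=True,
        ).stdout
        # Typical output: "Service castor_svc1 is running on instance(s) castordb2"
        return out.strip().rsplit(" ", 1)[-1].split(",")

    def rebalance(db, preferred):
        for service, target in preferred.items():
            current = running_instances(db, service)
            if target not in current:
                # Move the service back to its preferred instance.
                subprocess.run(
                    ["srvctl", "relocate", "service", "-d", db,
                     "-s", service, "-i", current[0], "-t", target],
                    check=True,
                )

    if __name__ == "__main__":
        rebalance(DB, PREFERRED)

In practice such a layout would also be protected by defining preferred and available instances for each service, since relocated services do not automatically fall back after a node reboot, which is consistent with the non-optimal distribution seen on Monday.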

Advance warning:

  • WMS03 outage from 30th October to 5th November, to enable hot-swappable disks. This period includes time to drain jobs from the service.

Table showing entries in GOC DB starting between 21st and 28th October.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-alice, srm-cms, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos | SCHEDULED | AT_RISK | 27/10/2009 10:00 | 27/10/2009 12:00 | 2 hours | At Risk during update to SRM 2.8-2
All Castor & CEs | UNSCHEDULED | OUTAGE | 26/10/2009 15:35 | 26/10/2009 16:45 | 1 hour 10 minutes | Investigating database access problems
srm-atlas, srm-lhcb | SCHEDULED | AT_RISK | 26/10/2009 10:00 | 26/10/2009 12:00 | 2 hours | At Risk during update to SRM version 2.8-2