Tier1 Operations Report 2009-11-04

RAL Tier1 Operations Report for 4th November 2009.

This is a review of issues since the last meeting on 28th October.

Current operational status and issues.

  • As of midday on Wednesday 4th November there is a problem with inbound transfers from Tier2s. Outbound traffic, along with inbound traffic from the Tier0 and other Tier1s (except NDGF), is working OK. It appears to be a networking problem.
  • There is an ongoing problem with the tape system; a hardware engineer has been working on it. This has caused a migration backlog in Castor, which should clear once the hardware issues are resolved.
  • WMS03 outage from 30-Oct to 5-Nov, to make the disks on the host hot-swappable.
  • The main ongoing issue is the investigation into the cause of the problems with the databases behind the Castor service. Work continues on improving the resilience and recovery options for the current temporary arrangement. In parallel, work has been going on to identify the fault, but this has not yet concluded. Two of the (problematic) disk arrays will be set up in the 'LPD' room in R89 and certification tests run on them, with a view to moving services back onto these systems if/when confidence is regained.
  • Air conditioning problems: The air conditioning has worked OK for many weeks, but we continue to track the underlying issues that led to the outages a couple of months ago. Planned work includes enabling the chillers to restart automatically; this will mitigate the effect of any restart of the BMS, which currently halts the system.
  • Condensation water dripping into the tape robot: This also continues to be followed up. Gravity drains have been installed for the condensers in the first-floor atrium; these are single pieces of pipe (no joins) running to the outside of the building. Where they connect under the chillers, drip trays (currently being manufactured) will be installed. Water detectors have been installed beneath both condensers and connected to alarms; their locations will be adjusted once the new drip trays are in place. The chillers will remain off until this work is completed.
  • Swine 'flu: As previously reported, we continue to track this and ensure preparations are in place should a significant portion of our staff be away or have to work off-site. We note that case numbers are increasing nationally, and that a vaccine is now available.

Review of Issues during the week 28th October to 4th November.

  • 28/10 13:15: Problem on the Castor 'GEN' instance (srm-alice, hone, ilc, minos, mice), caused by a problem on the database behind the GEN instance SRM.
  • 28/10: A GGUS ticket was received from LHCb about very slow transfers from PIC to RAL. The problem went away between 23:00 and midnight.
  • 30/10 - 2/11: Disk server gdss168 (part of Atlas MCDISK) was unavailable over the weekend.

Advance warning:

The following have not yet been announced in the GOC DB:

  • 10-Nov (09:00 - 12:00) At Risk on LFC/FTS for investigation into Fibre Channel switch problems.
  • 10-Nov (09:00 - 13:00) At Risk on Castor for quarterly Oracle patches to be applied.
  • 11-Nov (09:00 - 13:00) At Risk on 3D databases for quarterly Oracle patches to be applied.

Table showing entries in GOC DB starting between 28th October and 4th November.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgwms03 | SCHEDULED | OUTAGE | 30/10/2009 09:00 | 05/11/2009 16:00 | 6 days, 7 hours | Outage for making the disks on the host hot-swappable. This includes time for draining beforehand and subsequent re-installation.
srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, ce.ngs, lcgce02 | UNSCHEDULED | OUTAGE | 28/10/2009 13:17 | 28/10/2009 14:31 | 1 hour and 14 minutes | Problems on the SRM for the Castor 'GEN' instance. Under investigation.