Tier1 Operations Report 2009-11-18

RAL Tier1 Operations Report for 18th November 2009.

This is a review of issues since the last meeting on 11th November.

Current operational status and issues.

  • The problem of batch work occasionally ending up on the 'wrong' node has not yet been resolved. A set of nodes has been put aside and is being used to create a small batch farm to try and investigate the issue.
  • Investigations into the cause of the problems with the disk arrays that hosted the various Oracle databases have made a significant advance. Noise has been found on the current delivered by the UPS. (The voltage had been tested previously and was clean.) This is believed to be due to a mismatch between the power supplied by the UPS and the load in the UPS room. A test bypassing the UPS will be made tomorrow (19th November) to confirm or refute this hypothesis. In the meantime we continue running with the temporary arrangement on alternative disk arrays.
  • There is an ongoing problem today (18th November) on WMS01, which is not accepting new jobs. This is under investigation.
  • Updates (new kernels) following the recent security warning are essentially complete for the more vulnerable nodes. Worker nodes were updated in rotation, so batch capacity was reduced while this took place.
  • There is a problem with Castor disk-to-disk copies for LHCb from the LHCbUser Service Class. This is under investigation.
  • A possible mismatch between tape contents and Castor meta-data is being investigated. This dates from 2007, and appears related to bugs in old versions of Castor. This is thought to affect a small number of files.

Review of Issues during week 11th to 18th November.

  • During the updates for the Castor service (on Tuesday 10th November) some problems were encountered reconnecting to the databases and there was a short service outage. The cause of this has now been understood and a fix applied.
  • A problem was seen on the Castor 'GEN' instance on Monday 16th November when T2K was added to the Castor Information Provider.

Advance warning:

  • Thursday 19th November. At Risk for test of UPS bypass. The test itself will take place 08:30 to 09:30.

Table showing entries in GOC DB starting between 11th and 18th November.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
MyProxy (lcgrbp01) | SCHEDULED | OUTAGE | 17/11/2009 08:00 | 17/11/2009 09:00 | 1 hour | Outage during application of kernel update.
FTS | SCHEDULED | OUTAGE | 17/11/2009 07:00 | 17/11/2009 09:00 | 2 hours | Outage for application of kernel updates. Includes time to drain service ahead of the intervention.
All CEs | SCHEDULED | AT_RISK | 11/11/2009 09:30 | 11/11/2009 12:00 | 2 hours and 30 minutes | At Risk during application of OS kernel updates.
3D (lugh, ogma) & lhcb-lfc | SCHEDULED | AT_RISK | 11/11/2009 09:00 | 11/11/2009 13:00 | 4 hours | At Risk during application of quarterly Oracle patches.