Tier1 Operations Report 2009-09-30


RAL Tier1 Operations Report for 30th September 2009.

This is a review of issues since the last meeting on 23rd September.

Current operational status and issues.

  • Some problems were seen after the Castor CIP upgrade. These appear to have been confined to failures of the OPS VO SAM tests and were finally resolved this morning (30th September).
  • We await the patched version of the SRM that will fully resolve problems seen by Atlas after the SRM 2.8-0 upgrade some ten days ago. In the meantime a workaround is in place.
  • Problems on the disk system underneath the Oracle RAC are being investigated. Failures similar to those of 10th September have recurred. The Oracle patch applied since the first incident has meant that the database services have continued running and there has been no impact on the service. However, we are currently running without a mirror copy of the Castor databases.
  • One of the nodes in the Oracle RAC that supports Atlas and LHCb has a problem and keeps rebooting. This node forms part of the voting system within the RAC, and it will be necessary to take a downtime to re-configure a different node in its place. This will most likely be scheduled for next Tuesday, 6th October.
  • Swine ‘Flu. As reported last week: we continue to track this and ensure preparations are in place should a significant portion of our staff be away or have to work offsite.

Review of issues during the week 23rd to 30th September.

  • Batch system problems. At the end of last week (Thursday), significant problems were being encountered with the batch system. The PBS server was regularly crashing and we were unable to add more worker nodes into the batch system, which was running with very little (almost no) SL5 capacity. On Friday morning (25th September) the PBS server was updated (a minor version release); this resolved the issue, which appears to have been caused by a version mismatch between clients and server.
  • Problems were encountered on the Atlas Castor system on Tuesday morning (29th) when writing into the Atlas MCDISK space token. The cause was traced to some disk servers being put into production with an incorrect LSF configuration.
  • There was a double disk failure on server gdss126 (part of cmsWanOut), resulting in the loss of the filesystem. This is a d0t1 service class and, as there were no files awaiting migration, no data was lost.

Advance warning:

The following outages have been declared in the GOC DB:

  • Thursday 1st October: At Risk for LFC, FTS and Atlas 3D for updating of back-end systems (Oracle ASM patches and OS kernel updates).

Other items:

  • We are discussing arrangements for the migration of the LHCb 3D system ('LUGH') to a 64-bit Oracle system.
  • Tuesday 6th October: possible outage to resolve the problem of the rebooting node in the Oracle RAC behind the Castor Atlas and LHCb instances.

Table showing entries in GOC DB starting between 23rd and 30th September.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All CEs | UNSCHEDULED | AT_RISK | 29/09/2009 16:01 | 30/09/2009 14:00 | 21 hours and 59 minutes | CEs At Risk as we are failing OPS VO SAM tests, but believe the CEs are working OK for other VOs. These failures follow modifications to the Castor Information Provider, which publishes storage information, and are under investigation.
All CEs, SRMs and FTS | UNSCHEDULED | AT_RISK | 29/09/2009 14:48 | 29/09/2009 16:00 | 1 hour and 12 minutes | We are seeing some problems on OPS VO SAM tests following the upgrade of the Castor Information Provider that publishes information on Castor. Declared an At Risk while this is investigated.
All Castor | SCHEDULED | AT_RISK | 29/09/2009 12:00 | 29/09/2009 14:00 | 2 hours | Castor at risk while the CIP system is upgraded.
All CEs (batch) | UNSCHEDULED | AT_RISK | 25/09/2009 08:30 | 25/09/2009 11:00 | 2 hours and 30 minutes | At Risk for the batch system. We are seeing some problems in the batch scheduler and are unable to increase batch capacity beyond a certain point. These problems will be investigated during this period.
WMS02 | SCHEDULED | OUTAGE | 22/09/2009 16:00 | 30/09/2009 10:03 | 7 days, 18 hours and 3 minutes | Outage to make the disks on the host hot-swappable. This includes time for draining beforehand and subsequent re-installation.
Whole Site | SCHEDULED | AT_RISK | 22/09/2009 08:00 | 22/09/2009 11:59 | 3 hours and 59 minutes | At Risk during and following a test of the UPS.
All Castor and batch | SCHEDULED | OUTAGE | 22/09/2009 07:45 | 22/09/2009 10:00 | 2 hours and 15 minutes | Outage for tests of the UPS in the new Computer Room. During the tests, transfers to/from Castor will be suspended.
FTS | SCHEDULED | AT_RISK | 22/09/2009 07:00 | 22/09/2009 10:00 | 3 hours | Transfers to the RAL Tier1 will be drained out during this time. This is in connection with the scheduled outage of Castor while tests are made on the UPS in the new computer room. (Other transfers will continue as normal.)