Tier1 Operations Report 2009-10-21


RAL Tier1 Operations Report for 21st October 2009.

This is a review of issues since the last meeting on 14th October.

Current operational status and issues.

  • The main ongoing issue is the investigation into the cause of the problems encountered with the databases behind the Castor service. Castor itself has been working well this week and all services are up. The databases are hosted on alternative hardware which, while it does not offer the full level of resilience we would like, is known to be reliable. The full Castor service is available.
  • The patched version of the SRM (2.8-1) has been installed for srm-atlas. Installation on the other SRMs is still awaited (delayed by the Castor database hardware problems).
  • Swine ‘Flu. As previously reported, we continue to track this and to ensure preparations are in place should a significant portion of our staff be away or have to work offsite.

Review of Issues during week 14th to 21st October.

  • It was confirmed that the Castor database used following the resumption of services on the 9th October was only up to date to around midnight on the 23/24 September. All data written to Castor from that point until the crash on the 4th October was lost. The experiments have been supplied with lists of files written to Castor, and therefore lost, between the 24th September and the 4th October.
  • It was realised that restoring the Castor nameserver database to an earlier time led to the re-use of FileIDs within Castor; these IDs should be unique. Castor itself would have no knowledge of the re-use, except that the stager databases for the Atlas and LHCb instances (which are held in a different database) were restored to a point much closer to the final crash. Discussion with the Castor developers identified a mechanism by which data written to Castor under a re-used FileID could theoretically be deleted in error at a later date. The mechanism requires a deliberate action (e.g. using the Atlas stager to delete a CMS file), so we believe the additional risk is small. Nevertheless, there was a short outage on Thursday 15th October during which the FileID counter was increased beyond the point of any re-use; a sketch of the collision and of this fix is given after this list. Experiments have been supplied with lists of files added during this period of increased risk, and if necessary such files can be exported and re-imported into Castor to eliminate the risk.
  • Disk server gdss135 (part of Atlas MCDISK) was unavailable from 16th to 19th October following a kernel panic.
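To make the FileID issue above more concrete, the following is a minimal, hypothetical Python sketch (not Castor code). It shows how restoring a name server from an older backup rewinds an ID counter so that newly created files can collide with IDs that other components still remember, and why advancing the counter past the highest previously issued ID removes the possibility of further re-use. The class, variable names and numeric values are illustrative assumptions only.

    # Toy illustration of FileID re-use after a database restore.
    class NameServer:
        """Hypothetical name server handing out sequential file IDs."""

        def __init__(self, next_id=1):
            self.next_id = next_id
            self.files = {}          # file_id -> file name

        def create(self, name):
            fid = self.next_id
            self.next_id += 1
            self.files[fid] = name
            return fid

        def delete(self, fid):
            self.files.pop(fid, None)


    ns = NameServer()
    old_fid = ns.create("atlas/file_written_25_sep")   # ID 1, still known to a stager

    # Restoring the nameserver from a backup taken before that write
    # rewinds the counter, so the next file gets the same ID again.
    ns = NameServer(next_id=1)
    new_fid = ns.create("cms/file_written_10_oct")      # ID 1 re-used: a collision

    assert old_fid == new_fid  # a delete keyed on the old ID would now hit the new file

    # The fix applied on 15th October, in spirit: advance the counter beyond
    # any ID issued before the restore, so no further IDs can be re-used.
    highest_pre_crash_id = 1_000_000                    # hypothetical value
    ns.next_id = max(ns.next_id, highest_pre_crash_id + 1)

Files written while the counter overlapped the old range remain the "period of increased risk" referred to above, which is why lists of those files were supplied to the experiments.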

Advance warning:

There are no scheduled outages declared in the GOC DB. However, we will need to reschedule the installation of the updated SRM for the CMS, LHCb and GEN Castor instances.

Table showing entries in GOC DB starting between 14th and 21st October.

| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
| All Castor & CEs | UNSCHEDULED | AT_RISK | 15/10/2009 12:00 | 15/10/2009 13:00 | 1 hour | At Risk following outage for database modification. |
| All Castor & CEs | UNSCHEDULED | OUTAGE | 15/10/2009 11:00 | 15/10/2009 11:24 | 24 minutes | Stop of Castor services (and CEs) while a database update was made to resolve the issue following last week's problems. |
| lcgfts | UNSCHEDULED | AT_RISK | 15/10/2009 10:00 | 15/10/2009 12:00 | 2 hours | FTS channels to RAL will be drained ahead of the Castor intervention. Marking FTS At Risk. |