Tier1 Operations Report 2010-12-15

RAL Tier1 Operations Report for 15th December 2010

Review of Issues during the week from 8th to 15th December 2010.

  • Disk server GDSS117 (CMSWanIn), which had been out of production since 26-27 October, was returned to service on 9th December.
  • On Thursday 9th December a problem with the xrootd manager not behaving correctly was found (and fixed). This had been causing a large number of Castor transfers for Alice to fail since 4th December.
  • On Thursday 9th December the Site Dashboard was unavailable for around an hour while the machine hosting it was moved into R89.
  • Gdss68 (CMSFarmRead) was out of production from Thursday 9th to Monday 13th December. Following investigation it needed its RAID card replacing.
  • Gdss135 (AtlasFarm) was unavailable for some time during Friday 10th December. It produced an alarm shortly after midnight - there were problems on /var. It was returned to service early that afternoon.
  • Disk server GDSS111 (AtlasFarm) reported a problem on Saturday afternoon (11th Dec.) and was taken out of production. The problem was load triggered by a drive failure, although further problems were encountered when the server did not initially see the replacement disk. The server was returned to production on Monday (13th).
  • Over the weekend there was very high load on the Castor GEN instance from T2K.
  • One disk server (gdss77 - CMSFarmRead) was out of production for some time after the power outage. It required the OS to be re-installed. It was nominally returned to production on Thursday (9th), although it was subsequently found to have a problem that was not resolved until this morning (15th).
  • Concern was raised when a number of LHCb files were found to have checksums stored in Castor that did not agree with checksums re-calculated for the files on disk. This was blocking migration to tape. On investigation these were found to be cases where the original transfer into Castor had not completed successfully. An error message is returned to the user in these cases, but unless the file is explicitly removed from Castor it will remain in this state. A later version of Castor corrects the checksum discrepancy, but the files still need to be explicitly cleaned up by the user. A sketch of the kind of checksum comparison involved follows this list.
  • From around 02:00 on Tuesday morning (14th Dec) until shortly after 11am this morning (15th) we had failed over to using the backup OPN link to CERN.
  • The planned 'Warning' (or 'At Risk') periods in the last week passed without incident. However, a change in the GOC DB meant that what was declared as Warning or At Risk was picked up as an outage by some VO systems and GridView. The periods were:
    • Power outage in the Atlas building over the weekend of 11/12 December.
    • UPS test in R89 on Monday 13th December.
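
As an illustration of the LHCb checksum issue above, the following is a minimal sketch, assuming adler32 checksums are in use, of recomputing a file's checksum and comparing it with the value recorded for it. It is not the Castor tooling itself; the file path and stored value are hypothetical.

  # Minimal sketch (not the actual Castor tooling): recompute an adler32
  # checksum for a file on disk and compare it with the value recorded
  # for it. The stored value and file path below are illustrative.
  import zlib

  def adler32_of_file(path, chunk_size=1024 * 1024):
      """Return the adler32 checksum of a file as an 8-digit hex string."""
      checksum = 1  # adler32 starting value
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(chunk_size), b""):
              checksum = zlib.adler32(chunk, checksum)
      return format(checksum & 0xFFFFFFFF, "08x")

  stored_checksum = "0a1b2c3d"                          # hypothetical recorded value
  recomputed = adler32_of_file("/castor/example/file")  # hypothetical path

  if recomputed != stored_checksum:
      # A mismatch of this kind is what blocked the migration to tape: the
      # file on disk does not correspond to the checksum recorded for it.
      print(f"checksum mismatch: stored {stored_checksum}, recomputed {recomputed}")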

Current operational status and issues.

  • Intermittent problems have been seen by CMS in the connectivity of worker nodes to their squids. These have occurred for several periods of around 20 minutes: two on Monday 6th December, again on Wednesday night (8/9) and Friday night (10/11), and also during the reload of the RAL Site Access Router on Tuesday morning (14th). All of these apart from the last coincide with external network fail-overs. The mechanism by which a fail-over leads to the squids rejecting contact from the worker nodes is not yet understood; a sketch of a simple connectivity probe follows this list.
  • A particular batch of disk servers has been responsible for a significant number of disk server failures. In response we plan to move data off this batch of servers; this process has started, with six of the Atlas disk servers currently draining.
  • The performance of the LHCb disk servers continues to be monitored. The maximum number of LHCb batch jobs has continued to be held at 1200 over this last week.
  • Transformer TX2 in R89 is still out of use.
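
As a simple illustration of the squid connectivity issue above, the following is a minimal sketch, not the monitoring actually deployed, of a periodic probe from a worker node to its squids. The host names are hypothetical, and it assumes the squids listen on the usual squid port (3128).

  # Minimal sketch: periodically attempt TCP connections to the squids a
  # worker node is configured to use, and log any failures. Host names,
  # port and interval are illustrative assumptions.
  import socket
  import time

  SQUIDS = ["squid1.example.rl.ac.uk", "squid2.example.rl.ac.uk"]  # hypothetical
  PORT = 3128       # the usual squid port (assumption)
  INTERVAL = 60     # seconds between probes

  def squid_reachable(host, port, timeout=5):
      """Return True if a TCP connection to host:port can be opened."""
      try:
          with socket.create_connection((host, port), timeout=timeout):
              return True
      except OSError:
          return False

  while True:
      for host in SQUIDS:
          if not squid_reachable(host, PORT):
              # Repeated failures over ~20 minutes would match the windows
              # reported by CMS around the external network fail-overs.
              print(time.strftime("%Y-%m-%d %H:%M:%S"), "cannot reach", host)
      time.sleep(INTERVAL)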

Declared in the GOC DB

  • None.

Advanced warning:

The following items are being discussed and are still to be formally scheduled:

  • Application of kernel update to the batch server (some small risk to batch services; possibly tomorrow).
  • Rolling update of microcode on second half of tape drives.
  • Upgrade to 64-bit OS on Castor disk servers to resolve checksumming problem. Possible dates for this are:
    • Atlas - During planned downtime of Central Atlas Services between 17-19 January 2011.
    • CMS - Late January or Early February 2011.
  • Increase shared memory for OGMA, LUGH & SOMNUS (rolling change). (Proposed date: 17-19 January 2011.)
  • Address permissions problem regarding Atlas User access to all Atlas data.

Plans for Christmas and New Year Holidays

The Tier1 will remain up over the holiday. Details of the plans have been published via a blog entry at:

http://www.gridpp.rl.ac.uk/blog/2010/12/10/ral-tier1-%E2%80%93-plans-for-christmas-holiday/

Entries in GOC DB starting between 8th and 15th December 2010.

There were no unscheduled entries in the GOC DB for this last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole Site | SCHEDULED | AT_RISK | 13/12/2010 08:45 | 13/12/2010 11:25 | 2 hours and 40 minutes | At Risk during UPS test.
Whole Site (except lcgic01) | SCHEDULED | WARNING | 11/12/2010 08:00 | 13/12/2010 10:00 | 2 days, 2 hours | Systems at risk during power work in building hosting networking equipment.
lcgic01 | SCHEDULED | OUTAGE | 10/12/2010 14:00 | 13/12/2010 10:00 | 2 days, 20 hours | System unavailable during electrical work in building over weekend.
srm-atlas | SCHEDULED | OUTAGE | 06/12/2010 08:00 | 08/12/2010 12:15 | 2 days, 4 hours and 15 minutes | Upgrade of Atlas Castor instance to version 2.1.9