Tier1 Operations Report 2010-10-20

From GridPP Wiki

RAL Tier1 Operations Report for 20th October 2010

Review of Issues during the week from 13th to 20th October 2010.

  • Following problems with CE01, a replacement CREAM CE (CE09) was brought into production last week. At last week's meeting some teething problems were reported; in particular, jobs were unable to write output back to the CE. This was resolved on Wednesday morning.
  • On Friday (15th) there was a second case of files written to the LHCb Castor instance being recorded with a size of zero in the Castor database. Last week this was attributed to database performance issues relating to the distribution of work across the Oracle RAC nodes. This time multiple copies of the LHCb stager daemon were found to be running. This has since been corrected and the problem resolved.
  • On the morning of Thursday (14th) disk server gdss415, part of AtlasSimStrip, crashed and was taken out of service. Some problems were present on the non-Castor filesystems (e.g. /var), and the system was fsck'd. Some checksums were verified and the server was returned to production on Friday (15th). Owing to a mistake, disk server gdss414 was also disabled for a while during the time gdss415 was unavailable.
  • On Tuesday (19th) GDSS463 (LHCbDst) was taken out of service for the backplane to be replaced by the vendor's engineer. This was done successfully. However, the system fsck'd the disks after reboot. It was returned to production on the Wednesday morning (20th).
  • The planned outage of the LFC/FTS/3D services on Thursday 14th October to apply kernel updates to the Oracle RAC was completed on time.

Current operational status and issues.

  • There was a problem this lunchtime (20th Oct) with the SRM servers for Atlas: the srmServer daemon died on several of the SRM servers. This was resolved just before this meeting.
  • This morning (20th Oct) GDSS408 (AtlasMCDisk) crashed with a kernel panic. Errors indicate a memory fault. The faulty memory has been replaced and we are currently running a short memtest before returning the server to production.
  • Gdss280 (CMSFarmRead) showed FSProbe errors and was taken out of production on Thursday 19th August. As reported last week, this server was returned to production on the morning of 15th September. The server again gave FSProbe errors and was taken back out of production the next day (16th). 30 un-migrated files were lost. A review of the problems encountered is being followed up via a post mortem.
  • GDSS81 (AtlasDataDisk) is being drained ahead of removal from production.
  • Performance issues on Castor disk servers for LHCb: this is being kept under observation. Investigations were suspended during the Castor 2.1.9 upgrade but are being resumed now that LHCb have re-started running batch work here. The number of LHCb batch jobs was capped at 500 over the end of last week and the weekend, and increased to 800 on Monday morning (18th).
  • As reported at previous meetings, one power supply (of two) for one of the (two) EMC disk arrays behind the LFC/FTS/3D services was moved to UPS power as a test of whether the electrical noise has been reduced sufficiently. The test is ongoing, but some errors (roughly once per week) have been seen.
  • On Saturday 10th July transformer TX2 in R89 tripped. This is the same transformer that tripped some months ago, and for which remedial work was undertaken. Work carried out on Monday (18th Oct) on TX4 indicates that the cause of the TX2 problem relates to over-sensitive earth-leakage detection.

Declared in the GOC DB

  • Wednesday 20th October - Site At Risk for UPS maintenance.
  • Monday - Wednesday 25-27 October. Upgrade of Castor GEN instance.

Advance warning:

The following items remain to be scheduled/announced:

  • Monday 13th December (just after LHC 2010 run ends): UPS test.
  • Castor Upgrade to 2.1.9.
    • Upgrade CMS - during the week beginning 8 November
    • Upgrade ATLAS - during the week beginning 22 November

Entries in GOC DB starting between 13th and 20th October 2010.

There was one unscheduled entry in the GOC DB for this last week. This was caused by the corrupt index in the Atlas stager database.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole site | SCHEDULED | AT_RISK | 20/10/2010 10:00 | 21/10/2010 12:00 | 1 day, 2 hours | Site At Risk during UPS maintenance.
Whole site | SCHEDULED | AT_RISK | 18/10/2010 08:30 | 19/10/2010 12:00 | 1 day, 3 hours and 30 minutes | At Risk on site while checks are made on one of the transformers providing power to computer building.
lcgfts.gridpp.rl.ac.uk, lfc-atlas.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk, lhcb-lfc.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 14/10/2010 09:00 | 14/10/2010 13:00 | 4 hours | Outage while kernel updates applied to database nodes.
lcgfts | SCHEDULED | OUTAGE | 14/10/2010 07:44 | 14/10/2010 09:00 | 1 hour and 16 minutes | Drain of FTS ahead of the planned outage on FTS/LFC/3D services.
srm-atlas | UNSCHEDULED | OUTAGE | 13/10/2010 11:15 | 13/10/2010 12:49 | 1 hour and 34 minutes | Problem on Atlas Castor instance which is currently unavailable. Fixing up corrupt index in database behind the Atlas Castor stager.