Tier1 Operations Report 2010-10-13


RAL Tier1 Operations Report for 13th October 2010

Review of Issues during the week from 6th to 13th October 2010.

  • Following resolution of the problem relating to LHCb access to the Conditions Database last week, a further problem arose. This was seen in two ways:
    • 1) Some files written to the LHCb Castor instance were recorded with a size of zero in the Castor database.
    • 2) Many failures of LHCb SRM SAM test jobs.

Both were traced to time-outs within the LHCb Castor instance. These problems appeared last Wednesday (6th) and were finally resolved on Monday (11th). The cause was slow response from the database behind Castor, which in turn was due to the way tasks were distributed across the Oracle RAC nodes: a reconfiguration on the 6th had led to a resource conflict between the LHCb stager and backup operations. A fix-up was also applied for those files recorded as having zero size.
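
The report does not describe the fix-up procedure itself, but for illustration a consistency check of this general kind might look like the minimal Python sketch below: it flags entries recorded with zero size whose physical copy on disk is non-empty. The CSV export of name-server entries ("ns_dump.csv"), its column layout, and the disk mount prefix are all hypothetical assumptions, not the actual Castor tooling.

  #!/usr/bin/env python
  # Illustrative sketch only: flag entries recorded with zero size whose
  # on-disk copy is non-empty. The export file and mount prefix below are
  # hypothetical and not part of the real Castor tooling.
  import csv
  import os
  import sys

  NS_DUMP = "ns_dump.csv"        # hypothetical export: "path,recorded_size" per line, no header
  DISK_PREFIX = "/exportstage"   # hypothetical mount point for the disk copies

  def main():
      suspect = []
      with open(NS_DUMP, newline="") as fh:
          for path, recorded_size in csv.reader(fh):
              if int(recorded_size) != 0:
                  continue
              disk_copy = os.path.join(DISK_PREFIX, path.lstrip("/"))
              if os.path.exists(disk_copy) and os.path.getsize(disk_copy) > 0:
                  suspect.append((path, os.path.getsize(disk_copy)))
      for path, size in suspect:
          print("%s: recorded 0 bytes, found %d bytes on disk" % (path, size))
      return 1 if suspect else 0

  if __name__ == "__main__":
      sys.exit(main())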

  • Wednesday 13th Oct. There was a network problem in the Atlas building early this morning. A small number of services, including the Tier1 dashboard and the FTS front ends, were unavailable for a few hours.
  • Wednesday 13th Oct. A problem with an index in the database behind the Castor Atlas Stager caused the Atlas Castor instance to be unavailable for around 90 minutes this morning.

Current operational status and issues.

  • The upgrade of the LHCb Castor instance to version 2.1.9 was completed successfully. A remaining issue with the new functionality (checksumming of files on disk) is being worked on; a minimal illustrative sketch of this type of check is given after this list.
  • A problem was reported on CE01 last week. It is being resolved by building a replacement CREAM CE (CE09), which has been brought into use, although some teething problems are being followed up.
  • Gdss280 (CMSFarmRead) showed FSPROBE errors and was taken out of production on Thursday 19th August. As reported last week, this server was returned to production on the morning of 15th September. The server again gave FSPROBE errors and was taken back out of production the next day (16th); 30 un-migrated files were lost. A review of the problems encountered is being followed up via a post mortem.
  • GDSS81 (AtlasDataDisk) is being drained ahead of removal from production.
  • Performance issues on Castor disk servers for LHCb: this is being kept under observation. Investigations were suspended during the Castor 2.1.9 upgrade but will be resumed once LHCb restarts running batch work here.
  • As reported at previous meetings, one power supply (of two) for one of the (two) EMC disk arrays behind the LFC/FTS/3D services was moved to UPS power as a test of whether the electrical noise has been reduced sufficiently. The test is ongoing but some errors (roughly once per week) have been seen.
  • On Saturday 10th July transformer TX2 in R89 tripped. This is the same transformer that tripped some months ago and for which remedial work was undertaken. As reported last week, the cause appeared to be related to temperature; however, further investigation suggests it is related to earth leakage detection. Two of the four transformers still require checking as part of the planned work following the first failure of TX2; one of these checks is planned for 18th October.
  • On Thursday (7th) Atlas & CMS Disk servers running SL5 were rebooted to pick up new kernels. This went OK, although some CMS batch work failed despite the jobs being paused during the reboots.
  • Tuesday 12th October. The pair of nodes behind the site BDII were rebooted in turn to pick up new kernels.
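
Regarding the disk-file checksumming mentioned above, a minimal sketch of that type of check is shown below. It computes an Adler-32 checksum over a file in fixed-size chunks; the choice of Adler-32 and the script itself are illustrative assumptions, not the Castor implementation.

  #!/usr/bin/env python
  # Minimal sketch of a whole-file checksum of the kind referred to above.
  # Adler-32 is assumed purely for illustration; this is not Castor code.
  import sys
  import zlib

  def adler32_of_file(path, chunk_size=1024 * 1024):
      """Return the Adler-32 checksum of a file, reading it in chunks."""
      checksum = 1  # Adler-32 is defined to start from 1
      with open(path, "rb") as fh:
          while True:
              chunk = fh.read(chunk_size)
              if not chunk:
                  break
              checksum = zlib.adler32(chunk, checksum)
      return checksum & 0xFFFFFFFF

  if __name__ == "__main__":
      for filename in sys.argv[1:]:
          print("%s: %08x" % (filename, adler32_of_file(filename)))

A checksum recorded when a file is written can later be recomputed and compared when the file is read back or migrated, which is the general purpose of such a check.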

Declared in the GOC DB

  • Thursday 14th October - Outage on the LFC/FTS/3D services while the nodes in the Oracle RAC are rebooted.
  • Monday 18th October - Site At Risk for checks on one of the R89 transformers.
  • Wednesday 20th October - Site At Risk for UPS maintenance.

Advance warning:

The following items remain to be scheduled/announced:

  • Monday 13th December (just after LHC 2010 run ends): UPS test.
  • Castor Upgrade to 2.1.9.
    • Upgrade Gen (including ALICE) - during the week beginning 25 October
    • Upgrade CMS - during the week beginning 8 November
    • Upgrade ATLAS - during the week beginning 22 November

Entries in GOC DB starting between 6th and 13th October 2010.

There was one unscheduled entry in the GOC DB for this last week: this morning's outage on the Atlas Castor instance.

  • srm-atlas: UNSCHEDULED OUTAGE, 13/10/2010 11:15 to 13/10/2010 12:49 (1 hour and 34 minutes). Reason: Problem on Atlas Castor instance which is currently unavailable. Fixing up corrupt index in database behind the Atlas Castor stager.
  • site-bdii.gridpp.rl.ac.uk: SCHEDULED AT_RISK, 12/10/2010 09:00 to 12/10/2010 11:00 (2 hours). Reason: Reboots of the two nodes behind the site-bdii, one after the other, to update kernels.
  • Whole Site: SCHEDULED AT_RISK, 12/10/2010 08:00 to 12/10/2010 09:00 (1 hour). Reason: Site At Risk for network reset to resolve problem with PerfSonar Prefix.
  • lcgbdii.gridpp.rl.ac.uk: SCHEDULED AT_RISK, 06/10/2010 09:00 to 06/10/2010 16:14 (7 hours and 14 minutes). Reason: At Risk while the five nodes in the top-bdii set are rebuilt one after the other to resolve a problem with disk partitioning.