Tier1 Operations Report 2010-10-27

From GridPP Wiki

RAL Tier1 Operations Report for 27th October 2010

Review of Issues during the week from 20th to 27th October 2010.

  • On Wednesday 20th Oct. GDSS408 (AtlasMCDisk) crashed with a kernel panic. Errors indicated a memory fault, and the memory was replaced. After a short memtest run the server was returned to production the same day.
  • As also reported last week, on Wednesday (20th Oct.) there was a problem with the SRM servers for Atlas, which recurred that afternoon. The problem was traced to a user attempting to copy files elsewhere directly, rather than using the FTS. The user was blocked from accessing the SRMs. Good communication with the Atlas experiment also enabled them to take appropriate action quickly.
  • The upgrade of the Castor GEN instance, planned for Mon-Wed 25-27 October, completed ahead of schedule at lunchtime on Tuesday (26th). We did fail some SAM tests on CEs when the file replication checks were unable to copy files to/from the Castor GEN instance. This was resolved by moving the tests to another Castor instance.

Current operational status and issues.

  • Last night (26-27 October) disk server GDSS117 (CMSWanIn) failed with a read-only filesystem. It was removed from production this morning. There were two un-migrated files on this system, and both have now been copied to another disk server. GDSS117 is being re-run through the acceptance tests for up to a week to flush out any problems.
  • Yesterday (26th Oct.) LHCb reported problems with malformed TURLs being returned by the LHCb SRMs. This is not yet understood.
  • GDSS280 (CMSFarmRead) showed FSProbe errors and was taken out of production on Thursday 19th August. As reported last week, this server was returned to production on the morning of 15th September. The server again gave FSProbe errors and was taken back out of production the next day (16th). 30 un-migrated files were lost. A review of the problems encountered is being followed up via a post mortem.
  • Performance issues on Castor Disk Servers for LHCb: This is being kept under observation. Investigations were suspended during the Castor 2.1.9 upgrade but are being resumed now that LHCb have re-started running batch work here. There is currently a limit of 800 simultaneous batch jobs for LHCb. This will be increased in a controlled manner when there is a waiting job queue, with Castor/disk performance being monitored.
  • As reported at previous meetings, one power supply (of two) for one of the (two) EMC disk arrays behind the LSC/FTS/3D services was moved to UPS power to test whether the electrical noise has been reduced sufficiently. The test is ongoing, but some errors (roughly one per week) have been seen.
  • On Saturday 10th July transformer TX2 in R89 tripped. This is the same transformer that tripped some months ago, and for which remedial work was undertaken. Following work carried out on TX4 on 18th Oct., the indication is that the cause of the TX2 problem relates to over-sensitive earth leakage detection.
  • Atlas are looking to run some user jobs at RAL. To ensure these do not cause problems (i.e. excessive load) for the existing Atlas software server, these jobs will make use of a pilot CVMFS-based solution for delivering Atlas software to the worker nodes.

Declared in the GOC DB

  • Tuesday 2nd November. "Warning" on lcgrbp01 (MyProxy) for introduction of Quattorized service.
  • Wednesday 3rd November. "Warning" on site-bdii for rolling update to gLite 3.2 and SL5.

Advance warning:

The following items remain to be scheduled/announced:

  • Monday 13th December (just after LHC 2010 run ends): UPS test.
  • Castor Upgrade to 2.1.9.
    • Upgrade CMS - during the week beginning 8 November
    • Upgrade ATLAS - during the week beginning 22 November
  • Upgrade to 64-bit OS on Castor disk servers to resolve checksumming problem.

Entries in GOC DB starting between 20th and 27th October 2010.

There were no unscheduled entries in the GOC DB for this last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-superb, srm-t2k | SCHEDULED | OUTAGE | 25/10/2010 08:00 | 26/10/2010 12:20 | 1 day, 4 hours and 20 minutes | Upgrade of Castor GEN instance to version 2.1.9.
Whole Site | SCHEDULED | AT_RISK | 20/10/2010 10:00 | 21/10/2010 12:00 | 1 day, 2 hours | Site At Risk during UPS maintenance.
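
The Duration column above follows directly from the Start and End timestamps. As a cross-check, the arithmetic can be sketched in Python as below. This is a minimal illustration, not part of the GOC DB tooling: the function name `downtime_duration` is hypothetical, the timestamps are assumed to be naive local times in `DD/MM/YYYY HH:MM` form, and the wording of the output simply mimics the two entries in the table.

```python
from datetime import datetime

# Assumed timestamp layout, taken from the Start/End columns above.
TIMESTAMP_FORMAT = "%d/%m/%Y %H:%M"


def _plural(value: int, unit: str) -> str:
    """Return e.g. '1 day' or '4 hours'."""
    return f"{value} {unit}" + ("" if value == 1 else "s")


def downtime_duration(start: str, end: str) -> str:
    """Hypothetical helper: describe the time between two table timestamps.

    Timestamps are treated as naive local times, so any daylight-saving
    change inside the interval is ignored (an assumption of this sketch).
    """
    delta = datetime.strptime(end, TIMESTAMP_FORMAT) - datetime.strptime(
        start, TIMESTAMP_FORMAT
    )
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60

    # Days and hours are comma-separated; minutes are appended with "and".
    main_parts = [
        _plural(value, unit)
        for value, unit in ((delta.days, "day"), (hours, "hour"))
        if value
    ]
    text = ", ".join(main_parts)
    if minutes:
        minute_text = _plural(minutes, "minute")
        text = f"{text} and {minute_text}" if text else minute_text
    return text or "0 minutes"


if __name__ == "__main__":
    print(downtime_duration("25/10/2010 08:00", "26/10/2010 12:20"))
    print(downtime_duration("20/10/2010 10:00", "21/10/2010 12:00"))
```

Run against the two entries above, the sketch reproduces the durations listed in the table.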