Tier1 Operations Report 2010-11-17

RAL Tier1 Operations Report for 17th November 2010

Review of Issues during the week from 10th to 17th November 2010.

  • On Tuesday (16th Nov) disk server gdss326 (AtlasMCTape) was taken out of production following a problem accessing files on it. This is a D0T1 service class and there were 25 un-migrated files on the server when it failed. The server was restarted and these files have since been copied to tape. The server is now in 'draining' mode while the RAID array rebuilds following a drive replacement, and will be returned to full production when that has completed.
  • On Monday (15th) CMS reported a corrupt file on disk from their reprocessing. We are tracking such occurrences.
  • On Monday (15th) LHCb reported a file access problem. This was traced to a problem updating grid-mapfiles on the new LHCb SRMs and has been fixed.
  • On Wednesday (10th Nov) the LHCb disk servers were upgraded to a 64-bit OS. There was subsequently a problem with rootd authentication; a workaround was put in place, and a fix has now been found which will be applied shortly, after testing. Checksums were enabled for Castor LHCb disk files on Monday (15th); a minimal checksum sketch is given after this list.
  • Over the weekend there was very high load on Atlas storage, with data rates both in and out exceeding 1 GByte/sec. The disk servers were heavily loaded but running OK; the limitation was in the SRMs. On investigation the SRM load appears to be dominated by status requests from the FTS, which is being followed up (see the log-counting sketch after this list). Work is under way to upgrade and re-configure the Atlas SRMs. This will see an increase from 4 to 6 nodes, two of which will be dedicated to back-end processing.
  • Gdss280 (CMSFarmRead), which had twice reported FSProbe errors, has now been replaced by Gdss289 (from cmsSpare).
  • On Thursday 11th Nov the LHCb SRMs were upgraded. The two existing systems were replaced with three more powerful systems.
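
The checksums mentioned above are Castor's adler32 file checksums. As a minimal sketch only (not the Tier1 deployment scripts; the file path is purely illustrative), the adler32 of a file on a disk server can be computed with Python's zlib module and compared against the value recorded for that file in the Castor name server:

    import zlib

    def adler32_of_file(path, chunk_size=1024 * 1024):
        """Compute the adler32 checksum of a file, reading it in chunks."""
        checksum = 1  # zlib's adler32 starts from 1
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                checksum = zlib.adler32(chunk, checksum)
        # Report as an unsigned 32-bit value in hex, the usual grid convention
        return format(checksum & 0xFFFFFFFF, "08x")

    if __name__ == "__main__":
        # Hypothetical path on a disk server partition
        print(adler32_of_file("/exportstage/lhcbDst/example.file"))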
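
On the SRM load investigation: the observation that status polling dominates can be illustrated by tallying the SRM v2.2 operation names seen in a request log. The log file name and line format below are assumptions made for the sake of the sketch, not the actual CASTOR SRM log layout:

    from collections import Counter

    # SRM v2.2 operation names: status polls versus new transfer requests
    OPERATIONS = [
        "srmStatusOfGetRequest", "srmStatusOfPutRequest",
        "srmPrepareToGet", "srmPrepareToPut", "srmLs", "srmRm",
    ]

    def count_operations(log_path):
        """Count how often each SRM operation name appears in a request log."""
        counts = Counter()
        with open(log_path) as log:
            for line in log:
                for op in OPERATIONS:
                    if op in line:
                        counts[op] += 1
        return counts

    if __name__ == "__main__":
        # Hypothetical log file name
        for op, n in count_operations("srm_requests.log").most_common():
            print("%-24s %d" % (op, n))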

Current operational status and issues.

  • On Monday 8th November Disk server gdss398 (AtlasDataDisk) failed with FSProbe errors. Attempts to 'fsck' the disks made it clear the file systems had major problems. All data on one of the three Castor partitions has been lost. Some critical files were manually copied off, but most files on the server were declared lost. A Post Mortem is being prepared at:
 https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20101108_Atlas_Disk_Server_GDSS398_Data_Loss
  • During the night of 26-27 October disk server GDSS117 (CMSWanIn) failed with a read-only filesystem and was removed from production. Since then it has been re-running the acceptance tests before being returned to production.
  • The problem with the cooling of one of the power supplies on the tape robot was investigated during a downtime of the tape system on 2nd Nov. A further intervention (scheduled for 23rd Nov) will be required to fix the problem.
  • Testing of an EMC disk array with one of its power supplies connected to the UPS supply continues. Further discussions on removing the electrical noise have taken place and a solution is being prepared.
  • Transformer TX2 in R89 is still out of use. Work carried out on TX4 on 18th October indicates that the cause of the TX2 problem is over-sensitive earth leakage detection. Plans are being made to resolve this.
  • The upgrade of the Castor CMS instance to version 2.1.9 is ongoing at the time of the meeting (Wed 17th Nov.)

Declared in the GOC DB

  • Upgrade CMS castor instance - Tuesday to Thursday 16-18 November. (Ongoing at time of meeting.)
  • Tape system unavailable on Tuesday 23rd November during work on the tape robot to resolve the problem with power supply cooling.

Advanced warning:

The following items remain to be scheduled/announced:

  • Upgrade CE08 to a CREAM CE. Mon-Thu 22-25 Nov.
  • Upgrade castor Atlas instance to version 2.1.9 - Monday to Wednesday 6 - 8 December.
  • Power outage of Atlas building weekend 11/12 December. Whole Tier1 at risk.
  • Monday 13th December (just after LHC 2010 run ends): UPS test.
  • Upgrade to 64-bit OS on Castor disk servers to resolve checksumming problem.
  • Increase shared memory for OGMA, LUGH & SOMNUS (rolling change); see the sketch after this list.
  • Address permissions problem regarding Atlas User access to all Atlas data.
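
For the shared-memory item above: OGMA, LUGH and SOMNUS are database systems, and on Linux increasing shared memory for a database normally means raising the kernel limits that bound its shared memory segments. The figures below are placeholders for illustration only, not the values planned for these machines; a sketch of the relevant /etc/sysctl.conf entries:

    # Maximum size in bytes of a single shared memory segment
    kernel.shmmax = 8589934592    # 8 GB (illustrative value)
    # Total shared memory allowed system-wide, in 4 kB pages
    kernel.shmall = 4194304       # 16 GB (illustrative value)
    # Maximum number of shared memory segments system-wide
    kernel.shmmni = 4096

    # Applied without a reboot with: sysctl -p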

Entries in GOC DB starting between 10th and 17th November 2010.

There was one unscheduled entry in the GOC DB this last week: the 'warning' while the LHCb SRMs were replaced in a rolling change.

Service  | Scheduled?  | Outage/At Risk | Start            | End              | Duration         | Reason
srm-cms  | SCHEDULED   | OUTAGE         | 16/11/2010 08:00 | 18/11/2010 18:00 | 2 days, 10 hours | Upgrade of CMS Castor instance to version 2.1.9.
srm-lhcb | UNSCHEDULED | WARNING        | 11/11/2010 10:00 | 11/11/2010 14:00 | 4 hours          | Service at risk while SRMs upgraded by rolling change.
srm-lhcb | SCHEDULED   | OUTAGE         | 10/11/2010 08:00 | 10/11/2010 16:00 | 8 hours          | Service unavailable while disk servers upgraded to 64-bit OS.