Tier1 Operations Report 2011-02-02


RAL Tier1 Operations Report for 2nd February 2011

Review of Issues during the week from 26th January to 2nd February 2011.

  • Wednesday 26th Jan: There was a very high number of LHCb batch jobs queued (around 10,000), whereas Dirac normally limits this number. The problem was traced to a configuration error on the CE such that LHCb were not correctly seeing the state of the batch system and so did not back off job submissions (see the sketch after this list).
  • Thursday 27th Jan: A problem that appeared on the Alice VO box (we were informed via a GGUS ticket) was traced to a mismatch of configuration between two CEs.
  • Thursday 27th January: GDSS435 (AtlasMCDisk) was taken out of production around 13:00 to investigate memory errors. It was returned to production around 11:00 the next day.
  • Monday 31st Jan: Batch problems following the Castor outage. There were two problems. Firstly, the batch system was opened up prematurely (before Castor was back), and the batch jobs that started were 'paused' for a few hours. Secondly, a configuration error on the worker nodes caused many batch jobs to become stuck; this was resolved late afternoon. A significant number of (mainly queued) batch jobs were cleaned up on Tuesday morning and the batch system recovered.
  • Tuesday 1st February: One of a pair of links to one of the switch stacks failed. A faulty network transceiver was replaced the same day, fixing the fault. As this is a doubled link, the failure only caused a reduction in performance.
  • Tuesday 1st February: A tape fault was discovered that has resulted in the loss of 78 LHCb files; this has been reported to LHCb. The fault was caused by a faulty tape drive that overwrote part of the data on the tape.
  • Tuesday 1st February: There was a problem for a few hours around lunchtime whereby errors were returned when CMS tried to request file transfers via the FTS. The problem went away on its own and is not understood.
  • 24 hours was allowed for the drain of the batch system ahead of Monday's intervention, but in practice almost all jobs had drained out within 12 hours.
  • The following work was completed during the outages from Sunday to Tuesday:
    • Castor Oracle databases updated to version 10.2.0.5
    • Update to 64-bit OS for CMS disk servers. Checksumming has since been turned on for one CMS service class (CMSWanIn).
    • Application of OS updates to batch server.
    • Network (VLAN) reconfiguration in preparation for making more OPN addresses available to the Tier1; the network link to the tape systems was also doubled.
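
The 26th Jan item above concerns the way the submission framework throttles itself using the batch-system state published by the CE. Below is a minimal sketch of that kind of back-off logic, assuming a simple per-CE limit on queued jobs; it is purely illustrative and is not DIRAC's actual implementation (the threshold, helper function and CE name are all hypothetical).

 # Illustrative sketch only: the kind of back-off a submission framework
 # applies using the batch-system state published by a CE. The threshold,
 # helper function and CE name are hypothetical, not DIRAC's real code.
 MAX_QUEUED_PER_CE = 500  # hypothetical per-CE limit on waiting jobs

 def query_queued_jobs(ce_name: str) -> int:
     """Placeholder for a query of the CE's published batch-system state
     (e.g. via the information system); returns the number of waiting jobs."""
     raise NotImplementedError("site-specific query goes here")

 def should_submit(ce_name: str) -> bool:
     """Back off (return False) once the CE reports too many queued jobs.
     If the CE publishes a wrong or stale figure - as in the
     misconfiguration described above - the throttle is defeated and the
     queue can grow far beyond the intended limit."""
     try:
         queued = query_queued_jobs(ce_name)
     except NotImplementedError:
         return False  # fail safe: do not submit if the state is unknown
     return queued < MAX_QUEUED_PER_CE

 if __name__ == "__main__":
     print("submit more jobs?", should_submit("ce.example.org"))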

Current operational status and issues.

  • GDSS496 (CMSFarmRead) was taken out of production on Thursday 6th January following a problem. There were two un-migrated files on this server which had to be declared as lost to CMS (13th Jan). This system is going through acceptance testing again before being returned to production. A Post Mortem is being prepared for this failure. See:
 https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20110106_CMS_Disk_Server_GDSS496_Data_Loss
  • Friday 14th Jan. All three partitions on GDSS283 (CMSFarmRead) (which had failed on 25th December) were found to be corrupt. The result was the loss of 30 CMS files. A Post Mortem has been prepared for this data loss incident. See:
 https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20101225_CMS_Disk_Server_GDSS283_Data_Loss

A replacement server is being prepared to go into CMSFarmRead.

  • LHCb tape migrations blocking: This problem has been ongoing for some weeks and is caused by files with bad checksums resulting from failed (i.e. incomplete) transfers of the files into Castor. These files cause migrations to tape to block. A fix, updating a component of the LHCb Castor instance to a newer version, is being rolled out today (Wednesday 2nd Feb). Whether to roll this out to the other Castor instances will be reviewed after this is complete. A sketch of the kind of checksum comparison involved follows this list.
  • We are aware of a problem with the Castor Job Manager that can occasionally hang. This happened once during the last week (on Sunday 23rd January for Atlas) and remains an open issue. The service recovers by itself after around 30 minutes. Work is progressing on automating the capture of more diagnostic information when this occurs.
  • We have previously reported a problem with disk servers becoming unresponsive. We believe we understand the problem and are testing the fix, which is for the system logger (rsyslog) to send to the central loggers using UDP rather than TCP connections (see the sketch after this list). One updated server (GDSS310, CMSWanIn) did refuse ssh connections for some hours on Sunday (30th Jan), although the system did not show the other symptoms (large numbers of processes) previously seen in these cases and was still handling Castor requests.
  • Transformer TX2 in R89 is still out of use.
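
As background to the blocked-migrations item above, the following is a minimal sketch of the kind of checksum comparison involved: computing a file's Adler-32 sum (the checksum type generally used with Castor) and comparing it with the recorded value, a mismatch being the signature of an incomplete transfer. It is an illustration only, not the Castor tooling or the fix being deployed; the script usage shown is hypothetical.

 # Illustrative sketch: compare a file's Adler-32 checksum with the value
 # recorded for it (e.g. in a catalogue). A mismatch is the signature of an
 # incomplete or corrupted transfer of the kind that blocks tape migration.
 # This is not the Castor tooling or the fix being rolled out.
 import sys
 import zlib

 def adler32_of(path: str, chunk_size: int = 1024 * 1024) -> str:
     """Return the Adler-32 checksum of a file as 8 hex digits."""
     checksum = 1  # Adler-32 starts at 1
     with open(path, "rb") as f:
         for chunk in iter(lambda: f.read(chunk_size), b""):
             checksum = zlib.adler32(chunk, checksum)
     return format(checksum & 0xFFFFFFFF, "08x")

 def matches_recorded(path: str, expected_hex: str) -> bool:
     """True if the on-disk checksum agrees with the recorded value."""
     return int(adler32_of(path), 16) == int(expected_hex, 16)

 if __name__ == "__main__":
     # Usage (hypothetical): python check_adler32.py <file> <expected-hex>
     if len(sys.argv) == 3:
         ok = matches_recorded(sys.argv[1], sys.argv[2])
         print("checksum OK" if ok else "CHECKSUM MISMATCH")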
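
For the rsyslog change described above, the practical difference between the two transports is a single character in the legacy forwarding rule: "@@host" forwards over TCP, "@host" over UDP. The sketch below simply reports which form a configuration file uses; it is illustrative only, not the change actually deployed on the disk servers, and the default path is just the usual location of the rsyslog configuration.

 # Illustrative sketch: report whether rsyslog forwarding rules use TCP or
 # UDP. In the legacy rsyslog syntax "*.* @@loghost:514" forwards over TCP
 # while "*.* @loghost:514" forwards over UDP. This is a generic check, not
 # the deployment mechanism actually used on the disk servers.
 import re
 import sys

 FORWARD_RULE = re.compile(r"^\s*[^#\s]\S*\s+(@{1,2})(\S+)")

 def forwarding_rules(conf_path):
     """Yield (protocol, target) for each forwarding rule in the file."""
     with open(conf_path) as conf:
         for line in conf:
             match = FORWARD_RULE.match(line)
             if match:
                 proto = "TCP" if match.group(1) == "@@" else "UDP"
                 yield proto, match.group(2)

 if __name__ == "__main__":
     path = sys.argv[1] if len(sys.argv) > 1 else "/etc/rsyslog.conf"
     for proto, target in forwarding_rules(path):
         print("forwarding to %s over %s" % (target, proto))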

Declared in the GOC DB

  • 2nd February: At Risk on LHCb Castor for update to GridFTP component.
  • 3rd February: At Risk on Castor while Puppetmaster replaced.

Advanced warning:

The following items are being discussed and are still to be formally scheduled:

  • Outage of LFC/FTS & 3D for the update of the Oracle databases from version 10.2.0.4 to 10.2.0.5 - proposed for 9th February.
  • Update WAN tuning parameters on disk servers (CMS WanIn/Out - Feb 8th; rest of CMS - Feb 15th; all others - March 1st). See the sketch after this list for the kind of parameters involved.
  • Castor Name Server upgrade to 2.1.9
  • Upgrade to 64-bit OS on Castor disk servers to resolve checksumming problem for GEN instance.
  • Increase shared memory for OGMA, LUGH & SOMNUS (rolling change). (OGMA done).
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Castor 2.1.9-10 upgrade. Possibly late March.
  • Firmware updates for central networking components (likely to have some short network breaks - maybe in March)
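
Two of the items above - the WAN tuning parameters and the shared-memory increase for the database systems - come down to kernel tunables of the kind shown in this sketch, which simply prints their current values from /proc. The particular parameters listed are a typical illustrative selection and are not the actual settings scheduled to change.

 # Illustrative sketch: print the current values of the kind of kernel
 # tunables involved in WAN TCP tuning and database shared-memory sizing.
 # The parameter list is a typical selection for illustration only, not the
 # set of values actually being changed here.
 from pathlib import Path

 TUNABLES = [
     "net/core/rmem_max",   # maximum socket receive buffer
     "net/core/wmem_max",   # maximum socket send buffer
     "net/ipv4/tcp_rmem",   # TCP receive buffer min/default/max
     "net/ipv4/tcp_wmem",   # TCP send buffer min/default/max
     "kernel/shmmax",       # maximum size of one shared-memory segment
     "kernel/shmall",       # total shared-memory pages allowed
 ]

 if __name__ == "__main__":
     for name in TUNABLES:
         try:
             value = (Path("/proc/sys") / name).read_text().strip()
         except OSError:
             value = "(not available on this system)"
         print(name.replace("/", "."), "=", value)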

Entries in GOC DB starting between 26th January and 2nd February 2011.

There was one unscheduled entry, an "At Risk", which was for the update to the GridFTP component of Castor for LHCb.

Service                               | Scheduled?  | Outage/At Risk | Start            | End              | Duration                      | Reason
srm-lhcb.gridpp.rl.ac.uk              | UNSCHEDULED | AT_RISK        | 02/02/2011 10:30 | 02/02/2011 14:30 | 4 hours                       | Update to GridFTP component of Castor to resolve checksum problem for incomplete data transfers into Castor.
All Castor (srm endpoints) except CMS | SCHEDULED   | OUTAGE         | 31/01/2011 08:00 | 31/01/2011 15:40 | 7 hours and 40 minutes        | Castor stop during upgrade of Oracle databases.
srm-cms                               | SCHEDULED   | OUTAGE         | 31/01/2011 08:00 | 01/02/2011 09:48 | 1 day, 1 hour and 48 minutes  | Update to 64-bit OS for disk servers and Oracle update to Castor database.
All CEs (All batch)                   | SCHEDULED   | OUTAGE         | 30/01/2011 08:00 | 31/01/2011 15:40 | 1 day, 7 hours and 40 minutes | Batch drain and stop ahead of and during Castor stop for Oracle database updates.