Tier1 Operations Report 2011-04-27

RAL Tier1 Operations Report for 27th April 2011

Review of Issues during the week from 20th to 27th April 2011.

  • On Thursday (21st) there were problems with srm-atlas, which had started the previous evening. An outage was declared on srm-atlas, during which no new Atlas batch jobs were started. The problem was traced to database performance and was fixed by recalculating the database statistics.
  • On Thursday (21st) a transceiver in one of the pairs of network links from the C300 to the network stack for the tape system was replaced as the link was seen not to be working.
  • On Thursday (21st) CE09 was unavailable for a short while. Following a reboot for kernel updates it needed to check (fsck) its disks.
  • On Saturday evening (23rd) there was a problem with the Castor GEN instance, traced to the stager database. This was resolved on Sunday morning.
  • There has been a series of load issues on the Atlas software server over the Easter weekend and since. We have temporarily reduced the maximum number of Atlas production batch jobs.
  • Since Monday the LHCb RAW/RDst (D0T1) service class in Castor has been struggling. There were very high numbers of inbound file transfers, and over the weekend the two tape drives used for writes became stuck, blocking migration to tape. The disk area filled up. LHCb were also requesting files to be read from this service class; however, once files were recalled from tape, garbage collection deleted them before LHCb could read them. More disk drives have been added to speed up the tape migration (we have been running with six drives for writes during the last couple of days). Furthermore, the FTS channels have been throttled back. At the moment the backlog is slowly clearing.
  • On Monday (25th) Atlas reported a problem with CVMFS on some worker nodes. Some cache corruption was found. (It is believed this problem is resolved in the latest version of the CVMFS client.)
  • We have seen some intermittent problems with the site BDIIs during the last couple of days requiring restarts of the daemons.
  • There have been some problems with the AtlasScratchDisk in Castor which is very full.

Current operational status and issues.

  • Following a report from Atlas of failures accessing the LFC at RAL, we have been tracking some network issues at RAL that are believed to be the underlying cause. A change made on 19th April has improved the network, significantly reducing the rate of packet loss. We will continue to track this for another week before closing the issue.
  • Atlas have reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). The investigation into this is ongoing.
  • The long-standing problem of the Castor Job Manager occasionally hanging was seen again last week, for the first time since the Castor 2.1.10 update.

Declared in the GOC DB

  • None

Advance warning:

The following items are being discussed and are still to be formally scheduled:

  • SRM 2.10-2 upgrade for the Castor GEN instance. (Re-scheduled from 15th April).
  • Updates to Site Routers (the Site Access Router and the UKLight router) are required.
  • Upgrade Castor clients on the Worker Nodes to version 2.1.10 & Atlas request to add xrootd libraries to worker nodes.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Minor Castor update to enable access to T10KC tapes.
  • Networking upgrade to provide sufficient bandwidth for T10KC tapes.
  • Microcode updates for the tape libraries are due.
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.

Entries in GOC DB starting between 20th and 27th April 2011.

There was one unscheduled entry during this period, due to the problem on the Atlas SRM reported above.

Service   | Scheduled?  | Outage/At Risk | Start            | End              | Duration               | Reason
srm-atlas | UNSCHEDULED | OUTAGE         | 21/04/2011 09:45 | 21/04/2011 11:52 | 2 hours and 7 minutes  | Investigating problems with the Atlas SRM
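
The duration quoted for the srm-atlas outage can be cross-checked from its start and end times. The following is only a minimal sketch: the timestamps are copied from the entry above, while the helper name format_duration and the constant FMT are hypothetical names chosen for illustration and are not part of any GOC DB tooling.

```python
from datetime import datetime

# Timestamps copied from the GOC DB entry above (day/month/year, 24-hour clock).
FMT = "%d/%m/%Y %H:%M"  # hypothetical constant name, for illustration only
start = datetime.strptime("21/04/2011 09:45", FMT)
end = datetime.strptime("21/04/2011 11:52", FMT)


def format_duration(start, end):
    """Return the elapsed time between two datetimes as 'H hours and M minutes'."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours and {minutes} minutes"


outage_duration = format_duration(start, end)
print(outage_duration)
```

The same helper could be applied to any other entry, provided its start and end times are written in the same day/month/year format.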