Tier1 Operations Report 2011-05-04

RAL Tier1 Operations Report for 4th May 2011

Review of Issues during the week from 27th April to 4th May 2011.

  • There has been a series of load issues on the Atlas software server over the Easter weekend and since. We temporarily reduced the maximum number of Atlas production batch jobs. This cap was held in place through the bank holiday weekend (29th April - 2nd May) and lifted on Tuesday (3rd May).
  • Problems were reported last week with the LHCb RAW/RDst (D0T1) service class in Castor, which had been struggling. During last week the number of tape servers allocated to clearing the backlog of tape migrations was increased. Likewise, the main FTS channel that was producing the load (RALLCG2-RALLCG2) was restricted. Once the backlog of tape migrations had been cleared, the file recalls were able to proceed successfully (the files were no longer garbage collected before use). On Tuesday (3rd May) the RALLCG2-RALLCG2 channel was set to a more appropriate nominal value (5).
  • During last week there were problems with the AtlasScratchDisk in Castor. This service class contains 9 disk servers, of which 7 were full. Only the two larger disk servers had space. This led to the load being concentrated on just two servers. The load was reduced by throttling Atlas FTS channels and restricting the number of user/analysis (atlas_pilot) batch jobs. Free space was redistributed by draining a couple of the full servers for a while. On Thursday the writes to this area were going ahead successfully, but many reads were still failing. On Friday it was found that one of the disk servers was not responding to the Castor job manager (LSF) requests. Fixing this finally resolved the issues on the service class. On Tuesday (3rd May) the Atlas batch and file transfer limits were restored.
  • On Saturday (30th April) FSPROBE reported a problem on GDSS293 (CMSFarmRead - D0T1), which was removed from production. The server was put back in service on Sunday (1st May). On Tuesday (3rd May) the server was put into draining mode ahead of further investigations.

Current operational status and issues.

  • We are still seeing some intermittent problems with the site BDIIs. Until this is better understood the daemons are being restarted regularly.
  • Over the last two days there have been problems on a handful of the newer worker nodes with CVMFS.
  • Following a report from Atlas of failures accessing the LFC at RAL we have been tracking some network issues at RAL that are believed to be the underlying cause of this. A change was made on 19th April that has improved the network. The rate of packet loss was significantly reduced and this issue is now closed.
  • Atlas have reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). The investigation into this is ongoing.
  • The long standing problem with the Castor Job Manager occasionally hanging up has only been seen once since the Castor 2.1.10 update and this coincided with other networking issues. This issue is regarded as fixed, although the debugging checks remain in place should it recur.

Declared in the GOC DB

  • Tuesday 10th May 10:00-12:00: OS updates to FTS agent.

Advance warning:

  • Monday 9th May 11:00-15:00: Castor At Risk for Oracle quarterly updates.
  • Tuesday 10th May 09:00-11:00: At Risk on 3D services for Oracle patches.
  • Tuesday 10th May 12:00-14:00: At Risk on LFC & FTS services for Oracle patches.
  • Tuesday 10th May: SRM 2.10-2 upgrade for the Castor GEN instance (rescheduled from 15th April).
  • Overnight Tuesday-Wednesday 10th/11th May: Maintenance on OPN link.
  • Wednesday 11th May: Routine maintenance on R89 UPS.

The following items are being discussed and are still to be formally scheduled:

  • Updates to Site Routers (the Site Access Router and the UKLight router) are required.
  • Upgrade Castor clients on the worker nodes to version 2.1.10, and add xrootd libraries to the worker nodes (an Atlas request).
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Minor Castor update to enable access to T10KC tapes.
  • Networking upgrade to provide sufficient bandwidth for T10KC tapes.
  • Microcode updates for the tape libraries are due.
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.

Entries in GOC DB starting between 27th April and 4th May 2011.

There were no entries in the GOC DB for this period.