Tier1 Operations Report 2011-07-27

From GridPP Wiki

Latest revision as of 13:23, 27 July 2011

RAL Tier1 Operations Report for 27th July 2011

Review of Issues during the week from 20th to 27th July 2011.

  • On Friday (22nd July) there was a problem on the Atlas LFC that caused us to fail tests and put the UK cloud off-line for a while. This was traced to an Atlas user (who ran the tests) reaching the maximum number of directories (a million) that can be in a single directory in the LFC.
  • A problem seen particularly by CMS, whereby some FTS transfers failed to complete, has been diagnosed. The entire transfer took place but the FTS never received the notification of completion. The problem only occurred for transfers that took longer than around 2000 seconds. This was traced to the removal of a firewall rule on 14th July that effectively dropped the time-out on the connections through the firewall from 24 hours to about half an hour. The relevant rule was re-instated on Tuesday 26th and the problem was resolved.
  • Disk Server Issues:
    • Saturday (23rd July) gdss523 (CMSFarmRead D0T1) gave FSPROBE errors and was taken out of production. It was returned to production in read-only mode later that day.
    • Saturday (23rd July) just before midnight gdss335 (LHCbUser D1T0) failed with a kernel panic. It was returned to production on Tuesday morning (26th). There was initially some confusion about the status of this server resulting in delayed notification to LHCb.
    • Tuesday (26th July) gdss423 (AtlasDataDisk D1T0) developed a memory fault. It was taken out of production while its memory was replaced and tested and the disks fsck'd. It was returned to production the following morning (Wed 27th).
    • Wednesday (27th July) gdss434 (AtlasDataDisk D1T0) also showed memory faults and has been taken out of production.
  • Changes made this last week:
    • None
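
As a rough illustration of the FTS time-out problem described in the list above, the sketch below checks whether a transfer of a given duration would outlast the firewall connection time-out. The two time-out values are the approximate figures quoted in this report ("24 hours" and "about half an hour"); the function name and the sample durations are hypothetical and are not taken from the FTS or the firewall configuration.

```python
# Illustrative sketch only. Time-out values are the approximate figures from
# the report; the function name and sample durations are hypothetical.
TIMEOUT_WITH_RULE_S = 24 * 60 * 60     # rule in place: about 24 hours
TIMEOUT_WITHOUT_RULE_S = 30 * 60       # rule removed: about half an hour


def completion_notice_at_risk(transfer_duration_s, firewall_timeout_s):
    """Return True if the transfer runs longer than the firewall time-out,
    i.e. the connection may be dropped before completion is reported."""
    return transfer_duration_s > firewall_timeout_s


if __name__ == "__main__":
    for duration_s in (600, 2000, 7200):
        without_rule = completion_notice_at_risk(duration_s, TIMEOUT_WITHOUT_RULE_S)
        with_rule = completion_notice_at_risk(duration_s, TIMEOUT_WITH_RULE_S)
        print(duration_s, "s: at risk without rule =", without_rule,
              ", with rule =", with_rule)
```

With a time-out of about 1800 seconds, a transfer of around 2000 seconds falls just past the limit, which is consistent with the observed threshold.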

Current operational status and issues.

  • Following a routine maintenance check, a problem has been located on the 11kV feed into the computer building, with an intermittent short taking place. Investigations are ongoing to isolate exactly where this occurs and understand how it can be fixed.
  • The following points are unchanged from previous reports:
    • Atlas reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). CMS seems to experience this as well (but between RAL and foreign T2s). The pattern of asymmetrical flows appears complex and is being actively investigated.
    • We have observed some packet loss on the main network link from the RAL site (not the route used by our data). The RAL networking team are actively investigating this problem.

Declared in the GOC DB

  • None

Advanced warning:

The following items are being discussed and are still to be formally scheduled and announced:

  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Networking upgrade to provide sufficient bandwidth for T10KC tapes.
  • Microcode updates for the tape libraries are due.
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.
  • Further updates to CEs: (CE06 de-commissioning; gLite updates on CE09 outstanding).

Entries in GOC DB starting between 20th and 27th July 2011.

None