Tier1 Operations Report 2011-08-31

From GridPP Wiki

RAL Tier1 Operations Report for 31st August 2011

Review of Issues during the week from 24th to 31st August 2011.

  • Note that RAL was closed on both Monday and Tuesday (29/30 August). During the long weekend services ran as normal. There were some intermittent SAM test failures (on the Atlas SRM and on the non-CREAM CE, CE06).

Resolved Disk Server Issues

  • gdss346 (AtlasMCTape) crashed in the early hours of Thursday (26th Aug) with a reported memory fault. During the day it ran memtest for some time without issues and was returned to service later that day.

Current operational status and issues.

  • A routine maintenance check found a problem on the 11kV feed into the computer building, with an intermittent short taking place. The fault has now been located and, following some internal switching, the discharge has stopped. However, an intervention on the power systems in the building is still required; the extent of any power outage during this work is not yet known.
  • The problem of packet loss on the main network link from the RAL site remains. The RAL networking team is actively investigating this problem.
  • Atlas reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). CMS seems to experience this as well (but between RAL and foreign T2s). The pattern of asymmetrical flows appears complex and is being actively investigated. We are in contact with people in the US who are investigating a similar issue.

Ongoing Disk Server Issues

  • None

Notable Changes made this last week

  • A number of new disk servers have been deployed during the last week:
    • 9 disk servers to AtlasdataDisk
    • 5 disk servers to AtlasGroupDisk
    • 5 disk servers to LHCbDst

Forthcoming Work & Interventions

  • Provisionally: Wednesday 7th September: Apply Oracle Security updates ("CPU") to the databases behind the LFC, FTS & 3D services.

Declared in the GOC DB

  • None.

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced:

  • Intervention to fix problem on 11kV power feed to building and connect up some parts of the cooling system to the UPS. This is being planned but may require a complete outage (including systems on UPS).
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.
  • Networking (routing) change relating to https traffic outbound.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Networking upgrade to provide sufficient bandwidth for T10KC tapes.
  • Microcode updates for the tape libraries are due.
  • Further updates to CEs (CE06 decommissioning; gLite updates on CE09 outstanding).
  • Replace hardware running Castor Head Nodes (aimed for end of year).

Entries in GOC DB starting between 24th and 31st August 2011.

There were no entries during this week.