Tier1 Operations Report 2011-07-13

From GridPP Wiki

Latest revision as of 14:13, 13 July 2011

RAL Tier1 Operations Report for 13th July 2011

Review of Issues during the week from 6th July to 13th July 2011.

  • Overnight Thursday/Friday (7/8 July) there were problems with WMS03 (non-LHC WMS), which became unresponsive.
  • Between 13:30 and 13:45 on Friday (8th July) there was a spate of failing Castor transfers, but the cause was not found.
  • Overnight Monday/Tuesday (11/12 July) we failed some CE SAM tests owing to missing CAs on a small number of worker nodes.
  • Disk Server Issues:
    • Friday (8th July): Read-only file system on gdss208 AtlasScratchDisk (D1T0). The server was out of production for around six hours before being returned to production in read-only mode. On Monday (11th July) the machine was put into draining mode. It will be drained and the RAID rebuilt.
    • Wednesday (13th July): There were a total of 7 drive failures overnight. Notably, gdss193 AtlasScratchDisk (D1T0) had a double drive failure and is currently out of production.
  • Changes made this last week:
    • Deployment of new disk servers into the lhcbRawRdst service class.

Current operational status and issues.

  • We have observed some packet loss on the main network link from the RAL site (not the route used by our data). An intervention was made on the 5th July but this does not appear to have fixed it. (?)
  • Issues with LHCb staging files for the LHCbRawRDst service class: The new disk servers deployed into this service class have increased the capacity of this area and should alleviate this problem.
  • The following points are unchanged from previous reports:
    • Atlas reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). CMS seems to experience this as well (but between RAL and foreign T2s). The pattern of asymmetrical flows appears complex and is being actively investigated.

Declared in the GOC DB

  • None

Advance warning:

The following items are being discussed and are still to be formally scheduled and announced:

  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Networking upgrade to provide sufficient bandwidth for T10KC tapes.
  • Microcode updates for the tape libraries are due.
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.
  • Further updates to CEs: (CE06 de-commissioning; gLite updates on CE09 outstanding).
  • There is a requirement to re-number 4 clusters of worker nodes.

Entries in GOC DB starting between 6th July and 13th July 2011.

There were no unscheduled entries in the GOCDB for this period.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgui02 | SCHEDULED | WARNING | 12/07/2011 10:00 | 12/07/2011 11:00 | 1 hour | Applying regular updates to UI.