Tier1 Operations Report 2011-07-20

RAL Tier1 Operations Report for 20th July 2011

Review of Issues during the week from 13th July to 20th July 2011.

  • Wednesday (13th July 2011) High load seen on Atlas Castor instance. We halved the FTS channels and stopped active draining of gdss208.
  • Thursday (14th July 2011) There was an outage of the Atlas Castor instance from approx 15:00 until 17:00. This was due to a database problem (There was a subrequest with no id2type entry).
  • Tuesday (19th July 2011) WMS02 (LHC) had problems caused by the configured limit on the number of gridFTP connections. The limit was increased by 200.
  • Tuesday (19th July 2011) Commenced draining the 2008 Streamline worker nodes so that they can be re-numbered and re-installed.
  • Wednesday (20th July 2011) Several batch workers were knocked offline by an Atlas user leaving files in /home/pool and filling it.
  • Disk Server Issues:
    • Thursday (14th July 23:13) gdss190 (AtlasScratchDisk D1T0) was removed from service after failing a read-only file system check. It was checked over and returned to service on Friday.
    • Monday morning (18th July) gdss195 (AtlasScratchDisk D1T0) was removed from service due to a double drive failure. It was returned to service on Tuesday morning (19th July).
    • Wednesday (20th July) gdss96 (cmsWanIn D0T1) suffered a kernel panic and was removed from service.
  • Changes made this last week:
    • Thursday (14th) Switched all Castor instances to use the LRU garbage collection policies (illustrated by the sketch after this list).
    • Thursday (14th) Set default ACLs for CMS.
    • Friday (15th) Kernel/errata updates were applied to the CMS software server lcg0616.
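
The change above to LRU garbage collection can be summarised as: when a disk pool needs space, the files that have gone longest without being accessed are removed first. The sketch below is a minimal Python illustration of that idea only; it is not Castor's garbage collector, and the pool path and space target in it are invented for the example.

  import heapq
  import os

  def lru_candidates(pool_dir, bytes_to_free):
      """Pick files to evict, least recently accessed first,
      until at least bytes_to_free would be reclaimed."""
      files = []
      for root, _dirs, names in os.walk(pool_dir):
          for name in names:
              path = os.path.join(root, name)
              st = os.stat(path)
              # Smallest access time = least recently used.
              heapq.heappush(files, (st.st_atime, st.st_size, path))
      victims, freed = [], 0
      while files and freed < bytes_to_free:
          _atime, size, path = heapq.heappop(files)
          victims.append(path)
          freed += size
      return victims

  if __name__ == "__main__":
      # Hypothetical pool path and 10 GB target; a real garbage collector
      # is driven by the stager database rather than a filesystem walk.
      for path in lru_candidates("/castor/pool", 10 * 1024**3):
          print("would evict", path)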

Current operational status and issues.

  • The following points are unchanged from previous reports:
    • Atlas reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). CMS appears to experience this as well (but between RAL and foreign Tier2s). The pattern of asymmetrical flows appears complex and is being actively investigated.
    • We have observed some packet loss on the main network link from the RAL site (not the route used by our data). An intervention was made on the 5th July but this does not appear to have fixed it; a simple packet-loss check is sketched below.
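
A very simple way to quantify loss on a link is to run a ping sample and read off the reported loss percentage, as in the minimal sketch below. The target hostname and ping count are placeholders; this is only an illustrative check, not the method being used in the ongoing investigation.

  import re
  import subprocess

  def packet_loss_percent(host, count=50):
      """Run ping and return the packet-loss percentage it reports."""
      result = subprocess.run(
          ["ping", "-c", str(count), "-q", host],
          capture_output=True, text=True, check=False,
      )
      match = re.search(r"(\d+(?:\.\d+)?)% packet loss", result.stdout)
      return float(match.group(1)) if match else None

  if __name__ == "__main__":
      # Placeholder target; substitute a host on the link under test.
      print(packet_loss_percent("remote-host.example.org"))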

Declared in the GOC DB

  • None

Advance warning:

The following items are being discussed and are still to be formally scheduled and announced:

  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Networking upgrade to provide sufficient bandwidth for T10KC tapes.
  • Microcode updates for the tape libraries are due.
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.
  • Further updates to CEs (CE06 de-commissioning; Glite updates on CE09 outstanding).
  • There is a requirement to re-number 4 clusters of worker nodes.
  • There is a need to reboot all the VO software servers for kernel updates and errata.

Entries in GOC DB starting between 13th July and 20th July 2011.

There was an unscheduled outage due to the Castor/database problems on the 14th.

Service                   | Scheduled?  | Outage/At Risk | Start            | End              | Duration               | Reason
srm-atlas.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE         | 14/07/2011 15:00 | 14/07/2011 17:18 | 2 hours and 18 minutes | Downtime while we investigate Castor problems