Tier1 Operations Report 2012-02-29

RAL Tier1 Operations Report for 29th February 2012

Review of Issues during the week 22nd to 29th February 2012.

  • Some problems relating to batch job submission have been investigated and fixed. These include correcting some information reported to the information system ("resources_default.walltime") and applying a patch to WMS01 to fix a proxy delegation problem seen by LHCb.
  • On Tuesday (28th) there was a short (around 15 minute) network issue which had minimal impact on Tier1 operations, although we did see a spike in FTS transfer failures.

Resolved Disk Server Issues

  • None.

Current operational status and issues.

  • There is a known issue with the Atlas SRMs which is being investigated.

Ongoing Disk Server Issues

  • Wednesday 29th Feb. GDSS513 (LHCbDst - D1T0) removed from production following multiple drive failures.

Notable Changes made this last week

  • Monday 27 Feb. Upgrade of LHCb Castor instance to version 2.1.11-8.
  • Wednesday 29 Feb. Upgrade of GEN Castor instance to version 2.1.11-8. (Castor 2.1.11-8 upgrade now complete.)
  • Thursday 23 Feb. Application of Oracle "PSU" patches to Atlas 3D & LHCb 3D/LFC systems ("OGMA" & "LUGH")
  • Tuesday 28th Feb. Electrical work took place to prepare for moving part of the cooling system onto the UPS supply.
  • Updated drivers have been applied to the tape servers, which has increased the performance of the T10KB & C tape drives.

Forthcoming Work & Interventions

  • Next Tuesday (6th March) (TBC) Castor outage for the second (and final) step of the Castor database migration, which includes enabling Oracle Data Guard.
  • Next Tuesday (6th March) (TBC) Apply network routing change required to extend range of addresses that route over the OPN.
  • Week beginning 5th March (TBC) FTS update to version 2.2.8.

Declared in the GOC DB

  • None

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. We are carrying out a significant amount of work during the current LHC stop.

  • Databases:
    • Regular Oracle "PSU" patches are pending for SOMNUS (LFC & FTS).
    • Switch Castor and LFC/FTS/3D to new Database Infrastructure (started)
  • Castor:
    • Update the Castor Information Provider (CIP) (Need to re-schedule.)
    • Move to use Oracle 11g (requires a minor Castor update.)
  • Networking:
    • Install new Routing & Spine layers for Tier1 network.
    • Main RAL network updates - early summer.
  • Fabric:
    • BIOS/firmware updates, other re-configurations (adding IPMI cards, etc.)
  • Grid Services:
    • Updates of Grid Services (including WMS, MyProxy, LFC front ends) to EMI/UMD versions.

Entries in GOC DB starting between 22nd and 29th February 2012.

There were no unscheduled outages during this period.


Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Castor GEN instance (srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k) | SCHEDULED | OUTAGE | 29/02/2012 08:00 | 29/02/2012 16:00 | 8 hours | Update of GEN Castor instance to version 2.1.11-8
srm-lhcb | SCHEDULED | OUTAGE | 27/02/2012 08:00 | 27/02/2012 13:06 | 5 hours and 6 minutes | Update of LHCb Castor instance to version 2.1.11-8
lcgvo05 | SCHEDULED | WARNING | 22/02/2012 11:00 | 24/02/2012 14:35 | 2 days, 3 hours and 35 minutes | Outage on Atlas vobox for Alastair to investigate
srm-atlas | SCHEDULED | OUTAGE | 22/02/2012 08:00 | 22/02/2012 12:50 | 4 hours and 50 minutes | Update of Atlas Castor instance to version 2.1.11-8

Open GGUS Tickets

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
79732 | Green | Less Urgent | In Progress | 2012-02-28 | 2012-02-28 | hone | hone jobs after submission through lcgwms03.gridpp.rl.ac.uk WMS are at Waiting status too long time.
79545 | Red | Top Priority | Waiting Reply | 2012-02-23 | 2012-02-24 | LHCb | Zombie jobs at RAL
79428 | Red | Less Urgent | Waiting Reply | 2012-02-21 | 2012-02-23 | SNO+ | glite-wms-job aborted
77026 | Red | Less Urgent | In Progress | 2011-12-05 | 2012-02-28 | | BDII
74353 | Red | Very Urgent | Waiting Reply | 2011-09-16 | 2012-02-27 | Pheno | Proxy not renewing properly from WMS
68853 | Red | less urgent | On hold | 2011-03-22 | 2012-02-21 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)