Tier1 Operations Report 2012-02-22

From GridPP Wiki

RAL Tier1 Operations Report for 22nd February 2012

Review of Issues during the week 15th to 22nd February 2012.

  • There were some failures of the Atlas SRM SAM tests early on Friday morning. There is a known problem with the Atlas SRMs, which is currently worked around by an aggressive re-starter. On this occasion the re-starter failed, and one of the SRMs was restarted manually after a call-out.

Resolved Disk Server Issues

  • None.

Current operational status and issues.

  • There is a known issue with the Atlas SRMs (see above).
  • There is a problem with some batch job submission. It is believed to occur when a VO uses information from the BDII in the submission process, and it was exposed by last week's batch server upgrade.

Ongoing Disk Server Issues

  • None.

Notable Changes made this last week

  • On Monday (20th February) the CMS Castor instance was upgraded to version 2.1.11-8, with new hardware being introduced for the Atlas Castor head nodes.
  • The same update for the Atlas Castor instance was completed this morning (Wed. 22nd Feb.).

Forthcoming Work & Interventions

  • Thursday 23 Feb. Application of Oracle "PSU" patches to Atlas 3D & LHCb 3D/LFC systems ("OGMA" & "LUGH")
  • Tuesday 28th Feb (morning). Electrical work to prepare for moving part of the cooling system onto the UPS supply. Some other electrical work continues for the whole week (27 Feb - 2 Mar).
  • Week beginning 5th March (TBC) FTS update to version 2.2.8.

Declared in the GOC DB

  • Monday 27 Feb. Upgrade of LHCb Castor instance to version 2.1.11-8.
  • Wednesday 29 Feb. Upgrade of GEN Castor instance to version 2.1.11-8.

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. We are carrying out a significant amount of work during the current LHC stop.

  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch Castor and LFC/FTS/3D to new Database Infrastructure (started)
      • Next step of these changes is to move Castor databases and enable Data Guard.
  • Castor:
    • Update the Castor Information Provider (CIP) (Need to re-schedule.)
    • Move to use Oracle 11g (requires a minor Castor update.)
  • Networking:
    • Changes required to extend range of addresses that route over the OPN.
    • Install new Routing & Spine layers.
  • Fabric:
    • BIOS/firmware updates, other re-configurations (adding IPMI cards, etc.)
  • Grid Services:
    • Updates of Grid Services (including WMS, FTS, MyProxy, LFC front ends) to EMI/UMD versions.

Entries in GOC DB starting between 15th and 22nd February 2012.

There were no unscheduled outages during this period.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgvo05.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 22/02/2012 11:00 | 21/02/2013 12:00 | 365 days, 1 hour | Outage on Atlas vobox for Alastair to investigate
srm-atlas.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 22/02/2012 08:00 | 22/02/2012 16:00 | 8 hours | Update of Atlas Castor instance to version 2.1.11-8
srm-cms.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 20/02/2012 08:00 | 20/02/2012 15:35 | 7 hours and 35 minutes | Update of CMS Castor instance to version 2.1.11-8
lcgwms01.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 09/02/2012 15:00 | 15/02/2012 12:00 | 5 days, 21 hours | System unavailable - EMI installation
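
The durations listed for the GOC DB entries above follow directly from the start and end timestamps. As a minimal sketch of that arithmetic (the function name `outage_duration` and the output wording are illustrative assumptions, not part of the GOC DB), the figures can be cross-checked in Python:

```python
from datetime import datetime

# Day/month/year timestamps, as they appear in the entries above.
FMT = "%d/%m/%Y %H:%M"


def outage_duration(start: str, end: str) -> str:
    """Return the elapsed time between two timestamps as days, hours and minutes.

    Hypothetical helper for cross-checking the Duration column.
    """
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    total_minutes = delta.days * 24 * 60 + delta.seconds // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)

    parts = []
    for value, unit in ((days, "day"), (hours, "hour"), (minutes, "minute")):
        if value:
            parts.append(f"{value} {unit}" + ("" if value == 1 else "s"))
    return ", ".join(parts) or "0 minutes"


if __name__ == "__main__":
    # The long-running Atlas vobox outage spans 29 February 2012.
    print(outage_duration("22/02/2012 11:00", "21/02/2013 12:00"))
    print(outage_duration("09/02/2012 15:00", "15/02/2012 12:00"))
```

The sketch joins the parts with commas, so the CMS entry would come out as "7 hours, 35 minutes" rather than the "7 hours and 35 minutes" wording used in the listing.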

Open GGUS Tickets

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
79428 | Green | Less Urgent | In Progress | 2012-02-21 | 2012-02-21 | SNO+ | glite-wms-job aborted
79720 | Green | Very Urgent | Waiting Reply | 2012-02-21 | 2012-02-22 | t2k.org | All jobs failing at RAL
79283 | Red | Top Priority | In Progress | 2012-02-16 | 2012-02-22 | LHCb | Job publishing problem for LHCb at RAL
77026 | Red | Less Urgent | In Progress | 2011-12-05 | 2012-02-03 | | BDII
74353 | Red | Very Urgent | Waiting Reply | 2011-09-16 | 2012-02-10 | Pheno | Proxy not renewing properly from WMS
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-02-21 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)