Tier1 Operations Report 2012-01-25

RAL Tier1 Operations Report for 25th January 2012

Review of Issues during the week 18th to 25th January 2012.

  • On Wednesday (18th Jan) the Oracle 11 update of the Atlas 3D system took place. This overran into the evening but was completed successfully.
  • We have seen some failures of the Atlas SAM tests (along with corresponding file transfer failures). These are due to a known problem that causes a database deadlock. The deadlock is cleared automatically, but the service is affected for some ten to twenty minutes each time. As an example, there were two of these overnight Wed/Thu 18/19 Jan, with others occurring through the week. This is believed to be fixed in the planned SRM V2.11 update.
  • On Sunday morning (22nd Jan) there was a problem with two of the site DNS servers that led to an outage (declared in the GOC DB for around 7 hours).
  • On Monday afternoon (23rd Jan) (some hours after the successful CMS SRM update) there was an interruption to the CMS SRM for around 40 minutes owing to an operational error.

Resolved Disk Server Issues

  • GDSS400 (Atlas Tape - D0T1) was reported as having failed on Sunday morning (15th January). This system has been replaced with another server. GDSS400 will be used for spares.

Current operational status and issues.

  • None.

Ongoing Disk Server Issues

  • None.

Notable Changes made this last week

  • Atlas 3D database updated to Oracle version 11. (Wednesday 18th Jan).
  • cmsWanIn and cmsFarmRead diskpools were merged into cmsTapeShaun (Friday 20th Jan).
  • CMS SRM upgraded to Version 2.11 (Monday 23rd Jan.)
  • Updated Errata across worker nodes and many Grid services nodes.

Forthcoming Work & Interventions

  • Update to the Atlas SRM. (Monday 30th January - TBC)
  • Update to the LHCb SRM. (Wednesday 1st February - TBC)
  • Atlas plan to move their LFC from RAL to CERN during the week beginning 6th February. We are aiming to co-locate a number of other changes into this time.
  • Move of T10KB tape servers.

Declared in the GOC DB

  • Update to the GEN Castor instance SRM. (Thursday 26th January)
  • Update to the Castor Information Provider (CIP). (Thursday 26th January).
  • Update to Oracle 11 for the LHCb LFC and 3D databases. (Tuesday 31st January).

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. There is a significant amount of work that will need to be done during the LHC stop at the start of 2012.

  • Infrastructure:
    • Move part of the cooling system onto the UPS supply. (Should not require service interruption.) (Planned for Tuesday 28th February).
  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch Castor and LFC/FTS/3D to new Database Infrastructure (started)
  • Castor:
    • Castor 2.1.11 upgrade.
    • Replace hardware running Castor Head Nodes.
    • Move to use Oracle 11g.
  • Networking:
    • Changes required to extend range of addresses that route over the OPN.
    • Re-configure networking in UPS room.
    • Install new Routing & Spine layers.
  • Fabric:
    • BIOS/firmware updates, equipment moves in machine room (consolidation of equipment in racks; some rack moves), other re-configurations (adding IPMI cards, etc.)
    • Network changes, including:
      • Relocating the rack housing the central switch (C300) and resolving a known issue with the network in the UPS room.
      • Changes to accommodate new networking equipment.
  • Grid Services:
    • Updates of Grid Services (including LB, APEL, batch server) to UMD versions (mainly in new year).
  • VO:
    • Migrate Atlas LFC to CERN. (Expected in week beginning 6th Feb.)

Entries in GOC DB starting between 18th and 25th January 2012.

There was one unscheduled outage during this period. This was for the DNS problem on Sunday morning (22nd Jan).

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-cms.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 23/01/2012 10:00 | 23/01/2012 11:18 | 1 hour and 18 minutes | Update of CMS SRM to version 2.11
Whole site | UNSCHEDULED | OUTAGE | 22/01/2012 04:30 | 22/01/2012 11:35 | 7 hours and 5 minutes | possible DNS issues at RAL

Open GGUS Tickets

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
77026 | Red | Less Urgent | Waiting Reply | 2011-12-05 | 2012-01-16 | | BDII
74353 | Red | very urgent | Waiting Reply | 2011-09-16 | 2012-01-16 | Pheno | Proxy not renewing properly from WMS
68853 | Red | less urgent | On hold | 2011-03-22 | 2011-12-15 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)
68077 | Red | less urgent | solved (automatic ticket closing) | 2011-02-28 | 2012-01-25 | | Mandatory WLCG InstalledOnlineCapacity not published
64995 | Red | less urgent | solved (automatic ticket closing) | 2010-12-03 | 2012-01-25 | | No GlueSACapability defined for WLCG Storage Areas