Tier1 Operations Report 2012-01-18

RAL Tier1 Operations Report for 18th January 2012

Review of Issues during the week 11th to 18th January 2012.

  • On Thursday (12th Jan) we had a couple of Atlas SAM test failures caused by a deadlock in the Atlas stager database, which occurred between 09:00 and 09:15. The deadlock cleared itself but did cause a set of file transfer failures.
  • On Monday (16th Jan) there was a problem with the network link between the UKLight and SAR routers that started at 12:18 and was fixed at 14:01. This affected data transfers to/from Tier2s that do not go over the OPN.
  • The planned updates to the SRMs ran into problems. The update to srm-atlas was delayed from Monday (16th) to Tuesday (17th), partly owing to the network problem (above) and partly because the work planned centrally by Atlas only took place on the Tuesday (17th). However, this update encountered problems and was backed out. The planned updates to the SRMs for the remaining Castor instances have been cancelled. These updates will be rescheduled once the problem is understood.

Resolved Disk Server Issues

  • GDSS568 (Atlas DataDisk) was taken out of production around midday on Monday (16th Jan) owing to a memory error. It was returned to production later that afternoon after the memory had been replaced.

Current operational status and issues.

  • None.

Ongoing Disk Server Issues

  • GDSS400 (Atlas Tape - D0T1) failed around 03:00 on Sunday morning (15th January) with FSProbe errors. There were no files awaiting migration to tape and the server was withdrawn from production. It is undergoing tests.

Notable Changes made this last week

  • APEL update to UMD distribution. (Monday 16th Jan).
  • Doubled CPU/wall time limits for grid4000M queue (used by Atlas)

Forthcoming Work & Interventions

  • Update to Oracle 11 for Atlas 3D services (Ongoing today).
  • Update to the Castor Information Provider (CIP). (Thursday 26th January).
  • Atlas plan to move their LFC from RAL to CERN during the week beginning 6th February. We are aiming to schedule a number of other changes for the same period.

Declared in the GOC DB

  • Update to Oracle 11 for the LHCb LFC and 3D databases. (Tuesday 31st January).

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. There is a significant amount of work that needs to be done during the LHC stop at the start of 2012.

  • Infrastructure:
    • Move part of the cooling system onto the UPS supply. (Should not require service interruption.) (Planned for Tuesday 28th February).
  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch Castor and LFC/FTS/3D to new Database Infrastructure (started)
  • Castor:
    • SRM updates to version 2.11. (Need to be rescheduled)
    • Castor 2.1.11 upgrade.
    • Replace hardware running Castor Head Nodes.
    • Move to use Oracle 11g.
  • Networking:
    • Changes required to extend range of addresses that route over the OPN.
    • Re-configure networking in UPS room.
    • Install new Routing & Spine layers.
    • Final updates to the RAL DNS infrastructure (two DNS servers still to replace)
  • Fabric:
    • BIOS/firmware updates, equipment moves in machine room (consolidation of equipment in racks; some rack moves), other re-configurations (adding IPMI cards, etc.)
    • Network changes, including:
      • Relocating the rack housing the central switch (C300) and resolving a known issue with the network in the UPS room.
      • Changes to accommodate new networking equipment.
  • Grid Services:
    • Updates of Grid Services (including LB, APEL, batch server) to UMD versions (mainly in new year).
  • VO:
    • Migrate Atlas LFC to CERN. (Expected in week beginning 6th Feb.)

Entries in GOC DB starting between 11th and 18th January 2012.

There were no unscheduled outages during this period.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
castor 'GEN' instance SRMs (srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k) | SCHEDULED | WARNING | 17/01/2012 09:00 | 17/01/2012 16:00 | 7 hours | SRM upgrade
lcgapel0676.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 16/01/2012 12:00 | 16/01/2012 14:00 | 2 hours | Migration from glite to UMD.
srm-atlas.gridpp.rl.ac.uk | SCHEDULED | WARNING | 16/01/2012 09:00 | 16/01/2012 14:45 | 5 hours and 45 minutes | Atlas SRMs being upgraded
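
The durations listed for these entries can be cross-checked against their start and end times. The following is a minimal sketch in Python, assuming timestamps in the DD/MM/YYYY HH:MM form used above; the function name downtime_duration is hypothetical and is not part of any GOC DB tooling.

```python
from datetime import datetime


def downtime_duration(start: str, end: str) -> str:
    """Return the time between two 'DD/MM/YYYY HH:MM' timestamps as text."""
    fmt = "%d/%m/%Y %H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    total_minutes = int(delta.total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)

    # Build the text from the non-zero parts only.
    parts = []
    if hours:
        parts.append(f"{hours} hour" + ("s" if hours != 1 else ""))
    if minutes:
        parts.append(f"{minutes} minute" + ("s" if minutes != 1 else ""))
    return " and ".join(parts) or "0 minutes"


# The srm-atlas entry: 16/01/2012 09:00 to 16/01/2012 14:45
print(downtime_duration("16/01/2012 09:00", "16/01/2012 14:45"))
# prints "5 hours and 45 minutes"
```

Applying the same function to the other two entries gives "7 hours" and "2 hours", matching the durations recorded for them.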

Open GGUS Tickets

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
77026 | Red | Less Urgent | Waiting Reply | 2011-12-05 | 2012-01-16 | | BDII
74353 | Red | very urgent | Waiting Reply | 2011-09-16 | 2012-01-16 | Pheno | Proxy not renewing properly from WMS
68853 | Red | less urgent | On hold | 2011-03-22 | 2011-12-15 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)
68077 | Red | less urgent | in progress | 2011-02-28 | 2011-09-20 | | Mandatory WLCG InstalledOnlineCapacity not published
64995 | Red | less urgent | in progress | 2010-12-03 | 2011-09-20 | | No GlueSACapability defined for WLCG Storage Areas