Tier1 Operations Report 2012-01-11

From GridPP Wiki

RAL Tier1 Operations Report for 11th January 2012

Review of Issues during the week 4th to 11th January 2012.

  • On Thursday 5th Jan there was a planned intervention on the Castor databases, along with some rack moves, which meant the APEL & FTS services were unavailable for some time during the day. During the rack moves it was necessary to power down the SRM nodes (amongst others). One of the Atlas SRM nodes failed to reboot and was removed from the DNS alias for srm-atlas.
  • On Friday 6th January there was a problem with the Atlas SRM. A reconfigured system was added to the srm-atlas alias around lunchtime but, owing to a misunderstanding, the appropriate firewall holes had not been opened for it. This was resolved during the afternoon.
  • On Tuesday 10th January there was a planned update to two of the DNS servers at RAL. These two servers were used by the Tier1, but in preparation we had reconfigured DNS on most of our systems to stop using them. However, this change had an unexpected side effect that appeared during the morning: reverse DNS lookups for Tier1 systems failed from outside RAL. It took some time to recognise the cause of the problems we were seeing (e.g. some, but not all, SAM tests failed). The problem itself was fixed late on Tuesday evening, and the RAL Tier1 downtime was ended on Wednesday morning.
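The kind of check that exposes this class of problem (forward resolution of a round-robin alias succeeding while reverse lookups of the underlying addresses fail) can be sketched in a few lines of Python. The srm-atlas hostname is taken from the report above; the helper functions are illustrative only and not part of any RAL tooling:

```python
import socket

def forward_addresses(name):
    """Return the sorted IPv4 addresses behind a DNS name or round-robin alias."""
    try:
        infos = socket.getaddrinfo(name, None, socket.AF_INET)
    except socket.gaierror:
        # Name does not resolve (or no resolver available).
        return []
    return sorted({info[4][0] for info in infos})

def reverse_lookup(addr):
    """Return the PTR hostname for an IP address, or None if reverse DNS fails."""
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(addr)
        return hostname
    except (socket.herror, socket.gaierror):
        return None

if __name__ == "__main__":
    # Check each address behind the alias; a None here is the symptom
    # seen from outside RAL on 10th January.
    for addr in forward_addresses("srm-atlas.gridpp.rl.ac.uk"):
        print(addr, "->", reverse_lookup(addr) or "REVERSE LOOKUP FAILED")
```

Run from a host outside the site, a script like this would have flagged the failing PTR records even while the forward lookups (and hence many SAM tests) continued to pass.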

Resolved Disk Server Issues

  • None

Current operational status and issues.

  • None.

Ongoing Disk Server Issues

  • None.

Notable Changes made this last week

  • Migration of the Castor databases to new hardware (Thursday 5th Jan). This is the first step in a series of changes to move all the Oracle databases to new hardware.
  • Errata & kernel updates applied to CEs & worker nodes. (Thursday 5th Jan)
  • Relocation of some racks of older equipment to make space in the computer room for new cooling arrangements. (Thursday 5th Jan)
  • lcgui01 updated to the UMD version of the UI. (Monday 9th January.)
  • Update of final pair of DNS servers at RAL to new hardware. (Tuesday 10th January).

Forthcoming Work & Interventions

  • Monday/Tuesday 16/17 January: Atlas is down centrally. We will be running various tests, but our services will remain up.
  • During week 16-20 Jan: SRM updates to version 2.11.

Declared in the GOC DB

  • Monday 16th January (12:00 - 14:00) APEL update to UMD distribution.

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. There is a significant amount of work that will need to be done during the LHC stop at the start of 2012.

  • Infrastructure:
    • Move part of the cooling system onto the UPS supply. (Should not require service interruption.)
  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch Castor and LFC/FTS/3D to new Database Infrastructure (started)
  • Castor:
    • Castor 2.1.11 upgrade.
    • Replace hardware running Castor Head Nodes.
    • Move to use Oracle 11g.
    • Update to the Castor Information Provider (CIP).
  • Networking:
    • Changes required to extend range of addresses that route over the OPN.
    • Re-configure networking in UPS room.
    • Install new Routing & Spine layers.
    • Final updates to the RAL DNS infrastructure (two DNS servers still to replace)
  • Fabric:
    • BIOS/firmware updates; equipment moves in the machine room (consolidation of equipment in racks; some rack moves); other re-configurations (adding IPMI cards, etc.)
  • Grid Services:
    • Updates of Grid Services (including LB, APEL, batch server) to UMD versions (mainly in new year).
  • VO:
    • Migrate Atlas LFC to CERN.

Entries in GOC DB starting between 4th and 11th January 2012.

There were two unscheduled outages during this period. These were both relating to the DNS (reverse lookup) problem of Tuesday 10th January.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole site | UNSCHEDULED | OUTAGE | 10/01/2012 17:00 | 11/01/2012 08:40 | 15 hours and 40 minutes | Putting whole site in downtime while we investigate network/DNS problems. Extended until 12:00, 11th January 2012.
Whole site | UNSCHEDULED | OUTAGE | 10/01/2012 12:00 | 10/01/2012 17:00 | 5 hours | Putting whole site in downtime while we investigate network/DNS problems.
lcgui01.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 09/01/2012 10:00 | 09/01/2012 13:00 | 3 hours | Outage to upgrade the UI middleware.
All Castor (srm-endpoints), lcgapel0676, lcgftm, lcgfts | SCHEDULED | OUTAGE | 05/01/2012 08:30 | 05/01/2012 16:00 | 7 hours and 30 minutes | Castor unavailable during migration of its Oracle databases to new hardware. Some re-organisation of racks in the computer room at the same time affects APEL and FTS.
All CEs | SCHEDULED | OUTAGE | 04/01/2012 20:00 | 05/01/2012 16:00 | 20 hours | Batch unavailable (with drain beforehand) during intervention on the Castor system.

Open GGUS Tickets

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
78063 | Green | Urgent | In Progress | 2012-01-11 | 2012-01-11 | CMS | Unable to create interface with rfcp-RAL protocol: Protocol rfcp-RAL not supported or unknown
77026 | Red | Less Urgent | In Progress | 2011-12-05 | 2011-12-15 | | BDII
74353 | Red | Very Urgent | In Progress | 2011-09-16 | 2011-12-07 | Pheno | Proxy not renewing properly from WMS
68853 | Red | Less Urgent | On Hold | 2011-03-22 | 2011-12-15 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)
68077 | Red | Less Urgent | In Progress | 2011-02-28 | 2011-09-20 | | Mandatory WLCG InstalledOnlineCapacity not published
64995 | Red | Less Urgent | In Progress | 2010-12-03 | 2011-09-20 | | No GlueSACapability defined for WLCG Storage Areas