Tier1 Operations Report 2012-02-08

From GridPP Wiki

RAL Tier1 Operations Report for 8th February 2012

Review of Issues during the week 1st to 8th February 2012.

  • During the last week, following the SRM update, the Atlas SRM was repeatedly crashing. It was being restarted automatically, but there were periods when one of the SRMs was not responding. On Thursday (2nd Feb) a more aggressive re-starter was put in place; on Friday (3rd) the cause was identified (related to the method used to update the gridmap files). An effective temporary workaround has been put in place until the final fix is rolled out.
  • On Tuesday afternoon, 7th Feb, there was a network problem that lasted around an hour and particularly affected off-site connectivity to the Tier1. This was traced to a broadcast storm triggered by some equipment added elsewhere to the network.
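The "more aggressive re-starter" mentioned above could take roughly the following shape. This is only an illustrative sketch: the daemon name, the probe, and the restart action are assumptions, not the actual RAL SRM configuration.

```shell
#!/bin/sh
# Hypothetical SRM watchdog sketch. The service name below is an
# assumption; a real deployment would probe the SRM endpoint itself.
SERVICE="srmdaemon"   # assumed daemon name, not the real one

# decide_action: given a count of live daemon processes, print the
# action the watchdog should take ("restart" or "ok")
decide_action() {
    if [ "$1" -eq 0 ]; then
        echo "restart"
    else
        echo "ok"
    fi
}

# Count matching processes; treat a failed probe as zero live daemons.
live=$(pgrep -cf "$SERVICE" 2>/dev/null || echo 0)

if [ "$(decide_action "$live")" = "restart" ]; then
    echo "watchdog: $SERVICE appears down; would restart it"
    # service "$SERVICE" restart   # real action, left out of this sketch
fi
```

Run from cron every minute or so, such a loop keeps the window of unresponsiveness short while the underlying bug (here, the gridmap file update method) is diagnosed.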

Resolved Disk Server Issues

  • None.

Current operational status and issues.

  • The performance issue with the hardware being used to temporarily house the Castor databases has been largely alleviated by optimisations carried out by the database team. Furthermore, the cause of the Atlas SRM crashes has been understood and fixed. Whilst the performance of this disk array is lower than would be liked, this is a temporary arrangement and work is progressing on moving the Castor databases to their final hardware.
  • Since around Thursday (2nd Feb) LHCb have had pilot job submission problems caused by old jobs 'stuck' in the CEs. (Subject of GGUS ticket #78873 - see below)
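Clearing 'stuck' jobs of this kind usually means selecting job IDs past some age threshold and purging them from the CE. The sketch below shows only the selection step; the job-list format, cutoff, and endpoint are illustrative assumptions, and the actual purge (via the gLite CREAM CLI's job-purge command) is mentioned but not scripted here.

```shell
#!/bin/sh
# Illustrative cleanup sketch for aged CE jobs. All names and the
# input format ("<jobid> <age-in-days>") are assumptions.
CUTOFF_DAYS=7   # assumed cutoff for calling a job 'stuck'

# select_old_jobs: read lines of "<jobid> <age-in-days>" on stdin and
# print the IDs of jobs at or beyond the cutoff
select_old_jobs() {
    awk -v cutoff="$CUTOFF_DAYS" '$2 >= cutoff { print $1 }'
}

# Example with a fabricated listing; a real run would build this list
# from the CE's job-status output.
printf '%s\n' \
    "https://ce.example.org:8443/CREAM123 12" \
    "https://ce.example.org:8443/CREAM456 2" | select_old_jobs
# Each selected ID would then be handed to the CREAM job-purge command.
```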

Ongoing Disk Server Issues

  • None

Notable Changes made this last week

  • Castor Information Provider (CIP) updated on Monday (6th Feb.)
  • Network intervention (on C300 switch and network in UPS room) as well as re-positioning six racks of disk servers. (Ongoing today Wed 8th Feb.)
  • First phase of the move to the new (UMD) batch server, with non-LHC VOs being moved across. (Ongoing today Wed 8th Feb.)

Forthcoming Work & Interventions

  • Following the Castor Nameserver Upgrade to version 2.1.11 on Tuesday 14th February, it is planned to update the Castor stagers on:
    • Monday 20 Feb CMS
    • Wednesday 22 Feb ATLAS
    • Monday 27 Feb LHCb
    • Wednesday 29 Feb Gen

Declared in the GOC DB

  • Re-installation of WMS03 for upgrade to UMD distribution (ongoing)
  • MyProxy service being switched to a new machine, effectively upgrading from gLite 3.1 to UMD versions. (At-Risk on Thursday 9th Feb.)
  • Upgrade of the Castor Nameserver to version 2.1.11 on Tuesday 14th February. This will take place at the same time as Atlas move their LFC from RAL to CERN. The Atlas TAGS system will also be moved during this time and the batch drain will be used to migrate LHC batch work to the new (UMD) batch server.

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. We are carrying out a significant amount of work during the current LHC stop.

  • Infrastructure:
    • Move part of the cooling system onto the UPS supply. (Should not require service interruption.) (Planned for Tuesday 28th February).
  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch Castor and LFC/FTS/3D to new Database Infrastructure (started)
  • Castor:
    • Castor 2.1.11 upgrade (includes replacement of the hardware running the Castor head nodes).
    • Move to use Oracle 11g.
  • Networking:
    • Changes required to extend range of addresses that route over the OPN.
    • Install new Routing & Spine layers.
  • Fabric:
    • BIOS/firmware updates and other re-configurations (adding IPMI cards, etc.)
    • Network changes, including:
      • Changes to accommodate new networking equipment.
  • Grid Services:
    • Updates of Grid Services (including WMS, FTS, LFC front ends) to EMI/UMD versions.

Entries in GOC DB starting between 1st and 8th February 2012.

There was one unscheduled outage during this period, for problems with the Atlas Castor instance on Monday (30th Jan.)

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole site | SCHEDULED | OUTAGE | 08/02/2012 09:00 | 08/02/2012 16:00 | 7 hours | Outage for intervention on core network within the RAL Tier1.
All CEs (All batch) | SCHEDULED | OUTAGE | 07/02/2012 21:00 | 08/02/2012 09:00 | 12 hours | Drain of batch system ahead of intervention on core network within the RAL Tier1.
lcgwms03 | SCHEDULED | OUTAGE | 07/02/2012 15:00 | 10/02/2012 15:00 | 3 days | System unavailable - EMI installation
All Castor | SCHEDULED | WARNING | 06/02/2012 11:00 | 06/02/2012 14:00 | 3 hours | Update to the Castor Information Provider (CIP). This will not affect the Castor service directly, but will change some of the information reported to the BDIIs.
srm-lhcb | SCHEDULED | OUTAGE | 01/02/2012 10:00 | 01/02/2012 12:00 | 2 hours | SRM Upgrade to version 2.11.

Open GGUS Tickets

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
78873 | Amber | Urgent | In Progress | 2012-02-02 | 2012-02-07 | LHCb | Redundant jobs at CREAM CEs at RAL
77026 | Red | Less Urgent | Waiting Reply | 2011-12-05 | 2012-02-03 | | BDII
74353 | Red | Very Urgent | In Progress | 2011-09-16 | 2012-02-01 | Pheno | Proxy not renewing properly from WMS
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2011-12-15 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)