
RAL Tier1 Operations Report for 15th February 2012

Review of Issues during the week 8th to 15th February 2012.

  • We note the significant planned outage on Wednesday (8th Feb.) during work on the core Tier1 network.
  • On the morning of Thursday (9th Feb) there was a problem with the network link that provides the 'bypass' route for data traffic to Tier2s, which was down between roughly 07:40 and 09:20. The fix was a disable/enable of the link between the "Site Access" and "UKLight" routers.
  • On Thursday (9th Feb) there were problems with the Castor Information Provider (CIP): it was found to be serving cached information. The change made to resolve this led to problems with SAM tests, and the solution was to revert the CIP change.
  • On Thursday (9th Feb) the planned update to the MyProxy server ran into problems and was backed out.
  • Overnight Thursday/Friday some of the Atlas SRMs crashed. The cause is partly understood and is linked to a similar problem seen last week (see the liveness-check sketch after this list).
  • On Sunday (12th Feb, 07:04) there was a repeat of the network link problem between the "Site Access" and "UKLight" routers. The Networks team believe that they understand the problem.
  • Tuesday (14th Feb): Site outage for various upgrades.
  • Tuesday (14th Feb): There was a site-wide power anomaly at approximately midday (12:00). Three racks of disk servers rebooted; a small number required manual intervention, but all returned to production.
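
As noted above, some of the Atlas SRMs crashed overnight. Purely for illustration, the sketch below shows the sort of basic liveness probe that can flag a crashed SRM daemon quickly; the hostnames and port number are placeholder assumptions, not the actual RAL configuration.

  import socket

  # Hypothetical endpoint names and port, used purely for illustration.
  ENDPOINTS = ["srm-example-atlas.example.org", "srm-example-cms.example.org"]
  SRM_PORT = 8443  # assumed SRM (https) port

  def is_listening(host, port=SRM_PORT, timeout=5):
      """Return True if a TCP connection to host:port succeeds."""
      try:
          with socket.create_connection((host, port), timeout=timeout):
              return True
      except OSError:
          return False

  if __name__ == "__main__":
      for host in ENDPOINTS:
          state = "OK" if is_listening(host) else "NOT RESPONDING"
          print("%s:%d %s" % (host, SRM_PORT, state))

A check like this only confirms that something is accepting connections on the port; a full SRM-level test (such as the SAM probes mentioned above) is still needed to confirm the service is actually working.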

Resolved Disk Server Issues

  • gdss438 had faulty memory replaced during the Tier1 downtime.

Current operational status and issues.

  • Atlas are not running any work at RAL due to the LFC migration.

Ongoing Disk Server Issues

  • None

Notable Changes made this last week

  • Tuesday (14th February) Significant outage to:
    • Upgrade the Castor nameserver to 2.1.11-8.
    • Migrate the batch farm to use the new UMD batch server and upgrade the worker nodes.
    • Physically move the Atlas RAL TAGs database machines.
  • There is separate ongoing work to migrate the Atlas LFC from RAL to CERN.
  • lcgwms01 was upgraded to the EMI version.

Forthcoming Work & Interventions

  • Following the Castor Nameserver Upgrade to version 2.1.11 on Tuesday 14th February, it is planned to update the Castor stagers on the following dates. New hardware will be brought into use as the various Castor elements are upgraded.
    • Monday 20 Feb CMS
    • Wednesday 22 Feb ATLAS
    • Monday 27 Feb LHCb
    • Wednesday 29 Feb Gen

Declared in the GOC DB

  • Re-installation of WMS01 for upgrade to the UMD distribution (9-17 Feb, including the drain)
  • Atlas LFC is in downtime while it is being migrated. (14-02-2012 - 14-03-2012)
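
These downtimes are declared centrally in the GOC DB. For reference, the sketch below shows one way they could be listed programmatically; it assumes the GOCDB public programmatic interface exposes a 'get_downtime' method taking a 'topentity' parameter and returning XML with DOWNTIME elements carrying SEVERITY, DESCRIPTION, START_DATE and END_DATE children. These names are assumptions and should be checked against the GOCDB documentation before use.

  import urllib.request
  import xml.etree.ElementTree as ET

  # Assumed endpoint and query string; verify against the GOCDB documentation.
  URL = "https://goc.egi.eu/gocdbpi/public/?method=get_downtime&topentity=RAL-LCG2"

  def list_downtimes(url=URL):
      """Print one line per downtime declared for the site."""
      with urllib.request.urlopen(url) as response:
          tree = ET.parse(response)
      for dt in tree.getroot().iter("DOWNTIME"):
          # findtext() returns None if an element name differs from these assumptions.
          severity = dt.findtext("SEVERITY")
          description = dt.findtext("DESCRIPTION")
          start = dt.findtext("START_DATE")
          end = dt.findtext("END_DATE")
          print("%s: %s (%s -> %s)" % (severity, description, start, end))

  if __name__ == "__main__":
      list_downtimes()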

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. We are carrying out a significant amount of work during the current LHC stop.

  • Infrastructure:
    • Move part of the cooling system onto the UPS supply. (Should not require service interruption.) (Planned for Tuesday 28th February).
  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch Castor and LFC/FTS/3D to new Database Infrastructure (started)
  • Castor:
    • Update the Castor Information Provider (CIP) (Need to re-schedule.)
    • Move to use Oracle 11g.
  • Networking:
    • Changes required to extend the range of addresses that route over the OPN.
    • Install new Routing & Spine layers.
  • Fabric:
    • BIOS/firmware updates and other re-configurations (adding IPMI cards, etc.)
    • Network changes, including:
      • Changes to accommodate new networking equipment.
  • Grid Services:
    • Updates of Grid Services (including WMS, FTS, MyProxy, LFC front ends) to EMI/UMD versions.

Entries in GOC DB starting between 8th and 15th February 2012.

There were two unscheduled entries (an outage and a warning) during this period, both to deal with problems discovered after our planned upgrade.

Service Scheduled? Outage/At Risk Start End Duration Reason
All CE's UNSCHEDULED OUTAGE 14/02/2012 13:00 14/02/2012 16:00 3 hours Putting CE's into downtime while we investigate problems with batch system
All CASTOR Instances UNSCHEDULED WARNING 14/02/2012 13:00 14/02/2012 16:00 3 hours At-Risk on the tape system following upgrade of Castor nameserver
lfc-atlas.gridpp.rl.ac.uk SCHEDULED OUTAGE 14/02/2012 09:00 14/03/2012 12:00 29 days, 3 hours ATLAS LFC database migration to CERN
All CE's and CASTOR Instances SCHEDULED OUTAGE 14/02/2012 08:00 14/02/2012 13:00 5 hours Castor Nameserver Upgrade to version 2.1.11.
All CE's SCHEDULED OUTAGE 13/02/2012 20:00 14/02/2012 08:00 12 hours Batch drain ahead of intervention to upgrade Castor Nameserver to version 2.1.11.
lcgwms01.gridpp.rl.ac.uk SCHEDULED OUTAGE 09/02/2012 15:00 15/02/2012 12:00 5 days, 21 hours System unavailable - EMI installation
lcgrbp01.gridpp.rl.ac.uk SCHEDULED WARNING 09/02/2012 10:50 09/02/2012 12:20 1 hour and 30 minutes At-Risk for MyProxy service while service switched to new machine, effectively upgrading from glite 3.1 to UMD versions.
Whole site SCHEDULED OUTAGE 08/02/2012 09:00 08/02/2012 16:00 7 hours Outage for intervention on core network within the RAL Tier1.
All CEs SCHEDULED OUTAGE 07/02/2012 21:00 08/02/2012 09:00 12 hours Drain of batch system ahead of intervention on core network within the RAL Tier1.
lcgwms03.gridpp.rl.ac.uk SCHEDULED OUTAGE 07/02/2012 15:00 10/02/2012 15:00 3 days System unavailable - EMI installation

Open GGUS Tickets

GGUS ID Level Urgency State Creation Last Update VO Subject
77026 Red Less Urgent Waiting Reply 2011-12-05 2012-02-03 BDII
68853 Red Less Urgent On hold 2011-03-22 2011-12-15 Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)