Tier1 Operations Report 2012-02-01

From GridPP Wiki

RAL Tier1 Operations Report for 1st February 2012

Review of Issues during the week 25th January to 1st February 2012.

  • On Thursday (26th Jan) it was noted that there had been problems running Alice batch jobs for the last day or so. The solution was to disable CVMFS for Alice, as it was making some old versions of software available to the worker nodes.
  • Over the weekend we were failing to run LHCb batch jobs (and failing SAM tests) owing to a configuration error introduced when moving LHCb away from the high-memory (6GB) queue.
  • On Monday (30th Jan) there were problems with the Atlas Castor instance. The update of the Atlas SRM was successful; however, by coincidence the Atlas Castor stager database suffered a fragmentation problem. This badly affected performance and a four-hour outage was declared for the Atlas SRM.

Resolved Disk Server Issues

  • On Sunday (29th Jan) GDSS336 (GEN tape - D0T1) failed with a read-only file system. This system was already in the process of being withdrawn from service.
  • On Monday (30th Jan) GDSS445 (AtlasDataDisk - D1T0) was out of service for a little under two hours while memory was replaced.

Current operational status and issues.

  • We are seeing some problems with the Atlas Castor instance. The SRM update has fixed a problem of database deadlocks; however, this has exposed a performance issue with the hardware temporarily housing the Castor databases.

Ongoing Disk Server Issues

  • None

Notable Changes made this last week

  • GEN SRM upgraded to Version 2.11 (Thursday 26th Jan.)
  • Atlas SRM upgraded to Version 2.11 (Monday 30th Jan.)
  • Move of T10KB tape servers (only used by CMS) (Tuesday 31st Jan.)
  • Upgrade of LHCb databases (3D & LFC) to Oracle 11. (Tuesday 31st Jan.)
  • LHCb SRM upgraded to Version 2.11 (Wednesday 1st Feb.)

Forthcoming Work & Interventions

  • Network intervention to move the main C300 switch & re-configure networking in UPS room. Site Outage on Wednesday 8th Feb.
  • Atlas plan to move their LFC from RAL to CERN on Tuesday 14th February.
  • Castor 2.1.11 upgrade. This will start with the Nameserver update on Tuesday 14th February. (Will require Castor down.)
    • This will be followed by outages of the individual Castor instances as they are upgraded during the following week or so. (Proposed dates: CMS - Thu 16th Feb; Atlas - Mon 20th Feb; LHCb & GEN - Wed 22nd Feb.)
  • Updates to main batch server (Will require Farm drain) and MyProxy to be scheduled.
  • We are investigating offering a shared disk pool to all non-LHC VOs.

Declared in the GOC DB

  • Re-installation of WMS03 for upgrade to UMD distribution (ongoing)

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. We are carrying out a significant amount of work during the current LHC stop.

  • Infrastructure:
    • Move part of the cooling system onto the UPS supply. (Should not require service interruption.) (Planned for Tuesday 28th February).
  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch Castor and LFC/FTS/3D to new Database Infrastructure (started)
  • Castor:
    • Castor 2.1.11 upgrade (includes Replacement of hardware running Castor Head Nodes.)
    • Move to use Oracle 11g.
  • Networking:
    • Changes required to extend range of addresses that route over the OPN.
    • Install new Routing & Spine layers.
  • Fabric:
    • BIOS/firmware updates; equipment moves in the machine room (consolidation of equipment in racks; some rack moves); other re-configurations (adding IPMI cards, etc.)
    • Network changes, including:
      • Changes to accommodate new networking equipment.
  • Grid Services:
    • Updates of Grid Services (including WMS, batch server, myProxy, FTS, LFC front ends) to EMI/UMD versions.

Entries in GOC DB starting between 25th January and 1st February 2012.

There was one unscheduled outage during this period, declared for the problems with the Atlas Castor instance on Monday (30th Jan.)

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-lhcb.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 01/02/2012 10:00 | 01/02/2012 12:00 | 2 hours | SRM upgrade to version 2.11.
lhcb-lfc.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 31/01/2012 09:00 | 31/01/2012 15:30 | 6 hours and 30 minutes | Update of Oracle software used by LHCb LFC and 3D databases.
lcgce09.gridpp.rl.ac.uk, srm-atlas.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 30/01/2012 16:45 | 30/01/2012 20:45 | 4 hours | Outage while we investigate problems on the ATLAS SRM.
lcgwms03.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 30/01/2012 10:00 | 07/02/2012 12:00 | 8 days, 2 hours | System unavailable for drain then re-installation of operating system.
srm-atlas.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 30/01/2012 10:00 | 30/01/2012 12:00 | 2 hours | Update of Atlas SRM to version 2.11.
Castor GEN instance (srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k) | SCHEDULED | OUTAGE | 26/01/2012 10:00 | 26/01/2012 11:25 | 1 hour and 25 minutes | Update to SRM 2.11 for Castor GEN instance.

Open GGUS Tickets

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
77026 | Red | less urgent | Waiting Reply | 2011-12-05 | 2012-01-16 | | BDII
74353 | Red | very urgent | In Progress | 2011-09-16 | 2012-01-31 | Pheno | Proxy not renewing properly from WMS
68853 | Red | less urgent | On hold | 2011-03-22 | 2011-12-15 | | Retirement of SL4 and 32bit DPM Head Nodes and Servers (Holding Ticket for Tier2s)
68077 | Red | less urgent | In Progress (status reset after automatic ticket closing in error) | 2011-02-28 | 2012-01-26 | | Mandatory WLCG InstalledOnlineCapacity not published
64995 | Red | less urgent | In Progress (status reset after automatic ticket closing in error) | 2010-12-03 | 2012-01-26 | | No GlueSACapability defined for WLCG Storage Areas