Tier1 Operations Report 2012-01-04

RAL Tier1 Operations Report for 4th January 2012

The report covers the Christmas period during which Tier1 operations continued smoothly.

Review of Issues during the two weeks 21st December 2011 to 4th January 2012.

  • During the afternoon of Wednesday 21st December, between 13:09 and 13:45 there was a break in the network link between the Site Access Router (SAR) and the UKLight Router. This affected data transfers that do not go over the OPN.
  • This same link was again down from 01:30 to 08:45 during the morning of Thursday 22nd December. Again data transfers that do not go over the OPN were affected. (Retrospectively added to the GOCDB as an unscheduled warning).
  • On Wednesday/Thursday 28th/29th December regular checks detected a high rate of failures for accesses to AliceDisk. Remedial action was taken: the number of xrootd job slots was increased from 50 to 100 to accommodate all the requests, which had been timing out.
  • We have reported two lost files to LHCb in the first couple of days back after the holiday. These were separate incidents. One was picked up by the checksum checker (and followed the failure of GDSS463), the other was picked up following a failing FTS transfer and was recorded as having a size of zero in the Castor database.
  • A problem with CVMFS reported by Atlas that was present over the holiday period was traced to a failure of some replication at CERN.
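The two lost-file incidents above were caught in different ways: one by a checksum mismatch and one by a zero size recorded in the catalogue. A minimal sketch of both checks follows. It is not the Tier1's checksum checker. The function names, the Adler-32 algorithm, and the idea of passing in the recorded size and checksum are all assumptions made for illustration.

```python
import os
import zlib


def adler32_of(path, chunk_size=1 << 20):
    """Compute the Adler-32 checksum of a file, reading it in chunks."""
    checksum = 1  # standard Adler-32 starting value
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            checksum = zlib.adler32(chunk, checksum)
    return checksum & 0xFFFFFFFF


def check_file(path, recorded_size, recorded_adler32):
    """Compare a file on disk with the size and checksum recorded for it.

    recorded_size and recorded_adler32 stand in for values held in a
    catalogue or database (hypothetical inputs, not a Castor API).
    Returns a list of problem descriptions; an empty list means no problem.
    """
    problems = []
    if recorded_size == 0:
        problems.append("recorded size is zero")
    actual_size = os.path.getsize(path)
    if actual_size != recorded_size:
        problems.append("size mismatch")
    elif adler32_of(path) != recorded_adler32:
        problems.append("checksum mismatch")
    return problems
```

The checksum is only computed when the sizes agree, since a size mismatch already marks the file as suspect and reading a large file is the expensive step.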

Resolved Disk Server Issues

  • GDSS332 (LHCbDst - D1T0), which was reported as having failed during the morning of Wednesday 21st December, was returned to production during the afternoon of that day.
  • GDSS307 (CMSWanIn - D0T1) failed with a read-only file system on the evening of 26th December. It was returned to production the following morning following an intervention on site.
  • GDSS463 (LHCbDst - D1T0) failed with a read-only file system on the afternoon of 31st December. It was returned to production around lunchtime the following day.

Current operational status and issues.

  • None.

Ongoing Disk Server Issues

  • None.

Notable Changes made this last fortnight

  • None.

Forthcoming Work & Interventions

  • Tuesday 10th January: Update of final pair of DNS servers at RAL to new hardware.

Declared in the GOC DB

  • Thursday 5th January: First step of the migration of the Oracle database infrastructure. This is the move of the Castor databases and will cause an outage of Castor for some hours (with a batch drain beforehand). The opportunity will also be taken to move some racks of older nodes (required to create space for new deliveries) and to apply some patches to worker nodes.
  • Monday 9th January: Lcgui01 will be updated to the UMD version of the UI.

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. A significant amount of work will need to be done during the LHC stop at the start of 2012.

  • Infrastructure:
    • Move part of the cooling system onto the UPS supply. (Should not require service interruption.)
  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch Castor and LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.11 upgrade.
    • SRM 2.11 upgrade.
    • Replace hardware running Castor Head Nodes.
    • Move to use Oracle 11g.
    • Update to the Castor Information Provider (CIP).
  • Networking:
    • Changes required to extend range of addresses that route over the OPN.
    • Re-configure networking in UPS room.
    • Install new Routing & Spine layers.
    • Final updates to the RAL DNS infrastructure (two DNS servers still to replace).
  • Fabric:
    • BIOS/firmware updates, equipment moves in the machine room (consolidation of equipment in racks; some rack moves), other re-configurations (adding IPMI cards, etc.).
  • Grid Services:
    • Updates of Grid Services (including LB, APEL, batch server) to UMD versions (mainly in new year).
  • VO:
    • Address permissions problem regarding Atlas User access to all Atlas data.
    • Migrate Atlas LFC to CERN.

Entries in GOC DB starting between 21st December 2011 and 4th January 2012.

There was one unscheduled entry during this period: a warning declared retrospectively for the failure of the SAR-UKLight router network link.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All CEs (all batch) | SCHEDULED | OUTAGE | 04/01/2012 20:00 | 05/01/2012 16:00 | 20 hours | Batch unavailable (with drain beforehand) during intervention on Castor system.
All Castor storage. | UNSCHEDULED | WARNING | 22/12/2011 01:30 | 22/12/2011 08:45 | 7 hours and 15 minutes | Service degradation at RAL for all SRMs. A network problem caused some file transfers to fail at RAL.
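
The Duration column follows directly from the Start and End timestamps. As a cross-check, the arithmetic can be sketched as below. The function names are hypothetical, and the DD/MM/YYYY HH:MM format is assumed from the entries above rather than taken from any GOC DB interface.

```python
from datetime import datetime, timedelta

# Timestamp layout assumed from the table entries (DD/MM/YYYY HH:MM).
TIMESTAMP_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start, end):
    """Return the length of a downtime from its start and end timestamps."""
    begin = datetime.strptime(start, TIMESTAMP_FORMAT)
    finish = datetime.strptime(end, TIMESTAMP_FORMAT)
    if finish < begin:
        raise ValueError("downtime ends before it starts")
    return finish - begin


def describe(duration):
    """Format a timedelta in the same style as the Duration column."""
    total_minutes = int(duration.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    if minutes:
        return f"{hours} hours and {minutes} minutes"
    return f"{hours} hours"


if __name__ == "__main__":
    print(describe(downtime_duration("04/01/2012 20:00", "05/01/2012 16:00")))
    print(describe(downtime_duration("22/12/2011 01:30", "22/12/2011 08:45")))
```

Run against the two entries above, this gives 20 hours and 7 hours and 15 minutes, matching the declared durations.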

Open GGUS Tickets

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
77739 | Red | Less urgent | In Progress | 2011-12-25 | 2011-12-25 | CMS | [sr #125424] T1_UK_RAL Job Robot error
77528 | Red | Less urgent | In Progress | 2011-12-16 | 2011-12-16 | H1 | hone jobs cannot be submitted through lcgwms03.gridpp.rl.ac.uk wms-server
77026 | Red | Less Urgent | On Hold | 2011-12-05 | 2011-12-15 | | BDII
74353 | Red | very urgent | In Progress | 2011-09-16 | 2011-12-07 | Pheno | Proxy not renewing properly from WMS
68853 | Red | less urgent | On hold | 2011-03-22 | 2011-12-15 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)
68077 | Red | less urgent | in progress | 2011-02-28 | 2011-09-20 | | Mandatory WLCG InstalledOnlineCapacity not published
64995 | Red | less urgent | in progress | 2010-12-03 | 2011-09-20 | | No GlueSACapability defined for WLCG Storage Areas