Tier1 Operations Report 2012-03-07

RAL Tier1 Operations Report for 7th March 2012

Review of Issues during the week 29th February to 7th March 2012.

  • The Castor database saw very heavy load from Atlas during the early hours of Saturday morning (3rd March), which appears to have been linked to a particular workload.
  • On Saturday (3rd March) there was a failure of the network link between the UKlight and SAR routers (this affects transfers to/from Tier2s) that lasted a bit under an hour around lunchtime. A fibre transceiver in this link was replaced on Tuesday morning (6th March) while the FTS was down for a scheduled intervention.

Resolved Disk Server Issues

  • GDSS513 (LHCbDst - D1T0) was removed from production on Wednesday 29th February following multiple drive failures. It was returned to service the following morning (1st March).

Current operational status and issues.

  • There is a known issue with the Atlas SRMs, which is being investigated. A patched version that provides a workaround and stops the SRMs crashing has been rolled out. The remaining impact of this problem is minimal.

Ongoing Disk Server Issues

  • None.

Notable Changes made this last week

  • Thursday (1st March): One of the two batches of new worker nodes was moved into production.
  • Tuesday (6th March): The Castor databases were moved to the final hardware configuration for the main database, with Oracle Data Guard enabled to synchronize updates to the backup database (see the sketch after this list).
  • Tuesday (6th March): The FTS was updated to version 2.2.8 (still using Oracle 10).
  • Tuesday (6th March): A network routing change was applied to extend the range of addresses that route over the OPN.
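
As an illustration only (not taken from the report, and not the team's own tooling), the sketch below shows the kind of check that confirms a Data Guard standby is receiving redo from the primary: it queries the standard Oracle views for the database role and the status of each redo transport destination. The monitoring account, password and DSN are placeholders.

    # Minimal sketch, assuming a monitoring account exists on the primary Castor
    # database; the user, password and DSN below are placeholders, not real values.
    import cx_Oracle

    conn = cx_Oracle.connect("monitor_user", "monitor_password", "primary-db-host:1521/CASTOR")
    cur = conn.cursor()

    # Confirm the role and protection mode of the database we are connected to.
    cur.execute("SELECT database_role, protection_mode FROM v$database")
    print(cur.fetchone())

    # List each active redo transport destination; the standby should report
    # status VALID with no error recorded.
    cur.execute("SELECT dest_name, status, error FROM v$archive_dest_status WHERE status <> 'INACTIVE'")
    for dest_name, status, error in cur:
        print(dest_name, status, error or "no error")

    cur.close()
    conn.close()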

Forthcoming Work & Interventions

  • The Tier1 internal mail server ("Pat") will be replaced in the next couple of weeks.

Declared in the GOC DB

  • None

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. We are carrying out a significant amount of work during the current LHC stop.

  • Databases:
    • Regular Oracle "PSU" patches are pending for SOMNUS (LFC & FTS).
    • Switch LFC/FTS/3D to new Database Infrastructure.
    • Update LFC/FTS databases to Oracle 11.
  • Castor:
    • Update the Castor Information Provider (CIP) (needs to be re-scheduled).
    • Move to use Oracle 11g (requires a minor Castor update).
  • Networking:
    • Install new Routing & Spine layers for Tier1 network.
    • Main RAL network updates - early summer.
  • Fabric:
    • BIOS/firmware updates and other re-configurations (adding IPMI cards, etc.).
  • Grid Services:
    • Updates of Grid Services (including WMS, MyProxy, LFC front ends) to EMI/UMD versions.

Entries in GOC DB starting between 29th February and 7th March 2012.

There were no unscheduled outages during this period.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor & CEs (batch) | SCHEDULED | OUTAGE | 06/03/2012 10:00 | 06/03/2012 13:30 | 3 hours and 30 minutes | Castor outage during migration of the Castor Oracle databases to new hardware.
lcgfts.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 06/03/2012 08:00 | 06/03/2012 13:05 | 5 hours and 5 minutes | Upgrade to FTS 2.2.8. This includes starting with a fresh database, so all channels are drained and any transfers waiting in the ready queue will be lost.
Castor GEN instance: srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k | SCHEDULED | OUTAGE | 29/02/2012 08:00 | 29/02/2012 12:00 | 4 hours | Update of the GEN Castor instance to version 2.1.11-8.

Open GGUS Tickets

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
79867 | Green | Less Urgent | In Progress | 2012-03-04 | 2012-03-07 | SNO+ | snoplus.snolab.ca LFC
79428 | Red | Less Urgent | Waiting Reply | 2012-02-21 | 2012-03-07 | SNO+ | glite-wms-job aborted
74353 | Red | Very Urgent | In Progress | 2011-09-16 | 2012-03-02 | Pheno | Proxy not renewing properly from WMS
68853 | Red | Less Urgent | On Hold | 2011-03-22 | 2012-03-02 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)