Tier1 Operations Report 2012-03-21


RAL Tier1 Operations Report for 21st March 2012

Review of Issues during the week 14th to 21st March 2012.

  • There was a significant outage of the Tier1 on Friday 16th March owing to problems on the Tier1 network. Network changes made while adding a new node appear to have caused a packet storm. The site routers disconnected the Tier1 and a number of our network stacks had problems; in particular, one stack failed to recover until a faulty switch was identified and removed. A post-mortem of this incident is underway. The incident started at 09:30 and was not completely over until 15:50.

Resolved Disk Server Issues

  • None

Current operational status and issues.

  • Work is ongoing to get to the root of the problems affecting the network link between the UKlight and SAR routers. The last break was over a week ago (on Tuesday 13th March). Various components in the affected routers have been changed, although it is too soon to say whether the problem is finally resolved.
  • There have been two incidents in which the FTS has failed since it was upgraded to version 2.2.8 on Tuesday 6th March. These occurred on Wednesday 14th and overnight on 16/17th March. They are actively under investigation, with an Oracle database patch applied this morning (21st March). These and other issues (including agents crashing) have been reported to the developers, who supplied updated RPMs; the FTS was patched just before the start of this meeting. As part of the investigations some of the FTS monitoring was switched off temporarily, as it was seen to add load to the FTS database.

Ongoing Disk Server Issues

  • GDSS392 (CMSTape D0T1) was taken out of production on Sunday evening (18th March). It is currently undergoing tests.

Notable Changes made this last week

  • All tape servers have now been updated to Castor 2.1.11-8 with the improved tape drivers.
  • All worker nodes now have the Castor 2.1.11 clients installed.
  • Castor SRMs have had the latest OS patches applied (Tuesday 20th March).
  • The disk array used in the standby Castor databases has been replaced with the one required for its final configuration (Tuesday/Wednesday 20/21 March).

Forthcoming Work & Interventions

  • The second batch of worker nodes is expected to go into production within a few weeks.
  • A further intervention on a power board supplied by the UPS will be needed. This very low risk work is planned for Tuesday 27th March.

Declared in the GOC DB

  • Wednesday 28th March - Upgrade of MyProxy to UMD version.

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. We are carrying out a significant amount of work during the current LHC stop.

  • Databases:
    • Regular Oracle "PSU" patches are pending for SOMNUS (LFC & FTS).
    • Switch LFC/FTS/3D to new Database Infrastructure.
    • Update LFC/FTS databases to Oracle 11.
  • Castor:
    • Update the Castor Information Provider (CIP) (Need to re-schedule.)
    • Move to use Oracle 11g (requires a minor Castor update.)
  • Networking:
    • Install new Routing & Spine layers for Tier1 network.
    • Main RAL network updates - early summer.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Updates of Grid Services (including WMS, LFC front ends) to EMI/UMD versions.

Entries in GOC DB starting between 14th and 21st March 2012.

There were eight unscheduled entries in the GOC DB for this last week. These relate to the Tier1 site outage on Friday (16th) and the problems with the FTS (both reported above).

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgfts, lfc.gridpp.rl.ac.uk | UNSCHEDULED | WARNING | 21/03/2012 10:00 | 21/03/2012 12:00 | 2 hours | Warning (At Risk) while applying patch to Oracle database to fix ongoing problem with FTS.
All Castor SRMs | SCHEDULED | WARNING | 20/03/2012 11:00 | 20/03/2012 14:00 | 3 hours | Application of OS patches to SRM nodes.
lcgfts | UNSCHEDULED | WARNING | 20/03/2012 09:30 | 20/03/2012 15:30 | 6 hours | Warning (At Risk) during investigation of ongoing problem on the FTS.
lcgfts | UNSCHEDULED | WARNING | 17/03/2012 10:00 | 19/03/2012 12:00 | 2 days, 2 hours | FTS at risk due to possible issues with the backend database.
lcgfts | UNSCHEDULED | OUTAGE | 17/03/2012 00:00 | 17/03/2012 10:00 | 10 hours | FTS downtime due to problems with the backend database.
Batch (All CEs) and srm-atlas, srm-cms, srm-lhcb | UNSCHEDULED | OUTAGE | 16/03/2012 13:00 | 16/03/2012 15:50 | 2 hours and 50 minutes | Following network problems many services are back. However, problems still affect Castor for Atlas, CMS and LHCb and batch services.
Whole Site | UNSCHEDULED | OUTAGE | 16/03/2012 11:00 | 16/03/2012 13:05 | 2 hours and 5 minutes | Site outage following network problems. Whilst we have largely recovered from the internal network problem we have to systematically check services.
Whole Site | UNSCHEDULED | OUTAGE | 16/03/2012 09:30 | 16/03/2012 11:00 | 1 hour and 30 minutes | Outage on whole site while we investigate network problems.
lcgfts.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 14/03/2012 15:00 | 14/03/2012 18:30 | 3 hours and 30 minutes | FTS downtime due to problems with the backend database.

Open GGUS Tickets

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
80471 | Green | Urgent | In Progress | 2012-03-21 | 2012-03-21 | Atlas | Error with credential in FTS transfers to/from UKI-LT2-QMUL
80119 | Green | Less Urgent | Waiting Reply | 2012-03-12 | 2012-03-21 | SNO+ | ROOT build failing
79428 | Red | Less Urgent | Waiting Reply | 2012-02-21 | 2012-03-19 | SNO+ | glite-wms-job aborted
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-03-12 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)