Tier1 Operations Report 2013-04-10


RAL Tier1 Operations Report for 10th April 2013

The Post Mortem review of the failure of disk server GDSS594 (GenTape) in February that led to the loss of 68 T2K files has been completed. This can be seen at:

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20130219_Disk_Server_Failure_File_Loss

Review of Issues during the week 3rd to 10th April 2013.
  • On Tuesday morning, 9th April, a planned intervention on the site networking ran into problems and the RAL site was disconnected from the rest of the world for around 100 minutes. The intervention had previously been announced as a scheduled 'Warning' in the GOC DB and the FTS had been drained. Internally, Tier1 services carried on OK during the external break.
  • Two files were declared lost to Atlas following the failure of disk server GDSS454.
Resolved Disk Server Issues
  • GDSS454 (AtlasDataDisk D1T0) failed with a read-only file system on Sunday 7th April. Following checks it was returned to service on Monday (8th). Two files that were being written at the time of the failure were declared lost to Atlas.
Current operational status and issues
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • There have been intermittent problems over the past few weeks with the start rate for batch jobs. A script has been introduced that regularly checks for this and takes action to minimise its effects (a rough sketch of this type of check is given after this list).
  • The problem of LHCb and Atlas jobs failing due to long job set-up times remains, and investigations continue.
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • We are participating in xrootd federated access tests for Atlas.
  • A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
  • There is an outstanding problem (and GGUS ticket) affecting the certificate on the MyProxy server.
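The sketch below illustrates the sort of batch start-rate check referred to above. It is an illustration only, not the production script: it assumes a Torque-style batch system where 'qstat' lists one job per line with its state in the fifth column, and the threshold and the action taken are hypothetical.

  #!/usr/bin/env python
  # Illustrative sketch only - not the production check. Assumes a Torque-style
  # batch system where `qstat` reports one job per line with its state in the
  # fifth column. Counts running jobs as a crude proxy for the job start rate.
  import subprocess
  import sys

  MIN_RUNNING = 10  # hypothetical threshold

  def count_running_jobs():
      output = subprocess.check_output(["qstat"]).decode()
      running = 0
      for line in output.splitlines():
          fields = line.split()
          # Typical qstat row: JobID  Name  User  Time  State  Queue
          if len(fields) >= 6 and fields[4] == "R":
              running += 1
      return running

  if __name__ == "__main__":
      running = count_running_jobs()
      if running < MIN_RUNNING:
          # The real script would alarm or nudge the scheduler at this point.
          print("WARNING: only %d running jobs - start rate may have stalled" % running)
          sys.exit(1)
      print("OK: %d running jobs" % running)
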
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • Updating of the disk controller firmware on the Clustervision '11 batch of disk servers is ongoing (the LHCb servers were done this week).
  • Kernel/errata updates and the removal of AFS software (as opposed to just disabling it) are being applied across the worker nodes.
  • New disk servers deployed in production (540TB to AtlasDataDisk; 720TB to CMSDisk).
  • One of the two batches of new worker nodes (the one from OCF) has been deployed into production.
Declared in the GOC DB
  • This evening (Wed 10th April, 18:00 - 23:59 BST) there is emergency maintenance affecting both the main and backup links to CERN. The site is declared as 'Warning'.
  • Tomorrow (Thursday 11th April) there is an outage of the LFC and FTS services (10:00 - 12:00). The Oracle database behind these services uses two disk arrays. One of the arrays is reporting errors and the database will be reconfigured (rebalanced) to move the data off the faulty array. FTS transfers will be drained before the outage. (A sketch of the kind of operation involved is given below.)
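For context, a rebalance of this kind is a standard Oracle operation when the storage is managed with ASM: dropping the disks of the faulty array from the disk group causes Oracle to migrate the data onto the remaining array while the database stays up. The sketch below is illustrative only; the disk group and disk names are placeholders, and the actual procedure used by the database team may differ.

  #!/usr/bin/env python
  # Illustrative sketch only - placeholder names, not the actual RAL configuration.
  # Assumes the LFC/FTS database storage is an Oracle ASM disk group; dropping a
  # disk triggers an automatic rebalance onto the remaining disks.
  import subprocess

  SQL = """
  ALTER DISKGROUP lfc_fts_data
    DROP DISK faulty_array_disk_01
    REBALANCE POWER 4;
  EXIT;
  """

  # Run the statement through sqlplus as the ASM administrator.
  proc = subprocess.Popen(["sqlplus", "-S", "/ as sysasm"],
                          stdin=subprocess.PIPE, stdout=subprocess.PIPE)
  out, _ = proc.communicate(SQL.encode())
  print(out.decode())
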
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • One of the disk arrays hosting the LFC/FTS/3D databases has given some errors. An intervention to move the 'somnus' (LFC & FTS) data off this array is planned for tomorrow. A further intervention will be required on the array itself which will affect the Atlas 3D service.
  • A program of updating the disk controller firmware in the 2011 Clustervision batch of disk servers is ongoing. (Alice disk servers in this batch remain to be done).

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services
    • Upgrade of other EMI-1 components (APEL, UI) under investigation.
  • Infrastructure:
    • Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
    • Electrical safety check (will require some downtime).


Entries in GOC DB starting between 3rd and 10th April 2013.

There were one unscheduled outage entry (for the problematic network intervention) and one unscheduled warning entry (for emergency maintenance on the CERN OPN links) in the GOC DB for the last week.

Service Scheduled? Outage/At Risk Start End Duration Reason
Whole Site SCHEDULED WARNING 10/04/2013 18:00 11/04/2013 00:00 6 hours An emergency maintenance has been announced for both the main and backup OPN links RAL - CERN.
Whole Site UNSCHEDULED OUTAGE 09/04/2013 07:45 09/04/2013 09:25 1 hour and 40 minutes Problem during planned network intervention broke connectivity to site. (Retrospective addition to GOC DB. Intervention originally declared as a Warning.)
Whole Site SCHEDULED WARNING 09/04/2013 07:30 09/04/2013 08:30 1 hour At Risk around two short (few minute) breaks in external connectivity to the RAL Tier1 required for a network upgrade. Will drain FTS for an hour beforehand as a precaution.
Whole Site UNSCHEDULED WARNING 03/04/2013 18:00 04/04/2013 00:00 6 hours An emergency maintenance has been announced for both the main and backup OPN links RAL - CERN. No outage is expected during this maintenance. Services are considered at risk only.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
93149 Green Less Urgent On Hold 2013-04-05 2013-04-08 Atlas RAL-LCG2: jobs failing with " cmtside command was timed out"
93136 Yellow Less Urgent In Progress 2013-04-05 2013-04-05 EPIC Problems downloading job output using RAL WMS (epic VO)
92266 Red Less Urgent In Progress 2013-03-06 2013-04-09 Certificate for RAL myproxy server
91658 Red Less Urgent On Hold 2013-02-20 2013-04-03 LFC webdav support
91029 Red Very Urgent On Hold 2013-01-30 2013-02-27 Atlas FTS problem in queryin jobs
86152 Red Less Urgent On Hold 2012-09-17 2013-03-19 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
20/03/13 100 100 100 100 100
03/04/13 100 100 100 100 99.3 Job cancelled/purged.
04/04/13 100 100 99.2 95.9 100 Atlas: Single SRM test failure "User timeout". CMS: Single SRM test failure "User timeout".
05/04/13 100 100 100 100 100
06/04/13 100 100 100 99.4 100 Single SRM test failure "User timeout" at very end of day (main effect is on the 7th).
07/04/13 100 100 100 92.5 100 Single SRM test failure "User timeout" at very end of day (main effect is on the 7th).
08/04/13 100 100 99.1 87.7 100 Atlas: 1 * "could not open connection to srm-atlas.gridpp.rl.ac.uk"; CMS: Total of three SRM test failures. 1 * "could not open connection to srm-cms.gridpp.rl.ac.uk"; 2 * "User timeout".
09/04/13 91.7 100 92.7 90.4 93.4 Problem during planned central networking intervention disconnected site for around 100 minutes.