Tier1 Operations Report 2013-02-27


RAL Tier1 Operations Report for 27th February 2013

Review of Issues during the week 20th to 27th February 2013.
  • Overnight Wed/Thu (20/21 Feb) there was a problem with the Castor tape robot. It was resolved during the following day. There was no significant operational impact.
  • On Saturday evening (23rd Feb) there was a problem with the Atlas Castor instance that lasted a few hours and was fixed by the Castor on-call.
  • There was a planned network intervention yesterday (Tuesday) morning, for which a 'warning' was scheduled in the GOC DB and the FTS drained. Rather than the expected two short (few-minute) breaks in connectivity, the external network connection was down for around 30 minutes. Apart from the planned stop of the FTS, all services carried on running normally within the site.
Resolved Disk Server Issues
  • GDSS447 (Atlas DataDisk) failed with a read-only filesystem in the early hours of Monday (25th Feb). It was returned to production at the end of that afternoon.
Current operational status and issues
  • There have been intermittent problems over the past few weeks with the start rate for batch jobs. These are still being investigated.
  • We are investigating a higher job failure rate for LHCb and Atlas. This appears to be caused by job set-ups taking a long time (a rough timing sketch follows this list).
  • High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 generation of disk servers (around 3PB of storage).
  • Investigations are ongoing into the asymmetric bandwidth to/from other sites. We are seeing some poor outbound rates, a problem which disappears when we are not loading the network.
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • We are participating in xrootd federated access tests for Atlas.
  • A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
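A quick way to check whether the slow set-up phase (rather than the job payload itself) is behind the higher failure rate is to time the set-up step on its own on an affected worker node. The sketch below is illustrative only: the set-up command is a placeholder standing in for whatever environment script the Atlas or LHCb jobs actually source, and the 300-second threshold is an arbitrary example rather than an agreed limit.

  import subprocess
  import sys
  import time

  # Placeholder: substitute the real experiment set-up script run by the jobs.
  SETUP_COMMAND = ["/bin/bash", "-c", "source /path/to/experiment_setup.sh"]
  THRESHOLD_SECONDS = 300  # arbitrary cut-off used to flag a slow set-up

  start = time.time()
  return_code = subprocess.call(SETUP_COMMAND)
  elapsed = time.time() - start

  print("set-up exited with code %d after %.1f seconds" % (return_code, elapsed))
  if elapsed > THRESHOLD_SECONDS:
      sys.exit("WARNING: set-up took longer than %d seconds" % THRESHOLD_SECONDS)

Run repeatedly, this gives a simple distribution of set-up times that can be compared across worker node generations or against the batch start-rate problem noted above.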
Ongoing Disk Server Issues
  • GDSS594 (GenTape) remains unavailable; it will re-run acceptance testing before being considered for return to service.
Notable Changes made this last week
  • On Friday (22nd Feb) a minor change was made to the FTS configuration for some channels (mainly from UK Tier2s to us) in response to a low level of failures caused by a short timeout (an illustrative check follows this list).
  • During the last week a number of tape servers have been upgraded to Castor 2.1.13-9.
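The reasoning behind the FTS channel change above, that a short timeout was producing a low level of spurious failures, can be illustrated with a simple check over transfer records. The sketch below uses made-up (duration, succeeded) pairs purely for illustration; in practice the equivalent information would come from the FTS transfer logs. Failures whose durations cluster at or just above the configured timeout point to the timeout, rather than the transfer itself, as the cause.

  # Illustrative only: made-up transfer records as (duration_seconds, succeeded).
  OLD_TIMEOUT = 180  # hypothetical old channel timeout, in seconds

  transfers = [
      (42.0, True), (95.0, False), (121.7, True), (160.3, True),
      (178.5, True), (180.1, False), (180.4, False), (181.2, False),
  ]

  failures = [duration for duration, ok in transfers if not ok]
  near_timeout = [d for d in failures if OLD_TIMEOUT <= d <= OLD_TIMEOUT + 10]

  print("%d failures out of %d transfers" % (len(failures), len(transfers)))
  print("%d of the failures ended within 10s of the %ds timeout"
        % (len(near_timeout), OLD_TIMEOUT))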
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Tuesday 12th March: Outage for replacement of core network switch (C300).
  • A program of updating the disk controller firmware in the 2011 Clustervision batch of disk servers is being undertaken.
  • The number of nodes behind the SL6 trial batch queue will be increased (by around 450 job slots) by adding in new CPU nodes.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Replace central switch (C300). (Anticipated for a Tuesday during March). This will:
      • Improve the stack 13 uplink.
      • Change the network trunking as part of the investigation into (and possible fix for) the asymmetric data rates.
    • Update the core Tier1 network and change the connection to the site and the OPN, including:
      • Install a new routing layer for the Tier1.
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
    • Addition of caching DNSs into the Tier1 network (a lookup-latency sketch follows this listing).
  • Infrastructure:
    • Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
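On the caching DNS item under Networking above: the aim of local caching resolvers is to keep name-lookup latency low inside the Tier1 network and to reduce dependence on the site resolvers. A minimal sketch of the kind of before/after measurement involved is given below; it uses only the Python standard library, the hostnames are placeholders, and it simply times getaddrinfo() calls against whatever resolver the host is currently configured to use.

  import socket
  import time

  # Placeholder hostnames; substitute names that Tier1 services actually resolve.
  HOSTNAMES = ["host-a.example.org", "host-b.example.org"]

  for name in HOSTNAMES:
      for attempt in (1, 2):
          start = time.time()
          try:
              socket.getaddrinfo(name, None)
              status = "ok"
          except socket.gaierror as err:
              status = "failed (%s)" % err
          elapsed_ms = (time.time() - start) * 1000.0
          print("%-25s attempt %d: %s in %.1f ms" % (name, attempt, status, elapsed_ms))

Running this once against the current resolvers and again after pointing the host at a local caching DNS should show the repeat lookups returning much faster from the cache.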


Entries in the GOC DB starting between 20th and 27th February 2013.

There were no unscheduled entries in the GOC DB for last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole site | SCHEDULED | WARNING | 26/02/2013 07:30 | 26/02/2013 08:30 | 1 hour | At Risk around two short (few-minute) breaks in external connectivity to the RAL Tier1. Will drain FTS for an hour beforehand as a precaution.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
91687 | Amber | Less Urgent | In Progress | 2013-02-21 | 2013-02-21 | epic | Support for epic.vo.gridpp.ac.uk VO on WMS
91658 | Amber | Less Urgent | In Progress | 2013-02-20 | 2013-02-22 | | LFC webdav support
91146 | Red | Urgent | In Progress | 2013-02-04 | 2013-02-12 | Atlas | RAL input bandwidth issues
91029 | Red | Very Urgent | On Hold | 2013-01-30 | 2013-02-27 | Atlas | FTS problem in querying jobs
90528 | Red | Less Urgent | Waiting Reply | 2013-01-17 | 2013-02-19 | SNO+ | WMS not assigning jobs to Sheffield
90151 | Red | Less Urgent | Waiting Reply | 2013-01-08 | 2013-02-27 | NEISS | Support for NEISS VO on WMS
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-01-16 | | Correlated packet-loss on perfSONAR host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
20/02/13 | 100 | 100 | 99.2 | 100 | 100 | Single user timeout, failure to put a file. Investigations show that the problem was within Castor, although it is not understood in detail.
21/02/13 | 100 | 100 | 97.4 | 100 | 100 | A few failures of the SRM test. Investigations suggest that a couple of them were due to the test itself.
22/02/13 | 100 | 100 | 97.3 | 100 | 100 | A few failures of the SRM 'Put' test.
23/02/13 | 100 | 100 | 93.8 | 100 | 100 | Problem with the Atlas Castor instance fixed by the on-call.
24/02/13 | 100 | 100 | 100 | 100 | 100 |
25/02/13 | 100 | 100 | 100 | 100 | 100 |
26/02/13 | 100 | 100 | 96.3 | 95.9 | 95.8 | Failures of SRM tests triggered by the scheduled network intervention.