Tier1 Operations Report 2013-02-20

RAL Tier1 Operations Report for 20th February 2013

Review of Issues during the week 13th to 20th February 2013.
  • It was not possible to recover the RAID array on GDSS594 (GenTape) following the double drive failure on 12th Feb. 68 files which had not been migrated to tape at the time of the problem have been declared lost. All of these belonged to T2K.
  • There was a power shutdown in the Atlas Building over the last weekend (16/17 Feb) for safety checks. This had no effect on the Tier1.
Resolved Disk Server Issues
  • None
Current operational status and issues
  • There have been intermittent problems over the past fortnight with the start rate for batch jobs. These are still being investigated.
  • Atlas has intermittently put our batch system offline over the last week following test failures. We are also investigating a higher job failure rate for LHCb. (These problems may be related.)
  • High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3PB of storage).
  • Investigations are ongoing into asymmetric bandwidth to/from other sites. We are seeing some poor outbound rates - a problem which disappears when we are not loading the network.
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • We are participating in xrootd federated access tests for Atlas.
  • A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place. It is currently being tested by Atlas.
Ongoing Disk Server Issues
  • Following the loss of data from GDSS594 (GenTape) referred to above, the server's RAID array is being rebuilt. Acceptance testing will then be re-run before the server is considered for return to service.
Notable Changes made this last week
  • Wed 13th Feb: AFS clients stopped on Worker Nodes.
Declared in the GOC DB
  • Tuesday 26th February: Warning on Site for an hour during a central network intervention that will cause two short breaks in external connectivity via the firewall. (We will drain FTS ahead of this.)
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Replace central switch (C300). (Anticipated for a Tuesday during March). This will:
      • Improve the stack 13 uplink.
      • Change the network trunking as part of investigation (& possible fix) into asymmetric data rates.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Infrastructure:
    • Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.


Entries in GOC DB starting between 13th and 20th February 2013.

None

Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
91146 | Red | Urgent | In Progress | 2013-02-04 | 2013-02-12 | Atlas | RAL input bandwith issues
91029 | Red | Very Urgent | In Progress | 2013-01-30 | 2013-02-18 | Atlas | FTS problem in queryin jobs
90528 | Red | Less Urgent | Waiting Reply | 2013-01-17 | 2013-02-19 | SNO+ | WMS not assiging jobs to sheffield
90151 | Red | Less Urgent | In Progress | 2013-01-08 | 2013-02-04 | NEISS | Support for NEISS VO on WMS
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-01-16 | | correlated packet-loss on perfsonar host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
13/02/13 | 100 | 100 | 100 | 100 | 100 |
14/02/13 | 100 | 100 | 100 | 100 | 100 |
15/02/13 | 100 | 100 | 100 | 100 | 100 |
16/02/13 | 100 | 100 | 100 | 100 | 100 |
17/02/13 | 100 | -100 | 100 | 100 | 100 | Problem with ALICE's monitoring.
18/02/13 | 100 | 100 | 100 | 100 | 100 |
19/02/13 | 100 | 100 | 100 | 100 | 100 |