Tier1 Operations Report 2013-02-13

RAL Tier1 Operations Report for 13th February 2013

Review of Issues during the week 6th to 13th February 2013.
  • A low-level SRM problem caused Atlas to put RAL offline for brief periods.
  • Maintenance of the R89 machine room air conditioning was completed on 07/02/2013.
Resolved Disk Server Issues
  • None
Current operational status and issues
  • There have been intermittent problems over the past week with the start rate for batch jobs. This is being investigated.
  • There is a GGUS ticket for an FTS problem that is caused by a fault within the Castor SRM.
  • The batch server process sometimes consumes excessive memory, which is normally triggered by a network/communication problem with worker nodes. A test for this is in place.
  • High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3 PB of storage).
  • Investigations are ongoing into asymmetric bandwidth to/from other sites. We are seeing some poor outbound rates, a problem which disappears when we are not loading the network.
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • System set-up for participation in xrootd federated access tests for Atlas.
  • A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place. It is currently being tested by Atlas.
Ongoing Disk Server Issues
  • gdss594 (GenTape) suffered a double drive failure last night (12/02/2013). Fabric are currently investigating.
Notable Changes made this last week
  • Today, 13th February: Stopping AFS client on Worker Nodes.
Declared in the GOC DB
  • None
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Replace central switch (C300). (Tentative date 5th March, but Atlas would like it earlier). This will:
      • Improve the stack 13 uplink.
      • Change the network trunking as part of the investigation into (and possible fix for) asymmetric data rates.
    • Update core Tier1 network and change connection to site and OPN including:
      • Core networking has informed us that they need to re-configure a core switch on 26/02/2013 between 07:30 and 08:30
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Removal of AFS clients from Worker Nodes.
  • Infrastructure:
    • Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.


Entries in GOC DB starting between 6th and 13th February 2013.

None

Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
91251 | Red | top priority | In Progress | 2013-02-07 | 2013-02-07 | lhcb | CEs don't seem to be running jobs
91146 | Red | Urgent | In Progress | 2013-02-04 | 2013-02-12 | Atlas | RAL input bandwith issues
91029 | Red | Very Urgent | In Progress | 2013-01-30 | 2013-02-11 | Atlas | FTS problem in queryin jobs
90528 | Red | Less Urgent | In Progress | 2013-01-17 | 2013-02-04 | SNO+ | WMS not assiging jobs to sheffield
90151 | Red | Less Urgent | Waiting Reply | 2013-01-08 | 2013-02-04 | NEISS | Support for NEISS VO on WMS
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-01-16 | | correlated packet-loss on perfsonar host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
06/02/13 | 100 | 100 | 100 | 100 | 100 |
07/02/13 | 100 | 100 | 100 | 100 | 100 |
08/02/13 | 100 | 100 | 100 | 100 | 100 |
09/02/13 | 100 | 100 | 99.2 | 100 | 100 | User timeout, failure to put a file and subsequent failure to delete it.
10/02/13 | 100 | 100 | 100 | 100 | 100 |
11/02/13 | 100 | 100 | 100 | 100 | 100 |
12/02/13 | 100 | 100 | 100 | 100 | 100 |
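
The daily figures above can be reduced to a weekly mean per VO. The following is a minimal Python sketch of that arithmetic and is not part of the Tier1 reporting tools. The names `VOS`, `DAILY` and `weekly_mean` are hypothetical, and the values are copied by hand from the availability table.

```python
# Hypothetical helper: weekly mean availability per VO.
# Values are copied from the availability table in this report.
VOS = ["OPS", "Alice", "Atlas", "CMS", "LHCb"]

DAILY = {
    "06/02/13": [100, 100, 100, 100, 100],
    "07/02/13": [100, 100, 100, 100, 100],
    "08/02/13": [100, 100, 100, 100, 100],
    "09/02/13": [100, 100, 99.2, 100, 100],
    "10/02/13": [100, 100, 100, 100, 100],
    "11/02/13": [100, 100, 100, 100, 100],
    "12/02/13": [100, 100, 100, 100, 100],
}


def weekly_mean(daily, vos):
    """Return a dict mapping each VO to its mean availability (%) over all days."""
    days = len(daily)
    return {
        vo: sum(row[i] for row in daily.values()) / days
        for i, vo in enumerate(vos)
    }


if __name__ == "__main__":
    for vo, mean in weekly_mean(DAILY, VOS).items():
        print(f"{vo}: {mean:.2f}%")
```

Only the Atlas figure falls below 100, because of the single 99.2 entry on 09/02/13.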