Tier1 Operations Report 2013-02-06


RAL Tier1 Operations Report for 6th February 2013

Review of Issues during the week 30th January to 6th February 2013.
  • There was a problem with the Atlas Castor instance during the night and morning of Thursday 31st January. This was traced to a single unresponsive disk server; rebooting the server fixed the problem.
Resolved Disk Server Issues
  • GDSS644 (AtlasScratchDisk D1T0) was found to be responding very slowly on Thursday (31st January), causing problems for the Atlas Castor instance. It was rebooted, which resolved the problem.
Current operational status and issues
  • There has been an intermittent problem with the start rate for batch jobs over the last couple of days (5th/6th February); this is being investigated.
  • There is an open GGUS ticket for a problem seen by the FTS that is caused by an issue within the Castor SRM.
  • The batch server process sometimes consumes excessive memory; this is normally triggered by a network/communication problem with the worker nodes. A check for this condition, with an automatic re-starter, is in place (see the sketch after this list).
  • High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (around 3 PB of storage).
  • Investigations are ongoing into asymmetric bandwidth to/from other sites. Some poor outbound rates are being seen; the problem disappears when the network is not under load.
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • A system has been set up for participation in the xrootd federated access tests for Atlas.
  • A test batch queue, with five SL6/EMI-2 worker nodes and its own CE, is in place and is currently being tested by Atlas.
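
In outline, the re-starter mentioned above is a memory watchdog on the batch server. Below is a minimal sketch of how such a check could work, assuming a hypothetical process name, memory limit and restart command; it is illustrative only and not the production script.

  #!/usr/bin/env python
  # Minimal sketch of a memory watchdog / re-starter for the batch server
  # process. The process name, RSS limit and restart command are assumptions
  # for illustration, not the production values.
  import subprocess
  import time

  PROCESS_NAME = "pbs_server"        # assumed batch server process name
  RSS_LIMIT_KB = 8 * 1024 * 1024     # assumed limit: 8 GB resident memory
  CHECK_INTERVAL = 300               # seconds between checks

  def rss_kb(pid):
      """Return the resident set size (kB) of a process, read from /proc."""
      with open("/proc/%d/status" % pid) as status:
          for line in status:
              if line.startswith("VmRSS:"):
                  return int(line.split()[1])
      return 0

  def main():
      while True:
          try:
              pids = subprocess.check_output(["pgrep", "-x", PROCESS_NAME]).decode().split()
          except subprocess.CalledProcessError:
              pids = []              # process not running; nothing to check
          for pid in pids:
              if rss_kb(int(pid)) > RSS_LIMIT_KB:
                  # Restart the service once it exceeds the memory limit.
                  subprocess.call(["service", PROCESS_NAME, "restart"])
                  break
          time.sleep(CHECK_INTERVAL)

  if __name__ == "__main__":
      main()
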
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • On Monday (4th February) the upgrade of the Top-BDII to newer systems running SL6/EMI-2 was completed. There are now three systems behind the Top-BDII alias (see the sketch after this list).
  • The H1 VO has been added to the CVMFS system for smaller VOs.
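
A quick way to confirm that the alias now fronts three systems is to resolve it and probe the standard top-level BDII LDAP port (2170) on each address. The sketch below uses only the standard Python library; the alias name shown is an assumption for illustration rather than the confirmed production name.

  #!/usr/bin/env python
  # Minimal sketch: list the hosts behind the Top-BDII DNS alias and check
  # that each one answers on the BDII LDAP port. The alias name is assumed.
  import socket

  ALIAS = "lcgbdii.gridpp.rl.ac.uk"   # assumed Top-BDII alias
  BDII_PORT = 2170                    # standard top-level BDII LDAP port

  def main():
      name, aliases, addresses = socket.gethostbyname_ex(ALIAS)
      print("%s resolves to %d address(es)" % (ALIAS, len(addresses)))
      for addr in addresses:
          sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
          sock.settimeout(5)
          try:
              sock.connect((addr, BDII_PORT))
              print("  %s: port %d open" % (addr, BDII_PORT))
          except socket.error as exc:
              print("  %s: NOT reachable (%s)" % (addr, exc))
          finally:
              sock.close()

  if __name__ == "__main__":
      main()
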
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Replace the central switch (C300). (Tentative date 5th March, but Atlas would like it earlier). This will:
      • Improve the stack 13 uplink.
      • Change the network trunking as part of the investigation into (and possible fix for) the asymmetric data rates.
    • Update the core Tier1 network and change the connection to the site and to the OPN, including:
      • Install a new routing layer for the Tier1.
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Removal of AFS clients from Worker Nodes.
  • Infrastructure:
    • An intervention is required on the "Essential Power Board", plus remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.


Entries in GOC DB starting between 30th January and 6th February 2013.

None

Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
91152 Green Less Urgent In Progress 2013-02-04 2013-02-04 CMS RAL tape migration
91146 Green Urgent In Progress 2013-02-04 2013-02-05 Atlas RAL input bandwidth issues
91060 Yellow Less Urgent On Hold 2013-01-31 2013-02-01 CMS glexec issues on a subset of worker nodes
91029 Red Very Urgent In Progress 2013-01-30 2013-02-06 Atlas FTS problem in querying jobs
90528 Red Less Urgent In Progress 2013-01-17 2013-02-04 SNO+ WMS not assigning jobs to Sheffield
90151 Red Less Urgent Waiting Reply 2013-01-08 2013-02-04 NEISS Support for NEISS VO on WMS
89733 Red Urgent In Progress 2012-12-17 2013-02-04 RAL bdii giving out incorrect information
86152 Red Less Urgent On Hold 2012-09-17 2013-01-16 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
30/01/13 100 100 94.9 100 100 Multiple 'unable to delete file from SRM' failures, plus one 'user timeout' failure.
31/01/13 100 100 90.1 100 100 Atlas Castor instance showing lots of timeouts. Traced to a single disk server that was very unresponsive. Reboot of disk server fixed it.
01/02/13 100 92.3 100 100 100 Alice test job exceeded the 330 minute timeout and was cancelled.
02/02/13 100 100 98.5 100 100 Single SRM test failure - unable to delete file from SRM
03/02/13 100 100 98.2 100 100 One user timeout, one failure to delete file.
04/02/13 100 100 100 100 100
05/02/13 100 100 100 100 100