Tier1 Operations Report 2013-03-06


RAL Tier1 Operations Report for 6th March 2013

Review of Issues during the week 27th February to 6th March 2013.
  • A quiet week operationally (although note the current issues section below). An emergency reboot of one of the site routers late on Tuesday afternoon (5th March) did not cause any operational problems.
Resolved Disk Server Issues
  • GDSS648 (LHCbDst) failed in the early hours of Sunday morning (3rd March). A faulty network card was replaced and the system returned to production around midday on Monday (4th March).
Current operational status and issues
  • This morning (Wednesday 6th March) intermittent network connectivity problems were seen; these are being investigated.
  • There have been intermittent problems over the past few weeks with the start rate for batch jobs. A script has been introduced to check regularly for this and take action to minimise its effects (a sketch of such a check appears after this list).
  • We are investigating a higher job failure rate for LHCb and Atlas, which appears to be caused by job set-up taking a long time. One option being investigated is running jobs re-niced (a second sketch after this list illustrates re-nicing).
  • High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3PB of storage). We anticipate resolving this during the intervention on 12th March.
  • Investigations are ongoing into asymmetric bandwidth to/from other sites. We are seeing some poor outbound rates - a problem which disappears when we are not loading the network. (Scheduled intervention on 12th March will progress this.)
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • We are participating in xrootd federated access tests for Atlas.
  • A test batch queue, with five SL6/EMI-2 worker nodes and its own CE, is in place.
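The start-rate check mentioned above is not described in detail in this report; the following is a minimal sketch of the idea only, assuming a Torque-style batch system whose plain "qstat" output can be parsed. The check interval, the threshold logic and the remedial action are all placeholders, not the production values.

    #!/usr/bin/env python
    # Hypothetical sketch of a batch start-rate watchdog; NOT the
    # production script. Assumes Torque-style "qstat" output in which
    # the job state is the second-to-last column of each data line.
    import subprocess
    import time

    CHECK_INTERVAL = 300  # seconds between checks (placeholder value)

    def job_counts():
        """Return (running, queued) job counts parsed from qstat output."""
        out = subprocess.check_output(["qstat"]).decode()
        running = queued = 0
        for line in out.splitlines():
            fields = line.split()
            if len(fields) >= 6:
                state = fields[-2]
                if state == "R":
                    running += 1
                elif state == "Q":
                    queued += 1
        return running, queued

    def take_remedial_action():
        """Placeholder for whatever the real script does, e.g.
        prodding or restarting the batch scheduler."""
        print("batch start rate appears stalled - intervening")

    prev_running, _ = job_counts()
    while True:
        time.sleep(CHECK_INTERVAL)
        running, queued = job_counts()
        # Crude proxy for the start rate: if jobs are queued but the
        # running count has not grown since the last check, assume
        # job starts have stalled.
        if queued > 0 and running <= prev_running:
            take_remedial_action()
        prev_running = running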
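Re-nicing, as mentioned in the LHCb/Atlas item, lowers the CPU scheduling priority of a job's processes so that slow set-up work competes less aggressively for the node. A minimal illustration (Python 3.3+ for os.setpriority; the niceness value of 10 is arbitrary, not a value taken from this report):

    import os

    def renice(pid, niceness=10):
        """Lower the scheduling priority of a process. 0 is the default
        niceness and 19 the lowest priority; 10 here is illustrative."""
        os.setpriority(os.PRIO_PROCESS, pid, niceness)

    # Example: re-nice the current process and show the result.
    renice(os.getpid())
    print(os.getpriority(os.PRIO_PROCESS, os.getpid()))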
Ongoing Disk Server Issues
  • GDSS594 (GenTape) remains unavailable; it will re-run acceptance testing before being considered for return to service.
Notable Changes made this last week
  • All remaining tape servers have now been upgraded to Castor 2.1.13-9.
  • The number of nodes behind the SL6 trial batch queue has been increased, with a few hundred job slots now available.
  • Disk controller firmware updates in the 2011 Clustervision batch of disk servers (ongoing).
Declared in the GOC DB
  • Site outage on Tuesday 12th March for replacement of core network switch (C300).
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • A program of updating the disk controller firmware in the 2011 Clustervision batch of disk servers is ongoing.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Replace the central switch (C300). (Now declared in the GOC DB for Tuesday 12th March.) This will:
      • Improve the stack 13 uplink.
      • Change the network trunking as part of the investigation into (and possible fix for) the asymmetric data rates.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
    • Addition of caching DNSs into the Tier1 network (a quick resolver-latency check is sketched after this list).
  • Grid Services
    • Upgrade of Site-BDII & WMS from EMI-1 to EMI-2 by end of March.
    • Upgrade of other EMI-1 components (APEL, UI) under investigation.
  • Infrastructure:
    • Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
    • Electrical safety check (will require some downtime).
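The caching-DNS item above is motivated by lookup latency: a resolver inside the Tier1 network can answer repeated queries from its cache without the query leaving the site. A rough way to observe the effect, once such a resolver is configured in /etc/resolv.conf, is to time repeated lookups of the same name (the hostname below is purely illustrative):

    import socket
    import time

    # With a caching resolver configured, the second and later lookups
    # of the same name are typically answered from the resolver's cache
    # and so return noticeably faster than the first.
    name = "lcgwms01.gridpp.rl.ac.uk"  # illustrative hostname
    for attempt in range(3):
        start = time.time()
        socket.gethostbyname(name)
        elapsed_ms = (time.time() - start) * 1000.0
        print("lookup %d took %.1f ms" % (attempt + 1, elapsed_ms))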


Entries in the GOC DB starting between 27th February and 6th March 2013.

There were no entries in the GOC DB starting in this period.

Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
91974 | Green | Urgent | In Progress | 2013-03-04 | 2013-03-04 | NAGIOS | *eu.egi.sec.EMI-1* failed on lcgwms01.gridpp.rl.ac.uk@RAL-LCG2
91687 | Red | Less Urgent | Waiting Reply | 2013-02-21 | 2013-03-06 | epic | Support for epic.vo.gridpp.ac.uk VO on WMS
91658 | Red | Less Urgent | In Progress | 2013-02-20 | 2013-02-22 | - | LFC webdav support
91146 | Red | Urgent | In Progress | 2013-02-04 | 2013-03-05 | Atlas | RAL input bandwith issues
91029 | Red | Very Urgent | On Hold | 2013-01-30 | 2013-02-27 | Atlas | FTS problem in queryin jobs
90528 | Red | Less Urgent | In Progress | 2013-01-17 | 2013-02-19 | SNO+ | WMS not assiging jobs to sheffield
86152 | Red | Less Urgent | In Progress | 2012-09-17 | 2013-03-06 | - | correlated packet-loss on perfsonar host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
27/02/13 | 100 | 100 | 97.5 | 95.9 | 100 | Atlas: a few SRM test timeouts; CMS: single SRM test timeout.
28/02/13 | 100 | -100 | 100 | 100 | 100 | Problem with ALICE monitoring.
01/03/13 | 100 | -100 | 100 | 95.9 | 100 | Problem with ALICE monitoring; CMS: single SRM test timeout.
02/03/13 | 100 | -100 | 100 | 100 | 100 | Problem with ALICE monitoring.
03/03/13 | 100 | -100 | 100 | 100 | 100 | Problem with ALICE monitoring.
04/03/13 | 100 | -100 | 100 | 100 | 100 | Problem with ALICE monitoring.
05/03/13 | 100 | 100 | 100 | 95.8 | 100 | CMS: single SRM test failure coincident with the network router reboot.