Tier1 Operations Report 2013-03-13


RAL Tier1 Operations Report for 13th March 2013

Review of Issues during the week 6th to 13th March 2013.
  • On Wednesday & Thursday (6/7 March) there were problems on the RAL network. One "Outage" plus two "Warnings" were declared in the GOC DB. The problems caused intermittent breaks in the Tier1's connectivity - mainly to the outside world but also to the rest of RAL. The core networking team found and fixed these problems on the Thursday afternoon.
  • On Monday (11th) there was a problem with tape migration traced to a single corrupt file on a disk server. The data loss has been reported to T2K.
  • The planned network intervention yesterday (Tues 12th March) overran significantly. A total outage of around 12 hours resulted (almost double that planned). Furthermore, following problems during the work, the network uplink to the UKLight router is now running on a single 10Gbit link rather than a pair of such links.
Resolved Disk Server Issues
  • GDSS594 (GenTape), which failed a few weeks ago, has now been retired from service and will be used for spares.
Current operational status and issues
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • There have been intermittent problems over the past few weeks with the start rate for batch jobs. A script has been introduced to regularly check for this and take action to minimise its effects (a sketch of this kind of check is shown after this list).
  • We are investigating a higher job failure rate for LHCb and Atlas. This appears to be caused by job set-ups taking a long time. Yesterday (12th Mar) we made a change to run jobs re-niced, although initial results suggest this has not fixed the problem.
  • Investigations are ongoing into asymmetric bandwidth to/from other sites. We are seeing some poor outbound rates - a problem which disappears when we are not loading the network. We are awaiting confirmation of the effect of yesterday's changes (12th Mar).
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • We are participating in xrootd federated access tests for Atlas.
  • A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
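
The start-rate check referred to above is not reproduced here; the following is a minimal sketch of the kind of check such a script could perform, assuming a Torque-style batch system where `qstat` reports a job state column. The threshold, state-file location and corrective action are illustrative assumptions, not the production values.

```python
#!/usr/bin/env python
"""Periodic check of the batch job start rate (illustrative sketch only).

Assumes a Torque-style batch system where `qstat` lists jobs with a
state column ('R' = running). The threshold, state-file path and the
corrective action are assumptions, not the values used in production.
"""
import subprocess

STATE_FILE = "/var/tmp/running_jobs.prev"   # assumed location
MIN_NEW_STARTS = 10                          # assumed per-interval threshold


def running_job_ids():
    """Return the set of job IDs currently in the running ('R') state."""
    out = subprocess.check_output(["qstat"]).decode()
    ids = set()
    for line in out.splitlines()[2:]:        # skip the two header lines
        fields = line.split()
        if len(fields) >= 5 and fields[4] == "R":
            ids.add(fields[0])
    return ids


def main():
    current = running_job_ids()
    try:
        with open(STATE_FILE) as f:
            previous = set(f.read().split())
    except IOError:
        previous = set()

    newly_started = current - previous
    if previous and len(newly_started) < MIN_NEW_STARTS:
        # Placeholder for the corrective action taken by the real script,
        # e.g. alerting operators or prodding the scheduler.
        print("WARNING: only %d jobs started since last check" % len(newly_started))

    with open(STATE_FILE, "w") as f:
        f.write("\n".join(sorted(current)))


if __name__ == "__main__":
    main()
```
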
Ongoing Disk Server Issues
  • GDSS519 (GenTape) was put into draining mode following the discovery of a single corrupt file. Once all remaining files had been migrated it was taken out of production to be checked out (a checksum-check sketch follows below).
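
As an illustration of how a corrupt file such as the one found on GDSS519 can be detected, the sketch below compares a file's adler32 checksum against the value expected from the storage catalogue. The file path, the expected checksum and the way it would be obtained from Castor are placeholders, not the actual tooling used.

```python
#!/usr/bin/env python
"""Illustrative sketch: compare a file's adler32 checksum against the
value recorded in the storage catalogue to detect on-disk corruption.
The path and expected checksum are placeholders, not Castor tooling."""
import zlib


def adler32_of(path, blocksize=1024 * 1024):
    """Compute the adler32 checksum of a file, reading it in blocks."""
    value = 1  # adler32 seed
    with open(path, "rb") as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            value = zlib.adler32(block, value)
    return value & 0xFFFFFFFF


def is_corrupt(path, expected_hex):
    """Return True if the on-disk checksum differs from the catalogue value."""
    return adler32_of(path) != int(expected_hex, 16)


if __name__ == "__main__":
    # Example usage with placeholder values.
    print(is_corrupt("/exportstage/somefile", "0a1b2c3d"))
```
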
Notable Changes made this last week
  • The core network switch in the Tier1 Network has been replaced (Tuesday 12th March) providing more ports for network expansion.
  • During the network change (Tuesday 12th March) the uplink to one of the network stacks (stack 13), serving the SL09 disk servers (~3PB of storage), was doubled in capacity (from 2*10Gbit to 4*10Gbit) to resolve a bottleneck.
  • Batch queue parameters were modified to run jobs on the worker nodes re-niced (an illustrative sketch follows after this list).
  • The RAL site BDIIs have been upgraded to EMI-2.
  • New EMI-2 WMS nodes (lcgwms04, lcgwms05, lcgwms06) have been added into production; the old EMI-1 ones (lcgwms01, 02, 03) will be drained and retired shortly (by the end of March at the latest).
  • The Castor client software has been upgraded to version 2.1.13 on one batch of worker nodes.
  • Updating of the disk controller firmware on the Clustervision '11 batch of disk servers is ongoing.
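
The re-nicing change above was made through the batch system configuration; purely as an illustration of the idea, the sketch below runs a job payload at reduced CPU priority via a wrapper. The niceness value and the wrapper approach are assumptions, not the configuration actually deployed.

```python
#!/usr/bin/env python
"""Illustrative job wrapper: run the job payload re-niced (at lower CPU
priority). The niceness value is an assumption; the real change was made
through the batch system configuration rather than a wrapper like this."""
import os
import subprocess
import sys

NICENESS = 10  # assumed niceness increment


def run_reniced(command):
    """Run `command` with increased niceness so system processes on the
    worker node are favoured over the job payload."""
    def lower_priority():
        os.nice(NICENESS)
    return subprocess.call(command, preexec_fn=lower_priority)


if __name__ == "__main__":
    # Usage: wrapper.py <payload command> [args...]
    sys.exit(run_reniced(sys.argv[1:]))
```
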
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • A change to the certificate used by the MyProxy server will be introduced on Monday (18th Mar).
  • A program of updating the disk controller firmware in the 2011 Clustervision batch of disk servers is ongoing.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Single link to the UKLight Router to be restored as a paired (2*10Gbit) link.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Upgrade of other EMI-1 components (APEL, UI) under investigation.
  • Infrastructure:
    • Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
    • Electrical safety check (will require some downtime).


Entries in the GOC DB starting between 6th and 13th March 2013.

There were five unscheduled entries in the GOC DB for last week. Three of these (one "Outage", two "Warnings") were for the RAL networking problems on Wednesday/Thursday 6/7 March. The other two were unscheduled extensions to a planned downtime to restructure the Tier1 network on 12th March.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole Site | UNSCHEDULED | OUTAGE | 12/03/2013 19:00 | 12/03/2013 21:00 | 2 hours | The main work on our network is over however it is taking a little time to restore services. Unfortunately it is therefore necessary to make a small further extension to our downtime.
Whole Site | UNSCHEDULED | OUTAGE | 12/03/2013 15:30 | 12/03/2013 19:00 | 3 hours and 30 minutes | Extending Outage as some problems encountered during the intervention to reconfigure the core of the Tier1's network.
Whole Site | SCHEDULED | OUTAGE | 12/03/2013 08:45 | 12/03/2013 15:30 | 6 hours and 45 minutes | Reconfiguration of core network within the RAL Tier1. Storage (Castor) services will be stopped. LFC stopped. FTS and Batch drained of active transfers/jobs. Other services (e.g. BDII) may see some short breaks in connectivity.
Whole Site | UNSCHEDULED | WARNING | 07/03/2013 10:00 | 07/03/2013 16:30 | 6 hours and 30 minutes | Some network issues ongoing and under investigation.
Whole Site | UNSCHEDULED | WARNING | 06/03/2013 15:00 | 07/03/2013 10:00 | 19 hours | At risk while recovering from network outage.
Whole Site | UNSCHEDULED | OUTAGE | 06/03/2013 09:30 | 06/03/2013 15:00 | 5 hours and 30 minutes | Network outage at the RAL tier1
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
92459 | Green | Less Urgent | In Progress | 2013-03-12 | 2013-03-13 | EPIC | LFC support for epic.vo.gridpp.ac.uk VO
92266 | Amber | Less Urgent | In Progress | 2013-03-06 | 2013-03-08 | | Certificate for RAL myproxy server
91974 | Red | Urgent | In Progress | 2013-03-04 | 2013-03-04 | | NAGIOS *eu.egi.sec.EMI-1* failed on lcgwms01.gridpp.rl.ac.uk@RAL-LCG2
91687 | Red | Less Urgent | In Progress | 2013-02-21 | 2013-03-06 | epic | Support for epic.vo.gridpp.ac.uk VO on WMS
91658 | Red | Less Urgent | In Progress | 2013-02-20 | 2013-02-22 | | LFC webdav support
91146 | Red | Urgent | In Progress | 2013-02-04 | 2013-03-05 | Atlas | RAL input bandwith issues
91029 | Red | Very Urgent | On Hold | 2013-01-30 | 2013-02-27 | Atlas | FTS problem in queryin jobs
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-03-06 | | correlated packet-loss on perfsonar host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
06/03/13 | 97.3 | 100 | 99.2 | 84.5 | 91.8 | Network problems affecting RAL
07/03/13 | 100 | 100 | 77.3 | 83.4 | 83.4 | Network problems affecting RAL
08/03/13 | 100 | 100 | 100 | 100 | 100 |
09/03/13 | 100 | 100 | 100 | 100 | 100 |
10/03/13 | 100 | 100 | 100 | 95.9 | 100 | Single SRM test failure - user timeout.
11/03/13 | 100 | 100 | 100 | 100 | 100 |
12/03/13 | 44.2 | 25.2 | 43.4 | 43.4 | 41.7 | Planned network update (C300 replacement) which overran