Tier1 Operations Report 2013-01-30


RAL Tier1 Operations Report for 30th January 2013

Review of Issues during the week 23rd to 30th January 2013.
  • The work on the main site power supply has been completed. This started last June, and one half of the switchgear was brought into service on 17th September. The work on the second half is now finished and it was brought into use on Monday (28th Jan). This restores resilience in this part of the site power supply.
Resolved Disk Server Issues
  • GDSS594 (GenTape - D0T1) was taken out of production on Tuesday (22nd Jan) with multiple disk failures. It was returned to service on Thursday (24th).
  • GDSS433 (AtlasDataDisk - D1T0) failed with a read-only filesystem on Friday (25th Jan). It was returned to service on Sunday (27th).
Current operational status and issues
  • The batch server process sometimes consumes excessive memory, which is normally triggered by a network/communication problem with the worker nodes. A test for this (with a re-starter) is in place; a sketch of such a check is given after this list.
  • High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3PB of storage).
  • Investigations are ongoing into asymmetric bandwidth to/from other sites. We are seeing some poor outbound rates - a problem which disappears when we are not loading the network.
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • Problems with the Top-BDII (the daemon restarts) continue to be seen and are known to cause issues. The rolling upgrade of the Top-BDII is underway.
  • A system has been set up for participation in xrootd federated access tests for Atlas.
  • A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
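A minimal sketch of the kind of memory check and re-starter mentioned above, assuming a hypothetical batch server daemon name ("pbs_server") and an illustrative memory limit; the check actually deployed may differ.

  #!/usr/bin/env python
  # Hypothetical memory watchdog for the batch server daemon, intended to be
  # run from cron. The process name and limit are illustrative assumptions.
  import subprocess

  PROCESS = "pbs_server"         # assumed daemon name
  LIMIT_KB = 8 * 1024 * 1024     # assumed restart threshold: 8 GB resident

  def rss_kb(name):
      """Total resident memory (kB) of all processes with this name."""
      try:
          out = subprocess.check_output(["ps", "-C", name, "-o", "rss="])
      except subprocess.CalledProcessError:
          return 0               # no matching process is running
      return sum(int(x) for x in out.decode().split())

  if rss_kb(PROCESS) > LIMIT_KB:
      # Restart via the init script rather than killing the process outright.
      subprocess.call(["service", PROCESS, "restart"])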
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • On Thursday (24th Jan) the batch farm was configured to have access to the CVMFS areas for na62 and mice (a sketch of a simple check of these areas is given after this list).
  • On Tuesday (29th Jan) the Argus server was upgraded to EMI-2/SL6.
  • On Tuesday (29th Jan) the RAL status page (http://www.gridpp.rl.ac.uk/status/) was modified to show tape usage information.
  • On Wednesday (30th Jan) an upgraded version of the Maui batch scheduler was installed.
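A minimal sketch of a check that the new CVMFS areas are visible from a worker node; the repository paths shown are assumptions for illustration and may not match the actual repository names.

  #!/usr/bin/env python
  # Hypothetical check that the newly enabled CVMFS areas can be read on a
  # worker node. The repository paths below are illustrative assumptions.
  import os

  REPOS = ["/cvmfs/na62.gridpp.ac.uk", "/cvmfs/mice.gridpp.ac.uk"]

  for repo in REPOS:
      # Listing the top-level directory triggers the autofs mount if needed.
      try:
          entries = os.listdir(repo)
          print("%s OK (%d entries)" % (repo, len(entries)))
      except OSError as err:
          print("%s NOT accessible: %s" % (repo, err))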
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Replace central switch (C300). This will:
      • Improve the stack 13 uplink.
      • Change the network trunking as part of investigation (& possible fix) into asymmetric data rates.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Checking VO usage of, and requirement for, AFS clients from Worker Nodes.
    • Upgrade Top-BDIIs to latest (EMI-2) version on SL6.
  • Infrastructure:
    • Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.


Entries in GOC DB starting between 23rd and 30th January 2013.

There were no unscheduled entries in the GOC DB for this period.

Service Scheduled? Outage/At Risk Start End Duration Reason
lcgce01, lcgce02, lcgce04, lcgce10, lcgce11 SCHEDULED WARNING 29/01/2013 10:00 29/01/2013 12:00 2 hours Update of Argus Server to SL6/EMI-2
Whole Site SCHEDULED WARNING 28/01/2013 08:00 28/01/2013 20:00 12 hours Following completion of work on the power supply to RAL, new equipment will be switched in. This will be in parallel with the existing equipment and re-enables redundancy in the transformer/switchgear.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
90995 Green Less Urgent In Progress 2013-01-29 2013-01-30 CMS Stageout errors for single workflow at RAL
90986 Green Urgent In Progress 2013-01-29 2013-01-29 NA62 FTS channell BELGRID-UCL to RAL-LCG2 for na62
90844 Green Less Urgent In Progress 2013-01-26 2013-01-28 LFC for cernatschool.org
90528 Red Less Urgent In Progress 2013-01-17 2013-01-17 SNO+ WMS not assiging jobs to sheffield
90151 Red Less Urgent Waiting Reply 2013-01-08 2013-01-24 NEISS Support for NEISS VO on WMS
89733 Red Urgent In Progress 2012-12-17 2013-01-21 RAL bdii giving out incorrect information
86152 Red Less Urgent On Hold 2012-09-17 2013-01-16 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
23/01/13 100 100 100 100 100
24/01/13 87.5 100 100 100 100 Failed the Site-BDII test as the SL6 test CE had the wrong string in GlueHostOperatingSystemName.
25/01/13 100 100 100 100 100
26/01/13 100 100 68.6 100 100 The CE test jobs did not run within the time allowed because the maximum number of AtlasSGM jobs was reached; these were queued behind SL6 Atlas S/W validation jobs.
27/01/13 100 100 96.2 100 100 Four SRM test failures - unable to delete file from SRM.
28/01/13 100 100 100 100 100
29/01/13 100 100 77.4 100 100 Repeat of problem of 26/1/13. Fix to batch scheduler did not work and the jobs queued behind more SL6 S/W validation jobs.