Tier1 Operations Report 2013-05-15


RAL Tier1 Operations Report for 15th May 2013

Review of Issues during the week 8th to 15th May 2013.
  • There have been two occasions during the last week when the OPN link to CERN has failed over to the backup route (Wednesday 8 May and around midnight on Saturday/Sunday 11/12 May). In each case the link switched back to the primary route after two to three hours, and there was no operational impact.
  • A normally routine swap of a failed fan in a disk array took down the standby Castor databases for a while last Friday (10 May). This did not affect operations.
  • There has been a high rate of outbound traffic saturating the uplink (currently 10Gbit) for the last couple of days. Investigations show this is predominantly Atlas traffic to many different sites.
Resolved Disk Server Issues
  • None.
Current operational status and issues
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • The problem of LHCb and Atlas jobs failing due to long job set-up times remains, and investigations continue.
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • We are participating in xrootd federated access tests for Atlas.
  • A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
  • There is an outstanding problem (and GGUS ticket) affecting the certificate on the MyProxy server.
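The MyProxy certificate issue is tracked in GGUS ticket 92266 (listed below). As a quick aid when following it up, a minimal sketch for checking the subject and validity dates of a host certificate is given here; it assumes the Python 'cryptography' package and the conventional /etc/grid-security/hostcert.pem location, both of which are illustrative rather than a record of the actual configuration.

  # Minimal sketch: print the subject and validity dates of a host certificate.
  # The /etc/grid-security/hostcert.pem path is the conventional location and
  # is used here only as an illustration.
  import datetime
  from cryptography import x509   # pip install cryptography

  with open("/etc/grid-security/hostcert.pem", "rb") as f:
      cert = x509.load_pem_x509_certificate(f.read())

  print("Subject:   ", cert.subject.rfc4514_string())
  print("Not before:", cert.not_valid_before)
  print("Not after: ", cert.not_valid_after)
  if cert.not_valid_after < datetime.datetime.utcnow():
      print("The certificate has expired.")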
Ongoing Disk Server Issues
  • None.
Notable Changes made this last week
  • On Wednesday (8 May) the Castor primary and standby databases were swapped over and Oracle Data Guard was re-established between them. (A verification sketch follows this list.)
  • On Thursday (9 May) seven new disk servers (630TB) were added to AtlasDataDisk.
  • A further six disk servers have been added to LHCbDst today (Wed 15 May).
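For reference, a minimal sketch of how the database roles can be confirmed after a Data Guard switchover such as the one on 8 May is given below. It uses the cx_Oracle module to query v$database on each side; the connection strings and credentials are placeholders, not the actual Castor database details.

  # Minimal sketch: report the Data Guard role of each database after a switchover.
  # The DSNs and the credentials are placeholders (hypothetical), not real values.
  import cx_Oracle

  PLACEHOLDER_DSNS = ("castor-primary.example.ac.uk/castor",
                      "castor-standby.example.ac.uk/castor")

  for dsn in PLACEHOLDER_DSNS:
      conn = cx_Oracle.connect("sys", "CHANGE_ME", dsn, mode=cx_Oracle.SYSDBA)
      cur = conn.cursor()
      cur.execute("SELECT database_role, switchover_status FROM v$database")
      role, status = cur.fetchone()
      print(dsn, "->", role, "/", status)
      conn.close()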
Declared in the GOC DB
  • None.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Re-establishing the paired (2*10Gbit) link to the UKLight router. (Aiming to do this in the next few weeks.)
  • Tuesday 21st May - Planned networking intervention at RAL.
  • The blocking issue regarding the Castor 2.1.13 upgrade has been resolved and the scheduling of this upgrade will proceed. (The non-Tier1 'Facilities' Castor instance has already been successfully upgraded.)

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update the core Tier1 network and change the connection to the site and OPN, including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Grid Services
    • Testing of alternative batch systems (SLURM, Condor).
    • An upgrade of the one remaining EMI-1 component (the UI) is being planned.
  • Fabric
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
  • Infrastructure:
    • Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
    • Electrical safety check. This will require significant downtime (most likely 2 days), during which the above infrastructure issues will also be addressed.
Entries in GOC DB starting between 8th and 15th May 2013.

There was one unscheduled outage during the last week, for lcgce12 (the CE for the test SL6 queue).

Service Scheduled? Outage/At Risk Start End Duration Reason
lcgce12.gridpp.rl.ac.uk UNSCHEDULED OUTAGE 10/05/2013 16:00 10/05/2013 16:51 51 minutes HW failure
All Castor & Batch (CEs) SCHEDULED OUTAGE 08/05/2013 10:00 08/05/2013 12:00 2 hours Stop of Castor storage system while primary and standby databases are switched over. During the stop no batch jobs will be started. Batch work already running may be paused (depending on the VO).


Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
94049 Green Urgent In Progress 2013-05-14 2013-05-15 OPS NAGIOS *eu.egi.sec.Argus-EMI-1* failed on lcgargus01.gridpp.rl.ac.uk@RAL-LCG2
93870 Red Less Urgent In Progress 2013-05-06 2013-05-07 CMS T1_UK_RAL squid upgrade
93149 Red Less Urgent On Hold 2013-04-05 2013-04-13 Atlas RAL-LCG2: jobs failing with " cmtside command was timed out"
92266 Red Less Urgent Waiting for Reply 2013-03-06 2013-04-16 Certificate for RAL myproxy server
91658 Red Less Urgent On Hold 2013-02-20 2013-04-03 LFC webdav support
86152 Red Less Urgent On Hold 2012-09-17 2013-03-19 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
08/05/13 91.7 84.7 91.7 84.7 84.7 Scheduled Castor outage for the switch of the primary / standby databases.
09/05/13 100 100 100 100 100
10/05/13 100 100 100 100 100
11/05/13 100 100 98.2 100 100 SRM Put test failed with zero number of replicas.
12/05/13 100 100 100 100 100
13/05/13 100 94.0 100 100 100 Tests failed during restart of pbs_server (batch server).
14/05/13 100 100 100 100 100
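(For context on how these figures relate to the outages above: the scheduled two-hour Castor stop on 08/05/13 leaves 22 available hours out of 24, i.e. 22/24 ≈ 91.7%, which matches the OPS figure for that day.)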