Tier1 Operations Report 2013-08-14

RAL Tier1 Operations Report for 14th August 2013

Review of Issues during the week 7th to 14th August 2013.
  • There has been a high rate of timeouts on the CMS Castor instance. These are limited to CMS-tape and may be a consequence of draining a large number of disk servers in that service class. This is being followed up (a rough client-side timeout probe is sketched below).
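As a rough illustration of the kind of check involved, the sketch below drives a single SRM 'put' with a client-side timeout so that a hang is reported rather than waited out. It assumes the gfal2-util client (gfal-copy) is available; the endpoint, path and timeout value are hypothetical placeholders, not the actual CMS Castor settings.

    #!/usr/bin/env python3
    """Rough sketch only: probe an SRM endpoint for 'put' timeouts.

    Assumptions (not from the report): gfal-copy from gfal2-util is on PATH,
    and DEST below is a hypothetical placeholder, not the real CMS Castor
    endpoint or path.
    """
    import subprocess
    import tempfile
    import time

    DEST = "srm://srm-cms.example.ac.uk/castor/example.ac.uk/cmsTest/probe_file"
    TIMEOUT_S = 300  # client-side limit; a hang beyond this is reported, not waited out


    def probe_put():
        # Write a small local source file, then attempt a single SRM 'put'.
        with tempfile.NamedTemporaryFile(suffix=".dat") as src:
            src.write(b"x" * 1024)
            src.flush()
            start = time.time()
            try:
                subprocess.run(["gfal-copy", "file://" + src.name, DEST],
                               check=True, timeout=TIMEOUT_S)
                print("put OK in %.1fs" % (time.time() - start))
            except subprocess.TimeoutExpired:
                print("put timed out after %ds" % TIMEOUT_S)
            except subprocess.CalledProcessError as err:
                print("put failed with exit code %d" % err.returncode)


    if __name__ == "__main__":
        probe_put()
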
Resolved Disk Server Issues
  • GDSS664 (AtlasDataDisk - D1T0) was put back into production yesterday. This server had been drained following multiple disk drive problems. It was originally removed from service on the 11th July.
Current operational status and issues
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • The FTS3 testing has continued very actively. Atlas have moved the UK, German and French clouds to use it. Problems with FTS3 are being uncovered during these tests. One such issue required an urgent patch which was applied yesterday (Tuesday 13th Aug).
  • We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xroot redirector, whereas before it could only serve as a proxy.
  • Testing is ongoing with the proposed new batch system (ARC-CEs, Condor, SL6). Atlas and CMS are running work through this; ALICE, LHCb & H1 are being brought on board with the testing.
  • Atlas have reported a problem with slow file deletions. This is being investigated; the problem seems also to affect the RAL Tier2 (a simple timing sketch is given below).
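To help put a number on "slow", a minimal timing sketch along the following lines could be run against a set of test files. It assumes gfal-rm from gfal2-util is available; the file URLs are hypothetical placeholders for files owned by the VO.

    #!/usr/bin/env python3
    """Rough sketch only: time SRM file deletions to quantify slow deletes.

    Assumptions (not from the report): gfal-rm from gfal2-util is on PATH and
    the URLs in FILES are hypothetical placeholders for test files.
    """
    import subprocess
    import time

    FILES = [
        "srm://srm-atlas.example.ac.uk/castor/example.ac.uk/atlasTest/file_%04d" % i
        for i in range(10)
    ]


    def timed_delete(url):
        # Issue one deletion and return (elapsed seconds, success flag).
        start = time.time()
        result = subprocess.run(["gfal-rm", url])
        return time.time() - start, result.returncode == 0


    def main():
        times = []
        for url in FILES:
            elapsed, ok = timed_delete(url)
            times.append(elapsed)
            print("%-70s %6.2fs %s" % (url, elapsed, "ok" if ok else "FAILED"))
        print("mean delete time: %.2fs over %d files" % (sum(times) / len(times), len(times)))


    if __name__ == "__main__":
        main()
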
Ongoing Disk Server Issues
  • None
Notable Changes made this last fortnight.
  • The list of non-LHC VOs enabled on the (test) ARC-CEs has been extended; it now includes hone, biomed, mice, na62, superb and snoplus.
Declared in the GOC DB
  • Thursday 15th August. Whole Site (whole of Tier1) At Risk during work to replace a fan in the UPS.
  • LCGCE12 (CE for SL6 test Queue on the production batch farm) is in a long Outage ready for decommissioning.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • The SL6 and "Whole Node" queues on the production batch service will be terminated. Multi-core jobs and those requiring SL6 can be run on the test Condor batch system (a minimal multi-core submission sketch is given at the end of this section).
  • Re-establishing the paired (2*10Gbit) link to the UKLight router.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • None
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Grid Services
    • Testing of alternative batch systems (SLURM, Condor) along with ARC-CEs and SL6 Worker Nodes.
  • Fabric
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
  • Infrastructure:
    • A 2-day maintenance is being planned for the first week in November (TBC) for the following. This is expected to require around a half-day outage of power to the UPS room, with Castor & batch down for the remaining 1.5 days as equipment is switched off in rotation for the tests.
      • Intervention required on the "Essential Power Board" & remedial work on three (out of four) transformers.
      • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
      • Electrical safety check. This will require significant (most likely 2 days) downtime during which time the above infrastructure issues will also be addressed.
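As a rough illustration of the multi-core case mentioned in the advanced warnings above, the sketch below writes an HTCondor submit description requesting several cores and passes it to condor_submit. The executable name, resource figures and other settings are illustrative assumptions, not the Tier1's actual configuration.

    #!/usr/bin/env python3
    """Rough sketch only: submit a multi-core job to an HTCondor pool.

    Assumptions (not from the report): condor_submit is on PATH, the pool
    accepts vanilla-universe jobs, and run_payload.sh plus the resource
    requests below are illustrative placeholders.
    """
    import subprocess
    import tempfile

    SUBMIT_DESCRIPTION = """\
    universe       = vanilla
    executable     = run_payload.sh
    request_cpus   = 8
    request_memory = 16000
    output         = job.$(Cluster).out
    error          = job.$(Cluster).err
    log            = job.$(Cluster).log
    queue 1
    """


    def submit_multicore_job():
        # Write the submit description to a temporary file and hand it to
        # condor_submit, which prints the assigned cluster id on success.
        with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as sub:
            sub.write(SUBMIT_DESCRIPTION)
            path = sub.name
        subprocess.run(["condor_submit", path], check=True)


    if __name__ == "__main__":
        submit_multicore_job()
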
Entries in GOC DB starting between the 7th and 14th August 2013.

There was one unscheduled Warning in the GOC DB during this last week. This is for an intervention on a disk array behind the Atlas 3D service.

Service Scheduled? Outage/At Risk Start End Duration Reason
lcgft-atlas UNSCHEDULED WARNING 14/08/2013 09:00 14/08/2013 17:00 8 hours Atlas 3D service at RAL at Risk while a standby power supply is swapped in a disk array.
lcgce12.gridpp.rl.ac.uk SCHEDULED OUTAGE 06/08/2013 13:00 05/09/2013 13:00 30 days CE (and the SL6 batch queue behind it) being decommissioned.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
96079 Amber Less Urgent Waiting Reply 2013-08-08 2013-08-08 CMS Transfers from Caltech to RAL are failing
96321 Red Less Urgent Waiting Reply 2013-08-02 2013-08-06 SNO+ SNO+ srm tests failing
96235 Red Less Urgent In Progress 2013-07-29 2013-08-09 hyperk.org LFC for hyperk.org
96233 Red Less Urgent In Progress 2013-07-29 2013-08-09 hyperk.org WMS for hyperk.org - RAL
95996 Red Urgent In Progress 2013-07-22 2013-07-22 OPS SHA-2 test failing on lcgce01
91658 Red Less Urgent In Progress 2013-02-20 2013-08-09 LFC webdav support
86152 Red Less Urgent On Hold 2012-09-17 2013-06-17 correlated packet-loss on perfsonar host
Availability Report
Day OPS(%) Alice(%) Atlas(%) CMS(%) LHCb(%) Comment
07/08/13 100 100 100 100 100
08/08/13 100 100 100 95.9 100 Single SRM 'put' test failure "User timeout"
09/08/13 100 100 99.2 91.8 100 CMS: Multiple SRM 'put' test failures "User timeout"; Atlas: Single test failure (timeout) on SRM 'Get'
10/08/13 100 100 100 83.6 100 Multiple SRM 'put' test failures "User timeout"
11/08/13 100 100 100 79.6 100 Multiple SRM 'put' test failures "User timeout"
12/08/13 100 100 100 87.7 100 Multiple SRM 'put' test failures "User timeout"
13/08/13 100 100 100 95.6 100 Single SRM 'put' test failure "User timeout"