Tier1 Operations Report 2013-06-26


RAL Tier1 Operations Report for 26th June 2013

Review of Issues during the week 19th to 26th June 2013.
  • Last week's report referred to an announced scheduled maintenance on both CERN Primary & Backup links overnight Tue/Wed 25/26 June. This was subsequently clarified to only affect the Backup Link - and no break in the connectivity of the Primary Link was seen during the announced time window.
  • There was a problem with the Atlas SRM database late on Saturday evening (22nd June). This was resolved by the call-out team around midnight; the problem lasted a few hours in total.
  • There have been further problems with the Atlas SRM database this morning: one of the Oracle RAC nodes became unstable, and an Unscheduled Outage has been declared in the GOC DB.
  • A problem has been seen on the RAL site firewall since Saturday (22nd): the logging of a large number of connection requests was causing high load. This was traced to the ALICE file sharing system making many outbound connections (see the connection-count sketch after this list). The number of ALICE jobs is being restricted temporarily, which has eased the problem while a better fix is decided on.
  • The intermittent problems starting LHCb batch jobs reported last week have largely disappeared this week. However, the cause is not fully understood.
  • The issue of LHCb CE tests ending up in the 'whole node' queue (reported last week) is now understood. Discussions with LHCb are ongoing.
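The firewall problem above was pinned down by seeing which hosts were opening the most outbound connections. As a rough illustration of that kind of check, the sketch below tallies connections per source host from a plain-text connection log; the log path and column layout are assumptions, not the actual RAL firewall log format.

    # Hypothetical sketch: count outbound connections per source host from a
    # plain-text connection log. The log path and column layout are assumed,
    # not taken from the real RAL firewall configuration.
    from collections import Counter

    LOG_FILE = "/var/log/firewall/connections.log"   # placeholder path

    counts = Counter()
    with open(LOG_FILE) as log:
        for line in log:
            fields = line.split()
            if len(fields) < 2:
                continue
            counts[fields[0]] += 1   # assumed: first column is the source host

    # Show the ten hosts opening the most outbound connections.
    for host, n in counts.most_common(10):
        print("%8d  %s" % (n, host))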
Resolved Disk Server Issues
  • Disk Server GDSS720 (AtlasDataDisk - D1T0) has been completely drained following its failure a couple of weeks ago. The server is out of production for a hardware fix & testing.
Current operational status and issues
  • The successful UPS/Generator load test yesterday gives us much more confidence that this system would work if there were to be a power failure.
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • The problem of LHCb jobs failing due to long job set-up times remains and investigations continue. Recent updates to the CVMFS clients have improved the situation for Atlas.
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • We are participating in xrootd federated access tests for Atlas (a simple illustrative read check is sketched after this list).
  • Testing of the proposed new batch system (ARC-CEs, Condor, SL6) is ongoing. Atlas and CMS are running work through it, and ALICE & LHCb are being brought on board with the testing.
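As a rough illustration of the federated access testing mentioned above, the sketch below reads a file through an xrootd redirector with the standard xrdcp client; the redirector hostname and file path are placeholders, not the endpoints actually used in the Atlas tests.

    # Minimal sketch of a federated xrootd read check: copy a file via a
    # redirector using xrdcp. Hostname and path are placeholders.
    import subprocess
    import tempfile

    REDIRECTOR = "redirector.example.org"          # placeholder
    TEST_FILE = "/atlas/path/to/a/test/file.root"  # placeholder

    with tempfile.NamedTemporaryFile() as local_copy:
        status = subprocess.call(
            ["xrdcp", "-f",
             "root://%s/%s" % (REDIRECTOR, TEST_FILE),
             local_copy.name])

    print("federated read %s" % ("OK" if status == 0 else "FAILED"))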
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • There was a successful UPS/Generator load test yesterday (25th June).
  • Started to roll out new EMI-3 site and top BDIIs on SL6.4 (a basic publishing check is sketched after this list).
  • A new SL6/EMI-3 UI has been set-up.
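A straightforward way to confirm that a newly deployed BDII is publishing is to query its LDAP endpoint on port 2170. The sketch below wraps the standard ldapsearch query; the hostname is a placeholder rather than one of the new RAL hosts.

    # Minimal sketch of a BDII publishing check: query the LDAP endpoint and
    # count the GlueServiceEndpoint attributes returned. Hostname is a placeholder.
    import subprocess

    BDII_HOST = "bdii.example.org"   # placeholder

    output = subprocess.check_output(
        ["ldapsearch", "-x", "-LLL",
         "-H", "ldap://%s:2170" % BDII_HOST,
         "-b", "o=grid",
         "(objectClass=GlueService)", "GlueServiceEndpoint"]).decode()

    endpoints = output.count("GlueServiceEndpoint:")
    print("BDII %s published %d service endpoints" % (BDII_HOST, endpoints))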
Declared in the GOC DB
  • None.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • The first part of the Castor 2.1.13 upgrade, updating the Castor Nameserver, is being planned for next Wednesday (3rd July) and will entail a complete Castor stop. The Castor Stagers for the individual instances will be upgraded in the following weeks.
  • The test ARC-CEs will be added into the BDII tomorrow morning (27th June).
  • Re-establishing the paired (2*10Gbit) link to the UKLight router. (Aiming to do this in the next few weeks.)

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Grid Services
    • Testing of alternative batch systems (SLURM, Condor) along with ARC-CEs and SL6 Worker Nodes (a trivial test submission is sketched after this list).
  • Fabric
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
  • Infrastructure:
    • A 2-day maintenance is being planned sometime in October or November for the following. This is expected to require around a half-day outage of power to the UPS room, with Castor & batch down for the remaining 1.5 days as equipment is switched off in rotation for the tests.
      • Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
      • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
      • Electrical safety check. This will require significant (most likely 2 days) downtime during which time the above infrastructure issues will also be addressed.
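As a rough illustration of the batch system testing noted under Grid Services, the sketch below submits a trivial job to an HTCondor schedd by writing a minimal submit description and calling condor_submit; it assumes a local HTCondor client installation and is not the actual test workload used by the experiments.

    # Minimal sketch of a trivial HTCondor test submission. Assumes a local
    # HTCondor client; the job simply runs /bin/hostname on a worker node.
    import subprocess
    import tempfile

    SUBMIT_LINES = [
        "universe   = vanilla",
        "executable = /bin/hostname",
        "output     = test_job.out",
        "error      = test_job.err",
        "log        = test_job.log",
        "queue",
    ]
    SUBMIT_DESCRIPTION = "\n".join(SUBMIT_LINES) + "\n"

    with tempfile.NamedTemporaryFile(suffix=".sub", delete=False) as sub_file:
        sub_file.write(SUBMIT_DESCRIPTION.encode())
        submit_path = sub_file.name

    # Hand the job to the local schedd.
    subprocess.check_call(["condor_submit", submit_path])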
Entries in GOC DB starting between 19th and 26th June 2013.

There was one unscheduled entry in the GOC DB - for the Atlas SRM problems this morning.

Service Scheduled? Outage/At Risk Start End Duration Reason
srm-atlas UNSCHEDULED OUTAGE 26/06/2013 10:40 26/06/2013 11:50 2 hours Problem with Database Behind Atlas SRM.
Whole Site SCHEDULED WARNING 25/06/2013 10:00 25/06/2013 11:20 1 hour and 20 minutes At Risk during UPS/Generator load test.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
95104 Green Less Urgent On Hold 2013-06-26 2013-06-26 CMS glidein Hammer Cloud problem at T1_UK_RAL
91658 Red Less Urgent On Hold 2013-02-20 2013-06-19 LFC webdav support
86152 Red Less Urgent On Hold 2012-09-17 2013-06-17 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
19/06/13 100 100 99.1 100 100 Single failure on SRM 'GET'. Couldn't contact disk server.
20/06/13 100 100 100 95.9 100 Single failure of SRM PUT test. (Timeout.)
21/06/13 100 100 100 100 100
22/06/13 100 100 85.9 100 100 Problem with database behind the Atlas SRM. (Started late evening).
23/06/13 100 100 97.7 100 100 Tail end of above problem.
24/06/13 100 100 100 100 100
25/06/13 100 100 99.2 100 100 Single SRM test failure - failed to delete a file.
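As a rough guide to reading the availability column, and assuming the daily figure is simply the fraction of the 24-hour day recorded in an OK state (an assumption, not the official availability algorithm), each deduction corresponds to a short period of recorded unavailability, consistent with the single test failures noted in the comments.

    # Rough illustration only: convert a daily availability percentage into the
    # implied number of minutes recorded as unavailable, assuming availability
    # is the OK fraction of a 24-hour day (not the official algorithm).
    MINUTES_PER_DAY = 24 * 60

    for availability in (99.1, 95.9, 85.9, 97.7, 99.2):
        down_minutes = (100.0 - availability) / 100.0 * MINUTES_PER_DAY
        print("%5.1f%% availability ~ %4.0f minutes unavailable"
              % (availability, down_minutes))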