Tier1 Operations Report 2013-06-12


RAL Tier1 Operations Report for 12th June 2013

Review of Issues during the fortnight 29th May to 12th June 2013.
  • Very high load was seen on some disk servers in the CMS disk pool during the first part of last week (2-4 June). The Castor team made some tuning changes and CMS reduced the load on the disk pool, which resolved the issue.
  • On Tuesday 4th June there was a load test of the UPS/Generator. The test ran into problems when a circuit breaker failed to close. Cooling was stopped for around 20 minutes, and one batch of worker nodes was manually stopped in response.
  • There was a problem with OPS test availabilities for the Site BDII on Monday/Tuesday (10/11 June) when the test ARC-CEs were added into the BDII. These CEs were subsequently removed from the BDII information, but it took some time for the tests to clear. (A sketch of how the published CE entries can be inspected follows this list.)
  • There are ongoing intermittent problems starting LHCb batch jobs.
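The BDII issue above came down to what the site BDII was publishing for the CEs. As a rough illustration (not the procedure actually used), the following Python sketch queries a site BDII over LDAP with the third-party ldap3 library and lists the GLUE 1.3 CE entries it publishes; the hostname is a placeholder, while port 2170 and the "o=grid" base are the usual BDII conventions.

    from ldap3 import ALL, Connection, Server

    # Placeholder site BDII endpoint; BDIIs conventionally listen on port 2170.
    BDII_HOST = "site-bdii.example.org"

    server = Server(f"ldap://{BDII_HOST}:2170", get_info=ALL)
    conn = Connection(server, auto_bind=True)  # anonymous bind, as is usual for a BDII

    # List the GLUE 1.3 CE objects under the standard "o=grid" base so that
    # incomplete or unexpected entries (e.g. newly added test CEs) stand out.
    conn.search(
        search_base="o=grid",
        search_filter="(objectClass=GlueCE)",
        attributes=["GlueCEUniqueID", "GlueCEStateStatus"],
    )
    for entry in conn.entries:
        print(entry.GlueCEUniqueID, entry.GlueCEStateStatus)

    conn.unbind()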
Resolved Disk Server Issues
  • GDSS713 (CMSDisk - D1T0) crashed on the morning of Thursday 30th May. It was returned to service the following morning (31st May). No hardware faults were found during testing.
Current operational status and issues
  • Following the failure of the UPS/Generator load test on 4th June we are currently running without generator backup.
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • The problem of LHCb jobs failing due to long job set-up times remains, and investigations continue. Recent updates to the CVMFS clients have improved the situation for Atlas.
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • We are participating in xrootd federated access tests for Atlas (see the sketch following this list).
  • A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
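For the xrootd federated access tests mentioned above, a minimal read-path check can be sketched as follows, assuming the pyxrootd Python bindings are installed; the redirector hostname and file path are placeholders rather than the actual Atlas federation endpoints.

    from XRootD import client

    # Placeholder federation redirector and test file (illustrative only).
    REDIRECTOR = "root://xrootd-redirector.example.org:1094"
    TEST_FILE = "/atlas/rucio/some/test/file.root"

    fs = client.FileSystem(REDIRECTOR)

    # Stat the file through the redirector: a successful stat shows the
    # federation can locate the file and the remote-read path is usable.
    status, statinfo = fs.stat(TEST_FILE)
    if status.ok:
        print(f"Found file, size {statinfo.size} bytes")
    else:
        print(f"Stat failed: {status.message}")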
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • On Thursday (6th June) the Atlas 3D database ("Ogma") was unavailable for around 90 minutes while a re-configuration of the Oracle voting disk was made.
  • Work has been ongoing to test newer versions of CVMFS in order to investigate the job set-up problems; a minimal client-side probe of the kind involved is sketched below.
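As a rough sketch of the kind of client-side check involved in comparing CVMFS versions, the repositories can be probed and timed from Python as below; the repository list is an assumption for illustration, not the actual test procedure.

    import subprocess
    import time

    # Assumed repositories of interest for the job set-up investigations.
    REPOSITORIES = ["lhcb.cern.ch", "atlas.cern.ch"]

    for repo in REPOSITORIES:
        start = time.time()
        # "cvmfs_config probe <repo>" mounts the repository (if needed) and
        # checks that it can be accessed; a non-zero return code means failure.
        result = subprocess.run(
            ["cvmfs_config", "probe", repo],
            capture_output=True,
            text=True,
        )
        elapsed = time.time() - start
        state = "OK" if result.returncode == 0 else "FAILED"
        print(f"{repo}: {state} ({elapsed:.1f}s)")
        print(result.stdout.strip())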
Declared in the GOC DB
  • None.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Following on from the failure of the UPS/Generator load test, a failed battery on a control board needs to be replaced. Once that has been done the test will be re-scheduled.
  • Re-establishing the paired (2*10Gbit) link to the UKLight router. (We are aiming to do this in the next few weeks.)
  • The problem reported last week, following the upgrade of the non-Tier1 'facilities' Castor instance to version 2.1.13, is now understood and fixed. We will continue to monitor this closely ahead of re-scheduling the upgrade of the Tier1 Castor instances.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update the core Tier1 network and change the connection to the site and the OPN, including:
      • Install a new routing layer for the Tier1.
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Grid Services
    • Testing of alternative batch systems (SLURM, Condor) along with ARC-CEs and SL6 Worker Nodes.
    • An upgrade of the one remaining EMI-1 component (the UI) is being planned.
  • Fabric
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
  • Infrastructure:
    • A two-day maintenance is being planned for sometime in October or November to cover the items below. It is expected to require a power outage of around half a day to the UPS room, with Castor and batch down for the remaining 1.5 days as equipment is switched off in rotation for the tests.
      • Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
      • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
      • An electrical safety check. This will require significant downtime (most likely two days), during which the above infrastructure issues will also be addressed.
Entries in GOC DB starting between 29th May and 12th June 2013.

There were no unscheduled entries in the GOC DB starting during the last fortnight.

Service Scheduled? Outage/At Risk Start End Duration Reason
Whole Site SCHEDULED WARNING 04/06/2013 10:00 04/06/2013 12:00 2 hours Warning (At Risk) during test of UPS generator.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
94755 Green Urgent Waiting Reply 2013-06-10 2013-06-11 Error retrieving data from lcgwms04
94731 Green Less Urgent In Progress 2013-06-07 2013-06-10 cernatschool WMS for cernatschool.org
94543 Red Less Urgent Waiting Reply 2013-06-04 2013-06-11 SNO+ Job outputs not being retrieved
91658 Red Less Urgent On Hold 2013-02-20 2013-05-29 LFC webdav support
86152 Red Less Urgent On Hold 2012-09-17 2013-03-19 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
29/05/13 100 100 100 100 100
30/05/13 100 100 100 100 100
31/05/13 100 100 100 100 100
01/06/13 100 100 100 100 100
02/06/13 100 100 99.0 100 100 Single SRM Put test failure "Zero number of replicas"
03/06/13 100 100 100 100 100
04/06/13 100 100 100 100 100
05/06/13 100 100 98.2 100 100 Two separate test failures. ("Zero number of replicas", "User timeout").
06/06/13 100 100 99.1 100 100 Single test failure to delete a file.
07/06/13 100 100 100 100 100
08/06/13 100 100 100 100 100
09/06/13 100 100 98.2 96.0 100 Atlas: Several failures to delete the test file. CMS: Single failure to get a file.
10/06/13 49.4 100 100 100 100 ARC-CEs, which are under test, were added to the BDII. However, the data they provided was incomplete, and we failed some BDII sanity checks until they were removed.
11/06/13 89.4 100 98.4 100 100 OPS test: continuation of the ARC-CE/BDII issue. Although fixed much earlier (during the previous working day), the test didn't clear until the early hours of the morning. Atlas: failures to connect to the SRM. Probably a problem elsewhere, as a few other sites (including a couple of Tier1s) saw the same error at roughly the same time.