RAL Tier1 Operations Report for 10th July 2013

Review of Issues during the week 3rd to 10th July 2013.
  • There have been some ongoing CVMFS problems - notably affecting CMS during this last week. Work is ongoing to investigate this, including upgrading to a version of CVMFS (2.1.12) that better handles fail-over.
  • On Friday (5th July) a problem arose with the Atlas Castor instance during the afternoon. The instance was declared down (an outage in the GOC DB) for 2.5 hours and At Risk for the rest of the weekend. This was traced to a recurrence of the problem seen a week ago, caused by a bug that is fixed in version 2.1.13.
  • The problem reported two weeks ago, in which the RAL site firewall's logging of a large number of connection requests caused high load, has also been seen elsewhere by ALICE. Although the issue is not yet understood and fixed, a workaround is in place. We have raised the maximum number of ALICE jobs from 500 to 1000.
  • We have had a number of occurrences of connections to the batch server failing; this also manifests as the pbs_server process using a lot of memory. We trap and call out on this memory problem (a sketch of such a check is given below), but only this last week have we seen connection problems linked to it. The problem was compounded over the weekend by our Nagios server not running this memory check for a long period (from Saturday evening until Monday). The cause of this is not understood: other Nagios tests were run, but somehow this one was not scheduled. A restart of the Nagios process on Monday resolved the issue. We failed some CE SUM tests over the weekend as a result of this problem.
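As an illustration of the memory check mentioned above, the following is a minimal sketch of a Nagios-style plugin that raises a warning or critical alert when the resident memory of the pbs_server process crosses a threshold. The thresholds, script layout and use of Python are assumptions for illustration; this is not the production check.

    #!/usr/bin/env python
    # Illustrative sketch only: report OK/WARNING/CRITICAL (Nagios exit-code
    # convention) based on the resident memory of the pbs_server process.
    # Thresholds below are assumed values, not the production settings.
    import subprocess
    import sys

    WARN_MB = 4096   # assumed warning threshold (MB)
    CRIT_MB = 8192   # assumed critical threshold (MB)

    def pbs_server_rss_mb():
        """Total resident set size (MB) of all pbs_server processes."""
        out = subprocess.check_output(["ps", "-C", "pbs_server", "-o", "rss="])
        return sum(int(v) for v in out.decode().split()) / 1024.0

    def main():
        try:
            rss = pbs_server_rss_mb()
        except subprocess.CalledProcessError:
            print("UNKNOWN: pbs_server process not found")
            sys.exit(3)
        if rss >= CRIT_MB:
            print("CRITICAL: pbs_server RSS %.0f MB" % rss)
            sys.exit(2)
        if rss >= WARN_MB:
            print("WARNING: pbs_server RSS %.0f MB" % rss)
            sys.exit(1)
        print("OK: pbs_server RSS %.0f MB" % rss)
        sys.exit(0)

    if __name__ == "__main__":
        main()

A check of this kind only helps if Nagios actually schedules it; as noted above, the check was not run between Saturday evening and Monday, which is why the memory growth and the resulting connection failures went unnoticed over the weekend.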
Resolved Disk Server Issues
  • On Thursday (4th July) GDSS662 (AtlasDataDisk) became unresponsive and was taken out of production for investigation. Following hardware checks it was returned to service the following day, initially in 'passive draining' mode. This was changed to full production later that afternoon as there was a suggestion that servers in passive draining were implicated in exposing the bug seen in version 2.1.12.
Current operational status and issues
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • The problem of LHCb jobs failing due to long job set-up times remains and investigations continue. Recent updates to the CVMFS clients have improved the situation for Atlas.
  • The testing of FTS3 is continuing and the service is being put on a more 'production' footing. (This runs in parallel with our existing FTS2 service).
  • We are participating in xrootd federated access tests for Atlas.
  • Testing is ongoing with the proposed new batch system (ARC-CEs, Condor, SL6). Atlas and CMS are running work through this; ALICE & LHCb are being brought on board with the testing.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • The Atlas Castor instance (stager) is being upgraded to version 2.1.13-9 today.
  • Some of the farm nodes have been upgraded to CVMFS version 2.1.12, which resolves a fail-over problem seen in 2.1.11.
  • Some Atlas Tier2s (Manchester, ECDF, RAL PPD) have moved transfers to use FTS3 (an illustrative submission sketch is given below).
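As background to the FTS3 items above, the following is a minimal sketch of submitting a single transfer through an FTS3 endpoint by wrapping the fts-transfer-submit command-line client. The endpoint URL and SURLs are hypothetical placeholders, not the actual RAL or Tier2 configuration.

    #!/usr/bin/env python
    # Illustrative sketch only: submit one file transfer to an FTS3 endpoint by
    # wrapping the fts-transfer-submit client. The endpoint and SURLs below are
    # hypothetical placeholders.
    import subprocess

    FTS3_ENDPOINT = "https://fts3.example.ac.uk:8446"                     # hypothetical
    SOURCE = "srm://source-se.example.ac.uk/dpm/example/file.root"        # hypothetical
    DESTINATION = "srm://dest-se.example.ac.uk/castor/example/file.root"  # hypothetical

    def submit_transfer(src, dst):
        """Submit one transfer and return the job ID printed by the client."""
        out = subprocess.check_output(
            ["fts-transfer-submit", "-s", FTS3_ENDPOINT, src, dst])
        return out.decode().strip()

    if __name__ == "__main__":
        print("Submitted FTS3 job %s" % submit_transfer(SOURCE, DESTINATION))

The returned job ID can then be followed with the matching status client (for example fts-transfer-status) against the same endpoint; this sketch deliberately ignores the retry, checksum and credential options a real submission would use.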
Declared in the GOC DB
  • None.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • The upgrade of the Atlas Castor stager is taking place today. We plan to update the remaining stagers on the following dates: CMS & GEN: Tuesday 23rd July; LHCb: Tuesday 30th July.
  • Re-establishing the paired (2*10Gbit) link to the UKLight router.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13 (ongoing)
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Grid Services:
    • Testing of alternative batch systems (SLURM, Condor) along with ARC-CEs and SL6 Worker Nodes.
  • Fabric:
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
  • Infrastructure:
    • A 2-day maintenance is being planned sometime in October or November for the following. This is expected to require around a half-day outage of power to the UPS room, with Castor & batch down for the remaining 1.5 days as equipment is switched off in rotation for the tests.
      • Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
      • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
      • Electrical safety check. This will require significant (most likely 2 days) downtime during which time the above infrastructure issues will also be addressed.
Entries in GOC DB starting between 3rd and 10th July 2013.

There were three unscheduled entries in the GOC DB. Two were for the problem with the Atlas Castor instance on Friday (5th July) - for the outage and then a 'warning' over the weekend. The other was for an outage of the (test) ARC-CEs.

Service Scheduled? Outage/At Risk Start End Duration Reason
srm-atlas SCHEDULED OUTAGE 10/07/2013 09:00 10/07/2013 14:00 5 hours Upgrade of Atlas Castor Stager to version 2.1.13-9.
arc-ce01, arc-ce02 UNSCHEDULED OUTAGE 09/07/2013 12:00 09/07/2013 15:15 3 hours and 15 minutes Problem on Hypervisor hosting these ARC-CEs.
srm-atlas UNSCHEDULED WARNING 05/07/2013 18:30 08/07/2013 09:30 2 days, 15 hours Following problems with Atlas Castor instance
srm-atlas UNSCHEDULED OUTAGE 05/07/2013 15:45 05/07/2013 18:17 2 hours and 32 minutes Problem with Atlas Castor instance. Investigations ongoing.
All Castor & batch SCHEDULED OUTAGE 03/07/2013 09:00 03/07/2013 12:30 3 hours and 30 minutes Castor nameserver Upgrade to version 2.1.13-9. Castor and batch services unavailable.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
95435 Yellow Urgent In Progress 2013-07-04 2013-07-04 LHCb CVMFS problem at RAL-LCG2
91658 Red Less Urgent In Progress 2013-02-20 2013-07-02 LFC webdav support
86152 Red Less Urgent On Hold 2012-09-17 2013-06-17 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment (availability figures are percentages)
03/07/13 85.4 80.9 76.5 82.8 85.4 Scheduled Outage for Castor nameserver 2.1.13 upgrade. Also failed a few Atlas SRM tests (timeouts) and a few CE tests (CVMFS on particular nodes).
04/07/13 100 100 92.9 100 100 Mainly losses from CE tests failing because CVMFS was not functional on some nodes.
05/07/13 100 100 92.5 100 97.2 Atlas: Problem with Castor instance during afternoon. LHCb: Problem connecting to batch server
06/07/13 100 89.6 100 100 95.8 Problem connecting to batch server
07/07/13 100 89.8 94.1 100 84.3 Problem connecting to batch server
08/07/13 100 100 100 100 100
09/07/13 100 100 99.2 100 100 Single SRM test failure (timeout).