Tier1 Operations Report 2013-04-03

From GridPP Wiki

RAL Tier1 Operations Report for 3rd April 2013

Review of Issues during the fortnight 20th March to 3rd April 2013.
  • On Thursday morning, 21st March, from around 08:40 to 09:00 there was a networking problem that caused some transitory problems for the Tier1. The effect was seen as a single SUM test failure for Atlas.
  • Overnight Monday/Tuesday 25/26 March there was a failure of one of the three top BDII nodes. It was removed from the alias the following morning.
  • On Tuesday 26th March high CPU usage on the site firewall caused intermittent network problems affecting the Tier1 from ~08:00 to ~09:45. Services again experienced transitory failures, including FTS failures and SAM test failures.
  • On Thursday afternoon (28th March), around 16:00, an operational error led to a networking break that caused some transitory problems for the Tier1.
  • Services ran well over the Easter weekend. There were a couple of problems, none of which affected front-line services, although one of the perfSONAR network monitoring systems did go down.
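The top-BDII incident above amounts to taking the failed node out of the DNS round-robin alias so that clients only reach healthy instances. As a hypothetical illustration (the node names and the health check below are assumptions, not taken from the report), the decision reduces to filtering the alias membership:

```python
def healthy_members(members, is_healthy):
    """Return the alias members that pass the health check; the DNS
    alias would then be repointed at only these nodes."""
    return [m for m in members if is_healthy(m)]

# Assumed node names -- the report does not identify the failed host.
members = ["top-bdii01", "top-bdii02", "top-bdii03"]
failed = {"top-bdii02"}
print(healthy_members(members, lambda m: m not in failed))
# -> ['top-bdii01', 'top-bdii03']
```

In practice the change itself is a DNS update; the point is only that the service continues on the remaining nodes while the failed one is repaired.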
Resolved Disk Server Issues
  • GDSS446 (AtlasDataDisk D1T0) was taken out of service after reporting FSProbe errors yesterday evening (2nd April). A disk drive in the server has been replaced and it was returned to service around 12:25 today.
Current operational status and issues
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • There have been intermittent problems over the past few weeks with the start rate for batch jobs. A script has been introduced to regularly check for this and take action to minimise its effects.
  • The problem of LHCb and Atlas jobs failing due to long job set-up times remains. A different version of CVMFS has been installed as a test and investigations continue.
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • We are participating in xrootd federated access tests for Atlas.
  • A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
  • There is an outstanding problem (and GGUS ticket) affecting the certificate on the MyProxy server.
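The start-rate check mentioned in the batch-job item above is not described in the report; a minimal sketch of the kind of test such a script might apply (the 30-minute window and threshold are assumptions) is:

```python
from datetime import datetime, timedelta

def starts_in_window(start_times, now, window=timedelta(minutes=30)):
    """Count jobs whose recorded start time falls inside the window."""
    cutoff = now - window
    return sum(1 for t in start_times if t >= cutoff)

def start_rate_ok(start_times, queued, now,
                  window=timedelta(minutes=30), min_starts=1):
    """A stalled start rate means jobs are queued but too few have
    started recently; in that case the real script would take its
    corrective action (e.g. restarting the affected batch component)."""
    if queued == 0:
        return True  # nothing waiting, so no start rate to measure
    return starts_in_window(start_times, now, window) >= min_starts

now = datetime(2013, 4, 3, 12, 0)
starts = [now - timedelta(minutes=5), now - timedelta(hours=2)]
print(start_rate_ok(starts, queued=50, now=now))   # -> True
```

The real script presumably reads these figures from the batch system; this sketch only shows the shape of the "queued but not starting" test.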
Ongoing Disk Server Issues
  • None
Notable Changes made this last fortnight
  • Updating disk controller firmware on Clustervision '11 batch of disk servers ongoing.
  • Kernel/errata updates and removal of AFS software (as opposed to just disabling) being done across worker nodes.
  • Removal of EMI-1 WMS systems (WMS01,02,03). (Disabled in GOC DB).
Declared in the GOC DB
  • This evening (Wed 3rd April 18:00 - 23:59 BST) emergency maintenance in Geneva affecting both the main and backup links to CERN. No outage is expected during this maintenance; services are considered at risk only.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Next Tuesday (9th April) morning a networking update will cause a couple of short breaks in external connectivity as switches are rebooted.
  • One of the disk arrays hosting the LFC/FTS/3D databases has given some errors and an intervention will be necessary requiring a stop of these services. We are checking how long this will take ahead of making an announcement.
  • A program of updating the disk controller firmware in the 2011 Clustervision batch of disk servers is ongoing.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services
    • Upgrade of other EMI-1 components (APEL, UI) under investigation.
  • Infrastructure:
    • Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
    • Electrical safety check (will require some downtime).


Entries in GOC DB starting between 20th March and 3rd April 2013.

There were no unscheduled entries in the GOC DB for the last fortnight.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgwms01, lcgwms02, lcgwms03 | SCHEDULED | OUTAGE | 22-03-2013 11:00 | 15-04-2013 12:00 | 24 days | EMI-1 WMS service retirement
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
92266 | Amber | Less Urgent | In Progress | 2013-03-06 | 2013-03-28 | | Certificate for RAL myproxy server
91974 | Red | Urgent | In Progress | 2013-03-04 | 2013-04-03 | | NAGIOS *eu.egi.sec.EMI-1* failed on lcgwms01.gridpp.rl.ac.uk@RAL-LCG2
91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-04-03 | | LFC webdav support
91029 | Red | Very Urgent | On Hold | 2013-01-30 | 2013-02-27 | Atlas | FTS problem in querying jobs
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-03-19 | | correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
20/03/13 100 100 100 100 100
21/03/13 100 100 98.7 100 100 Single SRM test failure at time of network problem.
22/03/13 100 100 100 100 100
23/03/13 100 100 100 100 100
24/03/13 100 100 100 100 100
25/03/13 100 100 100 100 100
26/03/13 100 100 94.2 100 100 Network problem triggered by site firewall overload.
27/03/13 100 100 100 100 100
28/03/13 100 100 100 100 100
29/03/13 100 100 100 95.9 100 Single SRM Put failure (user timeout).
30/03/13 100 100 100 100 100
31/03/13 100 100 98.1 100 100 Two consecutive failures of SRM Put test. "could not open connection to srm-atlas.gridpp.rl.ac.uk"
01/04/13 100 100 100 100 100
02/04/13 100 100 99.1 100 100 Single SUM test failure of SRM delete.