Tier1 Operations Report 2013-03-20

RAL Tier1 Operations Report for 20th March 2013

Review of Issues during the week 13th to 20th March 2013.
  • On Friday afternoon (15th Mar), at approximately 15:40, there was a problem on our Tier1 network that lasted about 6 minutes. This caused a spike in FTS transfer failures as well as some SUM test failures. In general, services continued OK, but staff checked round for possible problems.
  • On Tuesday (19th Mar) at around midday there was a problem on the site network that lasted around 15 minutes. Tier1 services carried on running although there were some test failures at this time.
Resolved Disk Server Issues
  • GDSS519 (GenTape D0T1) was put into a draining mode following discovery of a single corrupt file last Wed morning (13th Mar). The server was checked out, confirming only one file was bad and revealing a faulty disk drive that was replaced. The server was returned to production later that day.
Current operational status and issues
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • There have been intermittent problems over the past few weeks with the start rate for batch jobs. A script has been introduced to regularly check for this and take action to minimise its effects.
  • The problem of LHCb and Atlas jobs failing due to long job set-up times remains. (The change to run jobs re-niced has not resolved the problem.) Investigations continue.
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • We are participating in xrootd federated access tests for Atlas.
  • A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
  • The change to the certificate used by the MyProxy server, announced for Monday 18th Mar, had to be backed out. An alternative solution to the MyProxy certificate problem reported in GGUS#92266 is being worked on.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • An analysis of data rates shows that the intervention on Tuesday 12th Mar (replacing the core C300 switch and modifying the link to the UKLight router) has resolved the problem of asymmetric data rates in/out of the Tier1.
  • The APEL publisher was upgraded from UMD-1 to UMD-2 last Thursday (14th Mar).
  • The Castor client software has been upgraded to version 2.1.13 on all worker nodes.
  • The update of the disk controller firmware on the Clustervision '11 batch of disk servers is ongoing.
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • A program of updating the disk controller firmware in the 2011 Clustervision batch of disk servers is ongoing.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Upgrade of other EMI-1 components (APEL, UI) under investigation.
  • Infrastructure:
    • Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
    • Electrical safety check (will require some downtime).


Entries in GOC DB starting between 13th and 20th March 2013.

There were no unscheduled entries in the GOC DB for the last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgrbp01.gridpp.rl.ac.uk | SCHEDULED | WARNING | 18/03/2013 10:00 | 18/03/2013 11:00 | 1 hour | Warning for hour following replacement of a certificate on the MyProxy server. (Ref GGUS ticket 92266)
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
92266 | Amber | Less Urgent | In Progress | 2013-03-06 | 2013-03-19 | | Certificate for RAL myproxy server
91974 | Red | Urgent | In Progress | 2013-03-04 | 2013-03-13 | | NAGIOS *eu.egi.sec.EMI-1* failed on lcgwms01.gridpp.rl.ac.uk@RAL-LCG2
91658 | Red | Less Urgent | In Progress | 2013-02-20 | 2013-03-13 | | LFC webdav support
91146 | Red | Urgent | In Progress | 2013-02-04 | 2013-03-14 | | Atlas RAL input bandwith issues
91029 | Red | Very Urgent | On Hold | 2013-01-30 | 2013-02-27 | | Atlas FTS problem in queryin jobs
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-03-19 | | correlated packet-loss on perfsonar host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
13/03/13 | 100 | 100 | 100 | 100 | 100 |
14/03/13 | 100 | 100 | 100 | 100 | 100 |
15/03/13 | 100 | 100 | 99.2 | 91.8 | 95.8 | Problem with Tier1 Network (packet storm) around 15:40 caused some failures. Also CMS had a single "SRM timeout" earlier in the day.
16/03/13 | 100 | 100 | 100 | 100 | 100 |
17/03/13 | 100 | 100 | 99.1 | 100 | 100 | Single SRM Test failure "zero number of replicas"
18/03/13 | 100 | 100 | 100 | 100 | 100 |
19/03/13 | 100 | 100 | 96.2 | 95.9 | 100 | Test failures (SRM & CE) around midday owing to Site Network problem. In addition Atlas suffered a few other (mainly) SRM test failures earlier in the day.