Tier1 Operations Report 2013-04-17

From GridPP Wiki

RAL Tier1 Operations Report for 17th April 2013

Review of Issues during the week 10th to 17th April 2013.
  • On Thursday (11th) two interventions were made on the Tier1 network to fix links with high error rates. One was traced to a badly seated cable; the other appears to be a faulty fibre cable. This latter cable provides the link to some of the newly installed equipment (including recently deployed Atlas & CMS disk servers). There was a short (10 minute) break in connectivity to these servers while the faulty cable was bypassed.
  • On Thursday (11th) there was a planned intervention on the database behind the FTS & LFC services. These services were stopped for around an hour.
  • On Friday (12th) a configuration error caused an update to the sudoers file that was incompatible with the CEs. For around an hour, until the problem was fixed, we were not able to start batch jobs.
  • On Friday (12th) afternoon there was a problem with the CRLs on a specific batch worker node.
  • On Tuesday (16th) there was a short (15-20 minute) stop of the Atlas Software server, during which we stopped Atlas batch jobs from starting. This node is a 'twin' of another node (one of the BDIIs) that had shown problems and required investigation.
Resolved Disk Server Issues
  • None
Current operational status and issues
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • The problem of LHCb and Atlas jobs failing due to long job set-up times remains, and investigations continue.
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • We are participating in xrootd federated access tests for Atlas.
  • A test batch queue, with five SL6/EMI-2 worker nodes and its own CE, is in place.
  • There is an outstanding problem (and GGUS ticket) affecting the certificate on the MyProxy server.
Ongoing Disk Server Issues
  • GDSS371 (AtlasTape - D0T1) failed yesterday evening (16th April). Six files awaiting migration to tape are unavailable. The server is expected back by the end of this afternoon.
Notable Changes made this last week
  • On Thursday (11th) the 'Somnus' (LFC & FTS) database was 'rebalanced'. During this, the data which had been spread across two disk arrays was consolidated onto one of them. The other array has been reporting some errors, and this change paves the way for an intervention on the faulty array.
  • Kernel/errata updates and removal of AFS software (as opposed to just disabling) completed across worker nodes.
  • The second of the two batches of new worker nodes (the one from Viglen) has been deployed into production. All of the 2012 CPU purchase is now in service and the batch farm currently has over 10,000 job slots.
  • A dozen batch farm nodes have been reserved for CMS disk/tape separation testing.
  • This morning (17th April) the batch server was upgraded to UMD-2. (Note that this does not alter the versions of torque/maui running on the server.)
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • A program of updating the disk controller firmware in the 2011 Clustervision batch of disk servers is ongoing. (Alice disk servers in this batch remain to be done).

Listing by category:

  • Databases:
    • Apply latest Oracle 'PSU' patches.
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services
    • Testing of alternative batch systems (e.g. SLURM).
    • Upgrade of other EMI-1 components (APEL, UI) under investigation.
  • Fabric
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
  • Infrastructure:
    • Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
    • Electrical safety check. This will require significant (maybe 2 days) downtime.
Entries in GOC DB starting between 10th and 17th April 2013.

There were no unscheduled outages during the last week.

Service Scheduled? Outage/At Risk Start End Duration Reason
lcgfts.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk SCHEDULED OUTAGE 11/04/2013 10:00 11/04/2013 11:05 1 hour and 5 minutes Outage of LFC and FTS services. The Oracle database behind these services uses two disk arrays. One of the arrays is reporting errors and the database will be reconfigured (rebalanced) to move the data off the faulty array. FTS transfers will be drained before the outage.
Whole site SCHEDULED WARNING 10/04/2013 18:00 11/04/2013 00:00 6 hours Emergency maintenance was announced for both the main and backup OPN links between RAL and CERN.
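The Duration column in the entries above can be cross-checked directly from the Start and End timestamps. A minimal sketch (timestamp format assumed to match the GOC DB listing, DD/MM/YYYY HH:MM):

```python
from datetime import datetime, timedelta

# Format of the Start/End timestamps as they appear in the GOC DB entries above.
fmt = "%d/%m/%Y %H:%M"

def duration(start: str, end: str) -> timedelta:
    """Elapsed time between two GOC DB timestamps."""
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)

print(duration("11/04/2013 10:00", "11/04/2013 11:05"))  # 1:05:00 -> "1 hour and 5 minutes"
print(duration("10/04/2013 18:00", "11/04/2013 00:00"))  # 6:00:00 -> "6 hours"
```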
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
93315 Green Urgent Waiting Reply 2013-04-13 2013-04-15 Atlas "Checksum mismatch" error at site RAL-LCG2
93149 Red Less Urgent On Hold 2013-04-05 2013-04-08 Atlas RAL-LCG2: jobs failing with "cmtside command was timed out"
93136 Red Less Urgent In Progress 2013-04-05 2013-04-15 EPIC Problems downloading job output using RAL WMS (epic VO)
92266 Red Less Urgent In Progress 2013-03-06 2013-04-16 Certificate for RAL myproxy server
91658 Red Less Urgent On Hold 2013-02-20 2013-04-03 LFC webdav support
91029 Red Very Urgent On Hold 2013-01-30 2013-02-27 Atlas FTS problem in querying jobs
86152 Red Less Urgent On Hold 2012-09-17 2013-03-19 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
10/04/13 100 100 100 100 100
11/04/13 100 100 100 95.9 100 Single SRM test failure "user timeout"
12/04/13 96.2 91.3 93.9 100 95.5 Failed CE tests after a configuration error led to a problem with sudo.
13/04/13 100 100 100 95.9 100 Single SRM test failure "user timeout"
14/04/13 100 100 100 95.9 100 Single SRM test failure "user timeout"
15/04/13 100 100 100 100 100
16/04/13 100 100 100 95.9 100 Single SRM test failure "user timeout"
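The repeated 95.9% CMS figures are consistent with losing roughly one hour of availability to a single failed SRM test. A hedged sketch, assuming one test per hour and a failed test costing a full hour (the real SAM availability calculation differs in detail):

```python
# Hedged sketch: relating one failed hourly test to the daily availability
# percentages in the table. Assumes 24 hourly tests per day and that each
# failure costs a full hour; the actual SAM algorithm is more involved.
def daily_availability(failed_hours: int, hours_in_day: int = 24) -> float:
    """Percentage of the day counted as available, rounded to one decimal."""
    return round(100.0 * (hours_in_day - failed_hours) / hours_in_day, 1)

print(daily_availability(0))  # 100.0 -- a clean day
print(daily_availability(1))  # 95.8 -- close to the 95.9 reported for a single SRM test failure
```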