Tier1 Operations Report 2013-04-24

From GridPP Wiki

RAL Tier1 Operations Report for 24th April 2013

Review of Issues during the week 17th to 24th April 2013.
  • None.
Resolved Disk Server Issues
  • GDSS371 (AtlasTape - D0T1) failed during the evening of Tuesday 16th April. It was returned to production during the following afternoon.
Current operational status and issues
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • The problem of LHCb and Atlas jobs failing due to long job set-up times remains and investigations continue.
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • We are participating in xrootd federated access tests for Atlas.
  • A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
  • There is an outstanding problem (and GGUS ticket) affecting the certificate on the MyProxy server.
Ongoing Disk Server Issues
  • None.
Notable Changes made this last week
  • On Wednesday (17th April) a change was made to the way the batch scheduler fills job slots - as part of ongoing investigations into job set-up failures.
  • This morning (24th April) Oracle PSU patches were applied to the databases behind LFC, FTS & Atlas 3D and to the Castor standby databases.
  • On Thursday (18th April) the first three disk servers from the second batch of 2012 orders were put into production in AtlasDataDisk.
  • This morning a new top BDII node was added into the alias. This replaced a failed server and means there are again three nodes behind the Top-BDII alias.
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Oracle patches will be applied to the main Castor databases during an At Risk next Wednesday (1st May).
  • A program of updating the disk controller firmware in the 2011 Clustervision batch of disk servers is ongoing. (Alice disk servers in this batch remain to be done).

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Testing of alternative batch systems (e.g. SLURM).
    • Upgrade of other EMI-1 components (UI) under investigation.
  • Fabric:
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
  • Infrastructure:
    • Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
    • Electrical safety check. This will require significant (maybe 2 days) downtime.
Entries in GOC DB starting between 17th and 24th April 2013.

There were no unscheduled outages during the last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgfts.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk, | SCHEDULED | WARNING | 24/04/2013 09:00 | 24/04/2013 13:00 | 4 hours | Warning during application of Oracle patches to back-end databases behind FTS, LFC and Atlas 3D systems.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
93149 | Red | Less Urgent | On Hold | 2013-04-05 | 2013-04-08 | Atlas | RAL-LCG2: jobs failing with "cmtside command was timed out"
92266 | Red | Less Urgent | Waiting for Reply | 2013-03-06 | 2013-04-16 | | Certificate for RAL myproxy server
91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-04-03 | | LFC webdav support
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-03-19 | | correlated packet-loss on perfsonar host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
17/04/13 | 100 | 100 | 100 | 95.9 | 100 | Single SRM test failure "user timeout"
18/04/13 | 100 | 100 | 100 | 95.9 | 100 | Single SRM test failure "user timeout"
19/04/13 | 100 | 100 | 99.1 | 100 | 100 | Single SRM test failure "user timeout"
20/04/13 | 100 | 100 | 99.1 | 100 | 100 | Single SRM test failure "user timeout"
21/04/13 | 100 | 100 | 100 | 68.3 | 100 | Problem with CMS's monitoring
22/04/13 | 100 | 100 | 100 | 95.9 | 100 | Single SRM test failure "user timeout"
23/04/13 | 100 | 100 | 99.2 | 100 | 100 | Single SRM test failure "user timeout"